    1. Reviewer #2 (Public review):

      Summary:

      In this work, Ganesh and colleagues use experimental data from Hi-C and from live-cell imaging to evaluate different polymer models of 3D genome organization in Drosophila based on both structural and dynamic properties. The authors consider several leading hypotheses, which are examined sequentially in increasing level of complexity - from the minimal Rouse polymer, to a model combining sequence-specific compartmentalization and loop-extrusion without extrusion blockers. They conclude that the combination of both compartmentalization and loop-extrusion gives the best agreement with the data. Their analysis also leads to concrete predictions about the processivity of cohesin loop extrusion in Drosophila, and a conclusion that the compartmental interaction strength is poised near criticality in the coil-globule phase space.

      Strengths:

      There is considerable interest in the field in understanding the mechanisms responsible for the 3D spatial organization of the genome and its dynamic movement, which has major implications for our understanding of long-range transcriptional regulation and other genome behaviors. The live-cell experimental work on which this study draws highlights the limitations of existing models to explain even the dynamic behaviors observed in the data, stimulating interest in further exploration. Therefore, this paper seeks to address an important gap in the field. The work is written in a well-organized, well-illustrated fashion. The text and figures are nicely integrated, easy to read, and explain challenging concepts with elegance and brevity in a manner that will be accessible to a broad audience.

      Weaknesses:

      The validity and utility of these conclusions are, in my view, substantially undermined by what appear to be unappreciated peculiarities of the live-cell data set that was used to constrain the model. The live-cell data come from embryos that were edited in a way that intentionally and substantively changed both the 3D genome structure and dynamics specifically at the loci which are imaged, a case which is neither explained by any of the models suggested nor acknowledged in the current work, and which is not compatible with the Hi-C data that are simultaneously used to evaluate these models. As these ignored synthetic alterations have been previously shown to be determinative of transcriptional activity, the relevance of the authors' work to transcriptional control (a prime motivation in the introduction) is unclear.

      The agreement in 3D organization, as represented in chromosome-scale contact frequency heatmaps, is substantially less impressive than the agreement seen in prior work with similar models. This discrepancy appears to be due in part to the unappreciated effects of the edits mentioned in the previous limitation, as well as inappropriate choices in the metrics used to evaluate agreement. It is also not particularly surprising that combining more models, with more free parameters, results in an improvement in the quality of fit.

      Some major results, including both theoretical works and experimental ones, are ignored, despite their relevance to the stated objective of the work. The current manuscript and analysis could be improved substantially by a consideration of these works.

      I describe these issues in more detail below.

      Major issues:

      (1) The genetic element "homie" is present in a subset of the data: The experimental data used in this analysis come from different fly lines, half of which have been edited explicitly to alter genome structure and consequent transcriptional behavior, yet the authors are trying to fit with a common model - a problem which substantially undermines the utility of the analysis.

      Specifically, the authors evaluate a series of models/simulations in turn by comparing them to Hi-C from wildtype Drosophila embryos on the chromosome scale and to 3D distances and dynamics from live-cell imaging in genetically edited embryos. The exercise fatally overlooks a critical fact (admittedly not easily noticed in the work from Bruckner et al.): the fly embryos used for nearly all their analyses contain not only fluorescent labels, but also two copies of a powerful genetic sequence, "homie", known for its ability to dramatically change the 3D organization and dynamics of the genome. Whether or not the fluorescent labels themselves used in the study further alter structure and dynamics is not entirely clear (and will require further work beyond the scope of either study), but at least these fluorescent labels aren't known to dramatically affect 3D structure and dynamics the way homie is. The critical problem is that adding or removing "homie", as shown in a collection of prior works I describe below in more detail, dramatically affects structure, dynamics, and gene expression. Whether or not the genome contains two distal cis-linked copies of homie fundamentally changes genome structure and dynamics, so to use one dataset which has this edit (the live-cell data) and one dataset which lacks it (the Hi-C data) is, in some sense, to guarantee failure of any model to match all the data.

      If the authors had chosen instead to focus exclusively on the 'no homie' genetic lines in the Bruckner data, they would have a much smaller dataset (just 2 distances), which would not cover all the length scales of interest, but it would at least be a dataset not known to be contradictory to the Hi-C. The two 'no homie' lines make much more plausible candidates for the sort of generalizable polymer dynamics these authors seek to explain, as will hopefully be made more clear by a brief review of what is known about homie. I next describe the published data that support these conclusions about how homie affects 3D genome spatial organization and dynamics:

      What is "homie" and how does it affect 3D genome distances, dynamics, and gene expression?

      The genetic element "homie" was named by James Jaynes' lab (Fujioka...Jaynes 2009) in reference to its remarkable "homing" ability - a fascinating and still poorly understood biological observation that some genetic sequences from Drosophila, when cloned on plasmids and reintegrated into the genome with p-elements, had a remarkable propensity to re-integrate near their endogenous sequence (Hama et al., 1990; Kassis, 2002; Taillebourg and Dura, 1999; Bender and Hudson, 2000; Fujioka...Jaynes 2009). By contrast, most genetic elements tend to incorporate at random across the genome in such assays (with some bias for active chromatin).

      The Jaynes lab subsequently showed that flies carrying two copies of homie, one integrated in cis, ~140 kb distal from the endogenous element, formed preferential cis contacts with one another. Indeed, if a promoter and reporter gene were included at this distal integration site, the reporter would be expressed in the pattern normally seen for the gene even-skipped. The endogenous copy of homie marks one border of a ~16 kb mini-TAD which contains the even-skipped gene (eve) and its developmental enhancers, so this functional interaction provides further evidence of physical proximity (as was also shown by 3C by Jaynes (Fujioka..., Schedl, Jaynes 2016), and later with elegant live imaging, by Jaynes and Gregor (Chen 2018)).

      Critically, if either copy of homie is deleted or substantially mutated, the 3D proximity is lost (Fujioka 2016, Chen 2018, Bruckner 2023), and the expression of the transgene is dramatically reduced (at 58 kb) or lost. Given the authors' motivation of understanding "E-P" interactions, the fact that the increased 3D proximity provided by homie is as essential for transcription as the promoter itself at the ~150 kb distance underscores that these are not negligible changes.

      These effects can be seen by plotting the data from Bruckner 2023, which includes data from labels with separations of 58 kb and ~150 kb "no homie" as well as homie. Unfortunately, the authors don't plot this data in the manuscript in the comparison of 3D distances, though the two-point MSD can be seen in Figure S13C, and laudably, the data is made public in a well-annotated repository on Zenodo, noted in the study. Note that the distance data in Figure S13 were filtered to exclude the transcriptionally off state, and are thus not the quantity the current authors are interested in. If they plot the published data for no homie, they will see the clear effect on the average 3D distance, R(s), and a somewhat stronger effect on the contact frequency P(s), which causes significant deviation from the trend-line followed by the homie-containing data.

      (2) The agreement between the "best performing" simulations for all models and the Hi-C data is not on par with prior studies using similar approaches, apparently due to some erroneous choices in how the optimization is carried out:

      Hi-C-comparison

      The 'best fit' simulation Hi-C looks strikingly different from the biological data in all comparisons, with clearly lower agreement than other authors have shown using highly similar methods (e.g., Shi and Thirumalai 2023; Di Pierro et al. 2017; Nuebler et al. 2018; Esposito et al. 2022; Conte et al. 2022), among many others. I believe this results from a few issues with how the current authors select and evaluate the data in their work:

      (a) Most works have used Pearson's correlation rather than Spearman's correlation when comparing simulation and Hi-C contact frequencies. Pearson's correlation is more appropriate when we expect the values to be linearly related, which they should be in this case, as both are constructed to measure the same thing (contact frequency), just derived from two different methods. Spearman's correlation would have been justifiable for comparing how transcription output correlates with contact frequency. Using Pearson's correlation may fix the bafflingly low correlations reported at lower adhesion values in Figure S2C.

      (b) Choice of adhesion strengths - The Hi-C map comparison in Figure 3 strongly suggests that a much more striking visual agreement would have been achieved if much weaker (but still non-zero) homotypic monomer affinity had been selected. In the authors' simulation, the monomer state (A/B identity) strongly dominates polymer position, resulting in the visual appearance of an almost black-and-white checkerboard. The data, meanwhile, look like a weak checkerboard superimposed on the polymer.

      (c) A further confounding problem is the aforementioned issue that the Hi-C data don't come from the edited lines, and that the interaction of the two homie sites is vastly stronger than the compartment interactions of this region of the genome.
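
      The distinction drawn in point (a) above can be illustrated with a minimal sketch. The contact-frequency values below are hypothetical and are not taken from the manuscript or from any Hi-C data set, and the helper functions are written out in plain Python purely for illustration. A simulated map that is proportional to the Hi-C values and one that is only monotonically related to them both receive a Spearman coefficient of 1, whereas only the proportional one receives a Pearson coefficient of 1.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: measures linear agreement between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical contact frequencies for five locus pairs (illustration only).
hic = [0.50, 0.20, 0.10, 0.05, 0.02]
sim_linear = [2.0 * v for v in hic]       # proportional to the Hi-C values
sim_monotone = [v ** 0.25 for v in hic]   # same ordering, distorted magnitudes

print("linear:   pearson", pearson(hic, sim_linear), "spearman", spearman(hic, sim_linear))
print("monotone: pearson", pearson(hic, sim_monotone), "spearman", spearman(hic, sim_monotone))
```

      In this toy case the rank-based measure cannot tell the two simulated maps apart, which is one reason a linear measure may be the more informative choice when both quantities are meant to be the same observable.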

      (3) Some important concepts from the field are ignored:

      The crumpled/fractal globule model is widely discussed in the literature (including the work containing the data used in this study) - its exclusion from this analysis thus appears as a substantial gap/oversight:

      A natural alternative to the much-discussed Rouse polymer model is the "crumpled polymer" (Grosberg et al. 1988; Grosberg 2016; Halverson et al. 2011; Halverson et al. 2011), also known as the "fractal globule" (Lieberman-Aiden et al. 2009; Mirny 2011; Dekker and Mirny 2016; Boettiger et al. 2016), much discussed for the way it captures the ⅓ scaling of R(s), found for much of the genome (or, equivalently, the -1 exponent of the probability of contact as a function of genome separation, P(s)). Given the 1/3rd scaling in the data, and the fact that the original authors highlighted the crumpled model in addition to the Rouse model, it seems that this comparison would be instructive and the lack of discussion an oversight. Moreover, while prior works (e.g., Bruckner, Gregor, 2023) used some traditional simplifying assumptions to estimate the MSD and relaxation time scaling of this model, I believe a more rigorous analysis with explicit simulations (as in Figure 1 for the Rouse model) would be instructive for the crumpled polymer simulations. Note the crumpled globule is not necessarily the same as the globule in the coil-globule transition discussed here - it requires some assumptions about non-entanglement to stay trapped in the meta-stable state which has the 1/3rd R(s) scaling that is indicative of this model, and not the 1/2 exhibited by equilibrium globules (for s<< length of the polymer) and dilute polymers alike.
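
      The equivalence quoted above between the 1/3 scaling of R(s) and the -1 exponent of P(s) can be sketched with a standard mean-field estimate. The assumption here, which is not stated in the review itself, is that the contact probability of two loci is inversely proportional to the volume explored by the segment between them; the symbol \nu for the size exponent is introduced only for this illustration.

```latex
\begin{align}
R(s) &\propto s^{\nu}, \qquad P(s) \propto \frac{1}{R(s)^{3}} \propto s^{-3\nu}, \\
\nu = \tfrac{1}{3} \ \text{(crumpled/fractal globule)} &\;\Rightarrow\; P(s) \propto s^{-1}, \\
\nu = \tfrac{1}{2} \ \text{(ideal chain; equilibrium globule for small } s\text{)} &\;\Rightarrow\; P(s) \propto s^{-3/2}.
\end{align}
```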

      While the fit in Figure 2 appears to get closer to the 1/3rd exponent (B= 0.32), this appears to be a largely coincidental illusion of agreement - the simulation data in truth shows a systematic deviation, returning to the 1/2 scaling for distances from 500 kb to whole chromosomes. This feature is not very evident as the authors restrict the analysis to only the few points available in the experimental data, though had they tested intervening distances I expect they would show log-log P(s) is nonlinear (non-powerlaw) for distances less than the typical loop length up to a few fold larger than the loop length, and thereafter returns to the scaling provided by the 'base' polymer behavior. This appears to be Rouse-like in these authors' model, with R(s) going like 1/2, even though the data are closer to 1/3rd, as is indeed the case for most published simulated P(s) curves based on loop extrusion - e.g., (Fudenberg et al. 2016; Nuebler et al. 2018). In this vein, it would be instructive to the readers if the authors would include additional predictions from the simulation on the plot that lie at genomic separation distances not tested in the data, to better appreciate the predictions.
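
      The concern about fitting an exponent to only a few separations can be illustrated with a toy calculation. The curve below is a deliberately simplified caricature, not the authors' simulation and not experimental data: it follows a 1/3 power law up to a hypothetical crossover at 500 kb and a 1/2 power law beyond it. The function names, the crossover value, and the sampled separations are all assumptions made for this sketch.

```python
from math import log


def loglog_slope(s_values, y_values):
    """Least-squares slope of log(y) against log(s), i.e. the apparent exponent."""
    xs = [log(s) for s in s_values]
    ys = [log(y) for y in y_values]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den


def toy_distance(s_kb, crossover_kb=500.0):
    """Toy R(s): exponent 1/3 below the crossover, 1/2 above it (continuous)."""
    if s_kb <= crossover_kb:
        return s_kb ** (1.0 / 3.0)
    return crossover_kb ** (1.0 / 3.0) * (s_kb / crossover_kb) ** 0.5


short_range = [50.0, 100.0, 200.0, 500.0]        # hypothetical separations in kb
long_range = [500.0, 1000.0, 2000.0, 5000.0]
all_points = sorted(set(short_range + long_range))

slope_short = loglog_slope(short_range, [toy_distance(s) for s in short_range])
slope_long = loglog_slope(long_range, [toy_distance(s) for s in long_range])
slope_all = loglog_slope(all_points, [toy_distance(s) for s in all_points])

print("apparent exponent, short range:", slope_short)
print("apparent exponent, long range: ", slope_long)
print("apparent exponent, all points: ", slope_all)
```

      In this caricature the apparent exponent depends entirely on which separations are sampled, which is why predictions at intervening and larger separations would help readers judge whether a fitted value near 1/3 reflects a genuine power law.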

      Minor issues

      (1) I think it is too misleading to only describe the experimental data from Bruckner as "E-P" interactions from Drosophila. It is important to note somewhere that this is not an endogenous interaction with a functional role in Drosophila - it is a synthetic interaction between enhancers in the vicinity of the eve gene and a synthetic promoter placed at a variable distance away. The uniformity is elegant - (it is the same pair of elements being studied at all distances), but also provides limited scope for generalization as suggested by the current text. Moreover, the enhancers were not directly labeled; rather, the 3D position of nascent RNA transcribed from eve was tracked with an RNA-binding protein and used as a proxy for the 3D position of the enhancers. There is not an individual enhancer at the eve locus that interacts with the transgene, but rather a collection of enhancers is distributed at different positions throughout the entire TAD, which contains eve, and must form separate loops to reach eve. Indeed, it was previously reported that differences in the local position of these enhancers, relative to eve, affect their ability to interact with the distal reporter gene and the endogenous eve gene (Chen 2018). There is also reported competition between these enhancers and the distal gene, which further complicates the analysis (especially since the state of eve and of its enhancers varies among the different cells as a function of stripe position) - see Chen 2018. All of this is ignored in the current work, despite the assertion of the application to understanding E-P interaction. A detailed discussion of these issues is not necessary, but I fear that ignoring them entirely is to invite further confusion and error.

      (2) I believe this sentence is overstated, given available data: " TAD borders are characterized by transitions between epigenetic states rather than by preferentially-bound CTCF [4, 23, 24]." Indeed, this claim has been repeatedly made in the literature as cited here. However, other data clearly demonstrate a strong enrichment of CTCF at TAD borders (and at epigenetic borders, which in Drosophila have a high correspondence with TAD borders, as the authors have already appropriately noted). See, for example, Figure 4 of Sexton Cell 2012, and compare to Figure 2 of Dixon 2012. Of minor note, CTCF peaks co-occupied by the Zinc Finger TF CP190 are more likely to be TAD borders than CTCF alone. How big a species-specific difference this is remains unclear, as it appears some mammalian CTCF-marked TAD boundaries may be co-occupied by additional ZNFs. While plenty of Drosophila TAD boundaries indeed lack CTCF, many are marked by CTCF; this is enriched relative to what would be expected by chance (or relative to the alignment of other TFs, like Twist or Eve with TAD boundaries), and it has been shown that CTCF loss is sufficient to remove a subset of these, see for example Figure 5 of (Kaushal et al. 2021) (though it is possible that most will require mutation of all the border-associated factors that collectively bind many of the borders, dCTCF, CP190, mod(mdg4) and others).

      (3) This assertion is overstated given available data: "Although TAD boundaries in Drosophila are often associated with insulator proteins [20], there is no direct evidence that these elements block LEFs in vivo. Therefore, we did not impose boundary constraints in our simulations; LEFs were allowed to move freely unless stalled by collisions with other LEFs, with the possibility of crossover.". Deletion of insulators in Drosophila that lie within a common epigenetic state leads to fusion of TADs (e.g., Mateo et al., 2019 - deletion of the CTCF-marked Fub insulator, in posterior tissues where both flanks of Fub are active; Kaushal, 2021, has examples as well). Loss of CTCF causes a small number of TADs to fuse as measured by Hi-C. This is far from 'direct evidence that insulators block LEFs' - as the authors have already noted, even the idea that cohesin extrudes loops in Drosophila in the first place is indeed controversial. However, LEF activity and stalling at insulators would provide a very natural explanation of why chromatin in a shared epigenetic state should form distinct TADs, and why these TADs should fuse upon insulator deletion. Justifying the lack of stalling sites based on empirical data is thus not very convincing to this reviewer. I believe it would be more apt to simply describe this as a simplifying assumption, rather than the above phrase, which may be misleading.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary

      In this investigation Kapustin et al. demonstrate that exposure of vascular smooth muscle cells (VSMCs) to the extracellular matrix protein fibronectin stimulates the release of small extracellular vesicles (sEVs). The authors provide experimental evidence that stimulation of the actin cytoskeleton boosts sEV secretion and posit that sEVs harbor both fibronectin and collagen IV protein themselves which also, in turn, alter cell migration parameters. It is well established that fibronectin is associated with increased cell migration and adherence; therefore, this association with VSMCs is not novel.

      The reviewer is correct that FN has been associated with migration and adherence in previous studies. However, we have extended these observations to show that the extracellular fibronectin matrix stimulates small extracellular vesicle (sEV) secretion by modulating the actin cytoskeleton. We also showed that sEVs are trapped in the extracellular matrix and that, by presenting collagen VI, they induce early focal adhesion formation, reduce excessive cellular spreading and guide cell invasion directionality through a 3D matrix. Hence, sEVs mediate cell-matrix cross talk and change cell behaviour in the context of fibronectin matrix. This is critically important for vasculature where regulated VSMC invasion is essential for repair with its deregulation leading to pathology.

      The authors purport that sEV are largely born of filopodia origin; however, this data is not well executed and seems generally at odds with the presented data.

      Our experimental data showed that CD63 MVs are associated with filopodia in fixed and live cells (Fig 2E, 2F and Video S1) and that inhibition of filopodia formation using the formin inhibitor SMIFH2 reduced sEV secretion on FN (Fig 2B). However, we agree with the reviewer that further studies are required to connect sEV secretion to filopodia. To address this we have provided further data analysis but also toned down our conclusions regarding this point. Changes include:

      (1) Title: Matrix-associated extracellular vesicles modulate smooth muscle cell adhesion and directionality by presenting collagen VI.

      (2) Results, section title: 2. FN-induced sEV secretion is modulated by Arp2/3 and formin-dependent actin cytoskeleton remodelling

      (3) Results, page 6 Line 27-44 and conclusion page 7, Ln 3 “Interestingly, CD63+ MVBs can be observed in filopodia-like structures suggesting that sEV secretion can also occur spatially via cellular protrusion-like filopodia but more studies are needed to confirm this hypothesis.”

      (4) Discussion, page 12, line 19. “Curiously we observed CD63+ MVB transport toward the filopodia tips as well as inhibition of sEV-secretion with filopodia formation inhibitors suggesting that sEV secretion can be directly linked to filopodia but further studies are needed to define the contribution of this pathway to the overall sEV secretion by cells.”

      Similarly, the effect of sEVs on parameters of cell migration has almost no magnitude of effect, making mechanism exploration somewhat nebulous.

      VSMCs are mesenchymal-type cells with a low migration rate, and we agree that the changes in motility are not of great magnitude even for the positive controls, suggesting that this is a complex, multifactorial process for VSMCs. In our experiments we collected data from >5000 individual cells to measure the average speed and found that fibronectin matrix on its own increased VSMC speed from ~0.61 μm/min to ~0.68 μm/min (~12% increase), which was statistically significant (Fig 5A). Addition of a sEV inhibitor caused a modest but significant decrease in cellular speed. Interestingly, addition of ECM-associated sEVs did not influence cell speed in 2D or 3D assays. However, in a 3D model we observed a 22% change in cell directionality (Fig 5G) and a 235% change in cell alignment index (FMI, Fig 5H), which we believe is very strong evidence that VSMC-derived sEVs are involved in the regulation of VSMC invasion directionality. These data are also in agreement with sEV effects in tumour cells (Sung et al., 2015), though this previous study did not identify the factor driving the directionality, and we think our collagen VI data significantly extend these previous observations.
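
      As a quick arithmetic check of the speed change quoted above, the relative increase can be recomputed from the two rounded speeds given in the response. The helper name is hypothetical, and because the inputs are themselves approximate, the result is only expected to agree with the quoted figure to within rounding.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, expressed in percent."""
    return 100.0 * (after - before) / before


# Rounded mean speeds quoted in the response, in micrometres per minute.
speed_without_fn = 0.61
speed_with_fn = 0.68

speed_change = percent_change(speed_without_fn, speed_with_fn)
print(f"relative speed increase: {speed_change:.1f}%")
```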

      Results, page 9: “Hence, ECM-associated sEVs have modest influence on VSMC speed but influence VSMC invasion directionality.”.  

      Lastly, the proposed mechanism of VSMCs responding to, and depositing, ECM proteins via sEVs was not rigorously executed; again, making the conclusions challenging for the reader to interpret.

      We appreciate the reviewer’s comment regarding the mechanistic aspects of VSMCs responding to and depositing ECM proteins via sEVs. In our revised manuscript, we have expanded the data demonstrating that sEVs can be retained within the extracellular matrix (see Figs 3A, 3B, S3A, S3B). Additionally, we show that collagen VI is present on the surface of sEVs, where it may modulate cell adhesion and influence the directionality of cell invasion (Fig 7E). Our results further indicate that both fibronectin (FN) and collagen VI can be recycled through multivesicular bodies (see Figs S3C, S3D, S3E–S3G). However, we acknowledge that the precise mechanisms governing the selective loading of ECM proteins onto sEVs, as well as the specific contributions of sEVs to overall ECM organization, remain to be fully elucidated and warrant further investigation. Based on our current evidence, we propose that collagen VI–loaded sEVs act primarily in a signaling capacity by modulating focal adhesion formation but are not directly involved in ECM structural remodeling.

      Results, page 7: To quantify ECM-trapped sEVs we applied a modified protocol for the sequential extraction of extracellular proteins using salt buffer (0.5M NaCl) to release sEVs which are loosely-attached to ECM via ionic interactions, followed by 4M guanidine HCl buffer (GuHCl) treatment to solubilize strongly-bound sEVs (Fig S3A) [42]. We quantified total sEV and characterised the sEV tetraspanin profile in conditioned media, and the 0.5M NaCl and GuHCl fractions using ExoView. The total particle count showed that EVs are both loosely bound and strongly trapped within the ECM. sEV tetraspanin profiling showed differences between these 3 EV populations. There was close similarity between the conditioned media and the 0.5M NaCl fraction, with high abundance of CD63+/CD81+ sEVs as well as CD63+/CD81+/CD9+ in both fractions (Fig S3A). In contrast, the GuHCl fraction was particularly enriched with CD63+ and CD63+/CD81+ sEVs with very low abundance of CD9+ EVs (Fig S3A). The abundance of CD63+/CD81+ sEVs was confirmed independently by a CD63+ bead capture assay in the media and loosely bound fractions (Fig S3B).

      Results, page 7: We previously found that the serum protein prothrombin binds to the sEV surface both in the media and MVB lumen, showing that it is recycled in sEVs and catalyses thrombogenesis on the sEV surface [43]. So we investigated whether FN can also be associated with the sEV surface, where it can be directly involved in sEV-cell cross-talk [43]. We treated serum-deprived primary human aortic VSMCs with FN-Alexa568 and found that it was endocytosed and subsequently delivered to early and late endosomes together with fetuin A, another abundant serum protein that is a recycled sEV cargo and elevated in plaques (Figs S3C and S3D). CD63 visualisation with a different fluorophore (Alexa488) confirmed FN colocalization with CD63+ MVBs (Fig S3E). Next, we stained non-serum deprived VSMC cultured in normal growth media (RPMI supplemented with 20% FBS) with an anti-FN antibody and observed colocalization of CD63 and serum-derived FN. Co-localisation was reduced, likely due to competitive bulk protein uptake by non-deprived cells (Fig S3F). Notably, when we compared FN distribution in sparsely growing VSMCs versus confluent cells we found that FN intracellular spots, as well as colocalization with CD63, completely disappeared in the confluent state (Fig S3F and S3G). This correlated with nearly complete loss of CD63+/CD81+ sEV secretion by the confluent cells indicating that confluence abrogates intracellular FN trafficking as well as sEV secretion by VSMCs (Fig S3H). Finally, FN could be co-purified with sEVs from VSMC conditioned media (Fig S3I) and detected on the surface of sEVs by flow cytometry confirming its loading and secretion via sEVs (Fig 3C).

      Results, page 10: Collagen VI was the most abundant protein in VSMC-derived sEVs (Fig 7B, Table S7) and was previously implicated in the interaction with the proteoglycan NG2 [53] and suppression of cell spreading on FN [54]. To confirm the presence of collagen VI in ECM-associated sEVs we analysed sEVs extracted from the 3D matrix using 0.5M NaCl treatment and showed that both collagen VI and FN are present (Fig 7D). Next, we analysed the distribution of collagen VI using dot-blot. Alix staining was bright only upon permeabilization of sEV indicating that it is preferentially a luminal protein (Fig 7E). On the contrary, CD63 staining was similar in both conditions showing that it is a surface protein (Fig 7E). Interestingly, collagen VI staining revealed that 40% of the protein is located on the outside surface with 60% in the sEV lumen (Fig 7E).

      Discussion, page 12. “In fact, we observed that an extensive secretion of sEVs effectively ceased protrusion activity; also VSMCs acquired a rounded morphology when “hovering” over the FN matrix decorated with sEVs (data not shown). Hence, it will be interesting in future studies to investigate whether sEVs can stimulate Rho activity by presenting adhesion modulators—particularly collagen VI—on their surface, thereby guiding cell directionality during invasion.”

      Discussion, page 14. “In summary, cooperative activation of integrin signalling and F-actin cytoskeleton pathways results in the secretion of sEVs which associate with the ECM and play a signalling role by controlling FA formation and cell-ECM crosstalk. Further studies are needed to test these mechanisms across various cell types and ECM matrices.”

      Strengths

      The authors provide a comprehensive battery of cytoskeletal experiments to test how fibronectin and sEVs impact both sEV release and vascular smooth muscle cell migratory activation.

      We appreciate this comment reflecting our efforts to apply a range of orthogonal methods to show the role of the integrin/actin cytoskeleton in ECM-stimulated sEV secretion.

      Weaknesses

      Unfortunately, this article suffers from many weaknesses. First, the rigor of the experimental approach is low, which calls into question the merit of the conclusions. In this vein, there is a lack of proper controls or inclusion of experiments addressing alternative explanations for the phenotype or lack thereof.

      We acknowledge this comment and agree that there was not sufficient evidence to conclude that sEV secretion occurs via filopodia despite the microscopy/inhibitory data, so this claim has now been excluded from the study. However, we believe that our experimental data do clearly show that FN stimulates the secretion of collagen VI-loaded sEVs which are trapped by the ECM and have the capacity to modulate VSMC adhesion and invasion directionality. To support this, we have now extended the dataset in the revised version:

      (1) In addition to the use of inhibitors and live cell analysis we have added quantitative data confirming that a large proportion of CD63+ endosomes are associated with F-actin/cortactin tails and this colocalization is increased upon the inhibition of sEV secretion with 3-OMS (Fig  2D, Fig S2B).

      (2) We developed a method to extract ECM-associated sEVs and quantified/characterized these using ExoView Assays further confirming significant sEV entrapment by the ECM (Figs 3B, S3A, S3B).    

      (3) We extended the controls to confirm FN delivery to CD63+ endosomes and showed that FN recycling is stopped upon reaching cell confluence (Figs S3F, S3G and Fig S3H).

      (4) We included more intensive characterisation of human atherosclerotic plaque morphology (H&E, Masson’s trichrome staining, Orcein, elastin fibers staining) to confirm predominant accumulation of sEV in the neointima (Figs S4A, S4B and S4C). We also excluded an endothelial origin for the  CD81+ sEVs (Fig 4G).

      (5) We included individual cellular tracks to the 2D migration analysis to confirm the statistical significance and concluded that ECM-associated sEVs regulate cell invasion directionality but not the cell speed (Figs 5A and 5B).

      (6) We showed surface localisation of collagen VI on sEVs confirming that it can activate signalling pathways leading to early FA formation on the FN matrix  (Figs 7D and 7E).

      (7) We included alternative explanations for some of our data in the discussion.      

      Reviewer #2 (Public Review):

      Extracellular vesicles have recently gained significant attention across a wide variety of fields, and they have therefore been implicated in numerous physiological and pathophysiological processes. When such a discovery and an explosion of interest occur in science, there is often much excitement and hope for answers to mechanisms that have remained elusive and poorly understood. Unfortunately, there is an equal amount of hype and overstatement that may also be put forth in the name of "impact", but this temptation must be avoided so that scientists and the broader public are not misled by overreaching interpretations and statements that lack rigorous and fully convincing evidence.

      Thank you for your comment. We agree that investigating sEVs is particularly challenging due to their heterogeneity and nano-size, as well as their complex biogenesis mechanisms. ECM-associated sEVs are a very new direction for the EV field, but one that is particularly relevant to the vasculature, where cells must invade through a thick ECM and where the accumulation of ECM-bound EVs is a unique and documented phenomenon. To further strengthen our conclusions we have included new data to support our statements and have removed statements regarding filopodia as the origin of sEVs, which are outside the scope of our study and need to be investigated further.

      The study presented by Kapustin et al. is certainly intriguing and timely, and it offers an interesting working hypothesis for the fields of extracellular vesicles and vascular biology to consider. The authors do a reasonable job at detecting these small extracellular vesicles, though some aspects of data presentation are missing such as full Western blots with accompanying size markers for the viewer to more fully appreciate that data and comparisons being made (see Figures 1 and 7).

      We agree with the reviewer and have now included molecular weight markers (Fig 1F, 7C, 7D, S3I, S4E) and provided all original western blot scans (uncropped and unedited) to the eLife editor. 

      Much of the imaging data from cell-based experiments is strong and conducted with many cutting-edge tools and approaches. That said, the static images and the dynamic imaging fall short of being fully convincing that the small extracellular vesicles found in the neighboring extracellular matrix are indeed being deposited there via the smooth muscle cell filopodia. Many of the lines of evidence presented suggest that this could occur, but alternative hypotheses also exist that were not fully ruled out, such as the ECM-deposited vesicles were secreted more from the soma and/or the lamellipodia that are also emitted and retracted from the cells. In particular, the authors show very nice dynamic imaging (Supplementary Figure S2A and Supplemental Video S1) that is interpreted as "extracellular vesicles being released from the cell" and these are seen as "bursts" of fluorescent signal; however, none of these appear to occur in filopodia as they appear within the cell proper (a "burst" of signal vs. a more intense "streak" of signal), which would be a stronger and more consistent observation predicted by the working model proposed by the authors.

      Our live and fixed cell microscopy data, as well as the inhibitor analysis, showed that sEV secretion can be associated with the filopodia. However, we agree with the reviewer that the data generated using the pHluorin GFP marker clearly indicate that the majority of sEVs are secreted from the cell soma toward the ECM.

      To reflect this, we have added further changes:

      (1) Title: Matrix-associated extracellular vesicles modulate smooth muscle cell adhesion and directionality by presenting collagen VI.

      (2) Results, section title: 2. FN-induced sEV secretion is modulated by Arp2/3 and formin-dependent actin cytoskeleton remodelling

      (3)  Results, page 6 Line 27-36 “Formins and the Arp2/3 complex play a crucial role in the formation of filopodia, a cellular protrusion required for sensing the extracellular environment and cell-ECM interactions36. To test whether MVBs can be delivered to filopodia, we stained VSMCs for Myosin-10 (Myo10)37. We observed no difference between total filopodia number per cell on plastic or FN matrices (n=18±8 and n=14±3, respectively) however the presence of endogenous CD63+ MVBs along the Myo10-positive filopodia were observed in both conditions (Fig 2E, arrows). Filopodia have been implicated in sEV capture and delivery to endocytosis “hot-spots”38, so next we examined the directionality of CD63+ MVB movement in filopodia by overexpressing Myo10-GFP and CD63-RFP in live VSMCs. Importantly, we observed anterograde MVB transport toward the filopodia tip (Fig 2F and Supplementary Video S2) indicative of MVB secretion”.

      (4) Results, page 6, Ln 37-44 “We also attempted to visualise sEV release in filopodia using CD63-pHluorin where fluorescence is only observed upon the fusion of MVBs with the plasma membrane39. Using total internal reflection fluorescence microscopy (TIRF) we observed the typical “burst”-like appearance of sEV secretion at the cell-ECM interface in full agreement with an earlier report showing MVB recruitment to invadopodia-like structures in tumor cells18 (Fig S2B and Supplementary Video S1). Although we also observed an intense CD63-pHluorin staining along filopodia-like structures we were not able to detect typical “burst”-like events to confirm sEV secretion in filopodia. (Fig S2C and Supplemental Video S1)”.

      (5) Results, page 7 Ln 3 “Interestingly, CD63+ MVBs can be observed in filopodia-like structures suggesting that sEV secretion can also occur spatially via cellular protrusion-like filopodia but more studies are needed to confirm this hypothesis.”

      (6) Discussion, page 12, line 19. “Curiously we observed CD63+ MVB transport toward the filopodia tips as well as inhibition of sEV-secretion with filopodia formation inhibitors suggesting that sEV secretion can be directly linked to filopodia but further studies are needed to define the contribution of this pathway to the overall sEV secretion by cells.”

      Imaging of related human samples is certainly a strength of the paper, and the authors are commended for attempting to connect the findings from their cell culture experiments to an important clinical scenario. However, the marker selected for marking extracellular vesicles is CD81, which has been described as present on the endothelium of atherosclerotic plaques with a proposed role in the recruitment of monocytes into diseased arteries (Rohlena et al. Cardiovasc Res 2009). More data should address this potentially confounding interpretation of the signals presented in images within Figure 4.

      We thank the reviewer for this insightful comment that the sEV marker CD81 can originate from endothelial cells, in agreement with Rohlena et al., 2009. To address this we investigated the spatial overlap between CD81 and the endothelial marker CD31. We observed very strong CD81 staining in the intact endothelial cell (intima) layer and occasional CD31-positive cells in the neointima. Importantly, quantification of colocalization confirmed that 80% of CD81 in the neointima does not overlap with CD31, excluding an endothelial origin of these sEVs (Fig 4G). Moreover, we included complete morphological characterisation of the atherosclerotic plaques, confirming that CD81 sEVs were primarily observed in the neointima where VSMCs constitute the cellular majority (Fig S4A, S4B, S4C and S4D).

      On a conceptual level, the idea that the small extracellular vesicles contain Type VI Collagen, and this element of their cargo is modulating smooth muscle cell migration, is an intriguing aspect of the authors' working model. Nevertheless, the evidence supporting this potential mechanism does not quite fit together as presented. It is not entirely clear how the collagen VI within the vesicles is somehow accessed by the smooth muscle cell filopodia during migration. Are the vesicles lysed open once on the extracellular matrix? If so, what is the proposed mechanism for that to occur? If not, how are the adhesion molecules on the smooth muscle cell surface engaging the collagen VI fibers that are contained within the vesicles? This aspect of the model does not quite fit together with the proposed mechanism and may be an interesting speculative interpretation, warranting further investigation, but it should not be considered a strong conclusion with sufficient convincing data supporting this idea.

      We thank the reviewer for their insightful comments regarding the mechanism by which collagen VI associated with sEVs could modulate smooth muscle cell adhesion and migration. To clarify, our new data suggest that collagen VI is predominantly present on the surface of the sEVs, as evidenced by Fig 7E. This surface localization strongly implies that collagen VI can be directly accessed by cell surface adhesion receptors, without the need for vesicle lysis or opening. While we cannot entirely rule out all alternative mechanisms, we consider vesicle rupture or lysis within the extracellular matrix to be a highly unlikely route for collagen VI exposure, given the known stability of sEVs under physiological conditions. We have added these points to clarify:

      (1) Results, page 10, Ln 45 “To confirm the presence of collagen VI in ECM-associated sEVs we analysed sEVs extracted from the 3D matrix using 0.5M NaCl treatment and showed that both collagen VI and FN are present (Fig 7D). Next, we analysed the distribution of collagen VI using dot-blot. Alix staining was bright only upon permeabilization of sEV indicating that it is preferentially a luminal protein (Fig 7E). On the contrary, CD63 staining was similar in both conditions showing that it is surface protein (Fig 7E). Interestingly, collagen VI staining revealed that 40% of the protein is located on the outside surface with 60% in the sEV lumen (Fig 7E).”

      (2) Discussion, page 13, Ln 2 “Hence, it will be interesting in future studies to investigate whether sEVs can stimulate Rho activity by presenting adhesion modulators, particularly collagen VI, on their surface, thereby guiding cell directionality during invasion.”

      (3) Discussion, page 14, Ln 30 “In addition to collagen VI the unique adhesion cluster in VSMC-derived sEVs also includes EGF-like repeat and discoidin I-like domain-containing protein (EDIL3), transforming growth factor-beta-induced protein ig-h3 (TGFBI) and the lectin galactoside-binding soluble 3 binding protein (LGALS3BP) and these proteins are also directly implicated in activation of integrin signalling and cellular invasiveness85-87. Although we found that collagen VI plays the key role in sEV-induced early formation of FAs in VSMCs, it is tempting to speculate that the high sEV efficacy in stimulating FA formation is driven by cooperative action of this unique adhesion complex on the sEVs surface and targeting this novel sEV-dependent mechanism of VSMC invasion may open-up new therapeutic opportunities to modulate atherosclerotic plaque development or even to prevent undesired VSMC motility in restenosis.”

      (4) Abstract Figure

      On a technical level, some of the statistical analysis is not readily understood from the data presented. It is very much appreciated that the authors show many of the graphs with technical and biological replicate values in addition to the means and standard deviations (though this is not clearly stated in all figure legends). However, in figures such as Figure 5, there are bars shown and indicated to be different by statistical comparison (see panel B in Figure 5). It is not clear how the values for Group 1 (no FN, no 3-OMS, no sEV) are statistically different (denoted by three asterisks but no p value provided in the legend) than Group 3 (no FN, 3-OMS added, no sEV), when their means and standard deviations appear almost identical. If this is an oversight, this needs to be corrected. If this is truly the outcome, further explanation is warranted. A higher level of transparency in such instances would certainly go a long way in helping address the current crisis of mistrust within the scientific community and at the interface with society at-large.

      We thank the reviewer for their careful reading and important comments on the statistical analysis. We acknowledge that the technical and biological replicate data were not clearly reported in all figure legends and that the statistical approach for Figures 5A and 5B required clarification. In response, we have made several changes for greater transparency and rigor:

      First, we have now explicitly included the numbers of biological replicates (N) and technical replicates (n) in all relevant figure legends for Figures 1–7. In addition, the number of individual cell tracks is now annotated for the migration/invasion analyses, along with the mean values for each dataset.

      Upon review, we found that the original statistical analyses for Figures 5A and 5B were conducted using pooled averaged data. To address this, we have repeated the statistical tests using pooled individual cell track data, applying the Kruskal–Wallis test with Dunn’s multiple comparison correction. This more stringent approach revealed revised p-values, which are now indicated in Figures 5A and 5B.
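      As an illustration of the kind of test named above, the following is a minimal sketch of a Kruskal–Wallis test followed by Dunn-type pairwise comparisons on pooled per-track values. It is not the analysis pipeline used for Figures 5A and 5B: the function names and the toy speed values are hypothetical, the pairwise p-values are Bonferroni-adjusted (one common form of Dunn's correction, assumed here), and no tie correction is applied.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_dunn(groups):
    """Return (H, {(i, j): adjusted_p}) for a list of samples.

    H is the Kruskal-Wallis statistic without tie correction.
    Pairwise p-values come from Dunn's z statistic and are
    Bonferroni-adjusted (an assumption of this sketch).
    """
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)

    mean_ranks, sizes, start = [], [], 0
    for g in groups:
        r = ranks[start:start + len(g)]
        mean_ranks.append(sum(r) / len(g))
        sizes.append(len(g))
        start += len(g)

    h = 12 / (n * (n + 1)) * sum(
        s * m * m for s, m in zip(sizes, mean_ranks)
    ) - 3 * (n + 1)

    pairs = list(combinations(range(len(groups)), 2))
    pvals = {}
    for i, j in pairs:
        se = math.sqrt(n * (n + 1) / 12 * (1 / sizes[i] + 1 / sizes[j]))
        z = (mean_ranks[i] - mean_ranks[j]) / se
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
        pvals[(i, j)] = min(1.0, p * len(pairs))
    return h, pvals


# Hypothetical per-track speeds for three conditions (toy numbers only).
toy_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h_stat, dunn_p = kruskal_dunn(toy_groups)
print(h_stat, dunn_p)
```

      Working on individual tracks rather than per-experiment averages increases the number of observations per group, which is why the resulting p-values can differ from those obtained on averaged data.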

      With these corrections, we reconfirm our major findings: In the 2D model, fibronectin (FN) coating promotes VSMC velocity, while inhibition of sEV secretion with 3-OMS leads to reduced cell speed (Fig. 5A). Addition of sEVs to the ECM had no effect on VSMC speed at baseline but did rescue cell speed and distance in the presence of 3-OMS, consistent with EVs acting primarily on invasion directionality rather than speed in both 2D and 3D models (Fig. 5A, 5D). Furthermore, sEVs continue to significantly impact VSMC invasion directionality (Figs. 5G, 5H), in agreement with previous reports in tumor cells (Sung et al., 2015).

      In summary, we have implemented the following revisions:

      (1) Figures 5A and 5B: Individual cell track data are now shown, and statistical analyses have been repeated using the Kruskal–Wallis test with Dunn’s multiple comparisons.

      (2) Figure legends and results sections: Numbers of biological and technical replicates, as well as individual data points, are now clearly stated.

      Results, page 9, line 14: The text has been updated to clarify the statistical approach and major findings as described above.

      We hope that these changes address the reviewer’s concerns and improve the transparency and reproducibility of our data presentation.

      Reviewer #1 (Recommendations For The Authors):

      We are very thankful for the comprehensive review and comments which helped to improve our data.

      Figure 1.

      The authors clearly show that FN stimulation (immobilized or cell-derived) promotes sEV secretion via canonical integrin pathways. FN is a promigratory substrate, hence its extensive use as a cell adhesion aid; thus one could assume that simply plating on FN induces a pro-migratory phenotype (later data supports this notion). Does the addition of growth factors also increase sEV release? An endogenous function of FN is siloing of various GFs during clot formation. Also, FAK and SRC networks intersect with canonical RTK signaling in terms of promoting Rac1, CDC42 and other migration mediators. The reason I believe this is important is because the data could be interpreted in two ways: 1) FN induces pro-migration signaling and then sEVs are released, or vice versa, 2) FN induces sEV release and migration is initiated. GF supplementation in the absence of FN would clarify this relationship.

      We thank the reviewer for this insightful comment regarding the possible role of growth factors (GFs) and the mechanistic relationship between FN stimulation, sEV secretion, and cell migration. We agree that FN is a well-established promoter of cell migration, and it is important to distinguish whether FN directly induces a pro-migratory phenotype or does so via sEV-mediated signaling.

      Our data show that FN stimulation markedly increases VSMC motility, as reflected by enhanced cell speed (Fig. 5A), an increased number of focal adhesions (Fig. 6E), and facilitated centripetal movement of FAs (Fig. 6F). Interestingly, ECM-associated sEVs appear to play a complementary but distinct role: they do not significantly affect cell migration speed (Fig. 5A) but instead guide cell invasion directionality (Figs. 5G, 5H), reduce the number of FAs per cell (Fig. 6E), and promote early peripheral FA formation (Fig. 6F). In light of these findings, we have updated our graphical abstract to reflect the unique cross-talk mediated by sEVs between VSMCs and the ECM.

      Regarding the influence of growth factors, we acknowledge that FN can bind and present different GFs, which could also contribute to changes in sEV secretion. Although our inhibition studies and integrin-blocking antibody results support a primary role for β1 integrin activation and actin assembly in triggering sEV secretion, we cannot entirely exclude the possibility that FN-bound growth factors play a role in this process. We have now incorporated this point into the discussion to address the reviewer’s suggestion.

      Discussion, page 14, Ln 7 “Although our small inhibitors and integrin modulating antibody data clearly indicate that β1 activation triggers sEV secretion via activation of actin assembly we cannot fully rule out that FN may also be modulating growth factor activity which in turn contributes to sEV secretion by VSMCs[23]. Excessive collagen and elastin matrix breakdown in atheroma has been tightly linked to acute coronary events hence it will be interesting to study the possible link between sEV secretion and plaque stability as sEV-dependent invasion is also likely to influence the necessary ECM degradation induced by invading cells[96].”

      Figure 2.

      • The authors provide no evidence (or references) that SMIFH2 or CK666 halts filopodia extensions.

      Thank you for this important note. We have included the corresponding references:

      Results, page 5: “So next we tested the contribution of Arp2/3 and formins by using the small molecule inhibitors, CK666 and SMIFH2, respectively31, 32”.  

      • Is there an increase in filopodia density when plated on FN vs plastic? Similarly, if there are more filopodia present is that associated with more sEV? Please provide evidence in this regard.

      We agree that connecting the number of filopodia with the secretion level of sEVs may be an important clue if sEV secretion can be driven by FN-induced filopodia formation. However, Myosin10 staining to quantify filopodia (Fig 2E) showed no difference between VSMCs plated on plastic versus FN matrix. Therefore, we agree with the reviewer that the filopodia contribution to sEV secretion needs to be investigated further.  This idea is reflected in the following comments:

      (1) Results, page 6, Ln 29 “We observed no difference between total filopodia number per cell on plastic or FN matrices (n=18±8 and n=14±3, respectively) however the presence of endogenous CD63+ MVBs along the Myo10-positive filopodia were observed in both conditions (Fig 2E, arrows).”

      (2) Results, page 6, Ln 37 “We also attempted to visualise sEV release in filopodia using CD63-pHluorin where fluorescence is only observed upon the fusion of MVBs with the plasma membrane39. Using total internal reflection fluorescence microscopy (TIRF) we observed the typical “burst”-like appearance of sEV secretion at the cell-ECM interface in full agreement with an earlier report showing MVB recruitment to invadopodia-like structures in tumor cells18 (Fig S2B and Supplementary Video S1). Although we also observed an intense CD63-pHluorin staining along filopodia-like structures we were not able to detect typical “burst”-like events to confirm sEV secretion in filopodia (Fig S2C and Supplemental Video S1).”

      (3) Discussion, page 12, Ln 15: “Focal complexes either disassemble or mature into the elongated centripetally located FAs48. In turn, these mature FAs anchor the ECM to actin stress fibres and the traction force generated by actomyosin-mediated contractility pulls the FAs rearward and the cell body forward12, 13. Here we report that β1 integrin activation triggers sEV release followed by sEV entrapment by the ECM. Curiously we observed CD63+ MVB transport toward the filopodia tips as well as inhibition of sEV-secretion with filopodia formation inhibitors suggesting that sEV secretion can be directly linked to filopodia but further studies are needed to define the contribution of this pathway to the overall sEV secretion by cells.”

      As hinted above, this data could be interpreted in the light of generally inhibiting cell migration to blunt sEV shedding. Does cell confluence affect sEV release? If cells are cultured to 100% confluency this would limit filopodia formation regardless of ECM type. If sEV secretion remains elevated on FN in this culture condition it would suggest a lack of dependency on filopodia.

      We thank the reviewer for this thoughtful suggestion regarding the influence of cell confluence on sEV release and filopodia formation. To directly address this hypothesis, we performed additional experiments comparing VSMCs cultured at low and high confluency. As described in the revised Results (page 7, line 39), we found that high cellular confluency reduced FN recycling, as indicated by the marked decrease in intracellular FN-positive spots and loss of colocalization with CD63 (Figs S3F, S3G). Importantly, this was accompanied by a significant reduction in CD63+/CD81+ sEV secretion by confluent cells (Fig S3H). These results suggest that VSMC confluence, which suppresses filopodia formation, nearly abolishes both intracellular FN trafficking and sEV secretion, even in the presence of FN. Thus, under our experimental conditions, sEV secretion by VSMCs appears to be closely linked to dynamic cell–matrix interactions and is dramatically reduced when these processes are limited by confluence:

      (1) Results, page 7, Ln 39: “Notably, when we compared FN distribution in sparsely growing VSMCs versus confluent cells we found that FN intracellular spots, as well as colocalization with CD63, completely disappeared in the confluent state (Fig S3F and S3G). This correlated with nearly complete loss of CD63+/CD81+ sEV secretion by the confluent cells indicating that confluence abrogates intracellular FN trafficking as well as sEV secretion by VSMCs (Fig S3H).”

      • Inhibition of branched actin polymerization has been shown to reduce both exocytic and endocytic activity. Thus, it is hard to interpret the results of Fig. 2B as anything more than a generalized effect of losing actin.

      We thank the reviewer for this important point regarding the broad cellular functions of branched actin polymerization, and agree that generalized actin loss can influence both exocytic and endocytic pathways. To address this, we performed additional experiments and analyses to better define the relationship between branched actin structures and sEV-related processes in VSMCs.

      As described in the revised Results (page 6), we overexpressed ARPC2-GFP (an Arp2/3 subunit) together with F-tractin-RFP in VSMCs and carried out live-cell imaging. This approach revealed that Arp2/3 and F-actin organize into lamellipodial scaffolds at the cell cortex, as expected (Fig. S2A; Supplementary Video S2). Additionally, and more unexpectedly, we observed numerous Arp2/3– and F-actin–positive dynamic spots within the VSMC cytoplasm. These structures resemble actin comet tails seen in other systems, previously implicated in endosomal propulsion (Fig. S2A, arrow; Supplementary Video S2).

      Quantitative analysis confirmed that a substantial fraction of these dynamic F-actin/cortactin spots colocalized with CD63+ endosomes (Fig. 2D), and that these structures are indeed branched actin tails based on cortactin immunostaining. Furthermore, inhibition of SMPD3 (with 3-OMS) induced enlarged cortactin/F-actin/CD63+ complexes, morphologically similar to invadopodia (Fig. 2D, arrowheads), supporting a functional link between actin branching and MVB dynamics.

      To quantify the association, we calculated Manders’ colocalization coefficients for F-actin tails and CD63+ endosomal structures in fixed VSMCs, observing that ~50% of F-actin tails were associated with ~13% of endosomes. Upon 3-OMS treatment, this overlap increased further (Fig. S2B).
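      For readers unfamiliar with the metric, the sketch below shows how thresholded Manders' coefficients (tM1, tM2) are commonly defined for two image channels. It is a simplified illustration, not the image-analysis workflow used for Fig. S2B: the function name, the flat pixel lists and the manually supplied thresholds are all assumptions of this example, and how the thresholds are chosen in practice is left open.

```python
def thresholded_manders(ch1, ch2, thr1, thr2):
    """Return (tM1, tM2) for two equally sized flat lists of pixel intensities.

    tM1: fraction of above-threshold channel-1 intensity located in pixels
         where channel 2 is also above its threshold.
    tM2: the same quantity with the channels swapped.
    """
    if len(ch1) != len(ch2):
        raise ValueError("channels must have the same number of pixels")

    # Total above-threshold signal in each channel.
    total1 = sum(a for a in ch1 if a > thr1)
    total2 = sum(b for b in ch2 if b > thr2)

    # Signal in pixels where both channels exceed their thresholds.
    coloc1 = sum(a for a, b in zip(ch1, ch2) if a > thr1 and b > thr2)
    coloc2 = sum(b for a, b in zip(ch1, ch2) if a > thr1 and b > thr2)

    tm1 = coloc1 / total1 if total1 else 0.0
    tm2 = coloc2 / total2 if total2 else 0.0
    return tm1, tm2


# Hypothetical six-pixel image: channel 1 = F-actin, channel 2 = CD63.
actin = [0, 10, 20, 0, 30, 0]
cd63 = [5, 0, 10, 15, 0, 20]
tm1, tm2 = thresholded_manders(actin, cd63, thr1=0, thr2=0)
print(tm1, tm2)
```

      Because the two coefficients are normalised by different channels, they can differ substantially, as in the reported case where roughly half of the F-actin tail signal overlaps with only a small fraction of the CD63 signal.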

      Finally, using live-cell imaging (Fig 2C; Supplementary Video S4), we directly observed CD63+ MVBs being propelled through the cytoplasm by Arp2/3-driven actin tails, suggesting a mechanistic role for branched actin assembly in MVB intracellular transport, rather than a generalized effect of actin disruption alone.

      We believe these combined data reinforce a more specific mechanistic role for Arp2/3-mediated branched actin in MVB/endosome transport and, consequently, in sEV secretion in VSMCs—over and above an indirect effect of global actin loss. We hope these additional experiments and quantitative analyses address the reviewer’s concern and clarify the functional relevance of branched actin structures to sEV trafficking:

      (1) Results, page 6, Ln 3 “As regulators of branched actin assembly, the Arp2/3 complex and cortactin are thought to contribute to sEV secretion in tumour cells by mediating MVB intracellular transport and plasma membrane docking[28, 33]. Therefore, we overexpressed the Arp2/3 subunit, ARPC2-GFP and the F-actin marker, F-tractin-RFP in VSMCs and performed live-cell imaging. As expected, Arp2/3 and F-actin bundles formed a distinct lamellipodia scaffold in the cellular cortex (Fig S2A and Supplementary Video S2). Unexpectedly, we also observed numerous  Arp2/3/F-actin positive spots moving  through the VSMC cytoplasm that resembled previously described endosome actin tails observed in Xenopus eggs[33] and parasite infected cells where actin comet tails propel parasites via filopodia to neighbouring cells[34, 35] (Fig S2A, arrow, and Supplementary Video S2). Analysis of the intracellular distribution of Arp2/3 and CD63-positive endosomes in VSMCs showed CD63-MVB propulsion by the F-actin tail in live cells (Fig 2C and Supplementary Video S4).”

      (2) Results, New data Fig 2D, page 6, Ln 14. “we observed numerous F-actin spots in fixed VSMCs that were positive both for F-actin and cortactin indicating that these are branched-actin tails (Fig 2D). Moreover, cortactin/F-actin spots colocalised with CD63+ endosomes and addition of the SMPD3 inhibitor, 3-OMS, induced the appearance of enlarged doughnut-like cortactin/F-actin/CD63 complexes resembling invadopodia-like structures similar to those observed in tumour cells (Fig 2D, arrowheads)[18].”

      (3) Results, New data Fig S2B, page 6, Ln 19 “To quantify CD63 overlap with the actin tail-like structures, we extracted round-shaped actin structures and calculated the thresholded Manders colocalization coefficient (Fig S2B).  We observed overlap between F-actin tails and CD63 as well as close proximity of these markers in fixed VSMCs (Fig S2B). Approximately 50% of the F-actin tails were associated with 13% of all endosomes (tM1=0.44±0.23 and tM2= 0.13±0.06, respectively, N=3). Addition of 3-OMS enhanced this overlap further (tM1=0.75±0.18 and tM2=0.25±0.09) suggesting that Arp2/3-driven branched F-actin tails are involved in CD63+ MVB intracellular transport in VSMCs”

      • In video 1 the author states (lines 8-9; pg6) "intense CD63 staining along filopodia" Although, there is some fluorescence (not strong) in these structures, there was no visible exocytic activity. This data is more suggestive that sEVs (marked by CD63) are not associated with filopodia. The following conclusion statement the authors make is overreaching given this result.

      We thank the reviewer for this careful observation and agree that the previous conclusion regarding sEV release from filopodia was overstated. In response, we have revised both the Results and Discussion sections to more accurately reflect the data.

      (1) Results, page 6, Ln 37 “We also attempted to visualise sEV release in filopodia using CD63-pHluorin where fluorescence is only observed upon the fusion of MVBs with the plasma membrane39. Using total internal reflection fluorescence microscopy (TIRF) we observed the typical “burst”-like appearance of sEV secretion at the cell-ECM interface in full agreement with an earlier report showing MVB recruitment to invadopodia-like structures in tumor cells18 (Fig S2B and Supplementary Video S1). Although we also observed an intense CD63-pHluorin staining along filopodia-like structures we were not able to detect typical “burst”-like events to confirm sEV secretion in filopodia (Fig S2C and Supplemental Video S1).”

      (2) Discussion, page 12, Ln 19 “Curiously we observed CD63+ MVB transport toward the filopodia tips as well as inhibition of sEV-secretion with filopodia formation inhibitors suggesting that sEV secretion can be directly linked to filopodia but further studies are needed to define the contribution of this pathway to the overall sEV secretion by cells.”

      • Fig 2D and video 2 are wholly unconvincing with regard to sEV secretion sites. The authors could use their CD63-pHluorin construct to count exocytic events in the filopodia vs the whole cell. Given the movie, I have a suspicion this would not be significant. The authors could also perform CD63 staining in non-permeabilized cells to capture and count exocytic events at the plasma membrane as well as their location between groups.

      We thank the reviewer for these constructive suggestions and their critical assessment of our current data regarding the sites of sEV secretion. We agree that our CD63-pHluorin approach clearly indicates sEV secretion events in the soma at the cell–ECM interface, while we did not observe comparable events in filopodia. Accordingly, we have clarified these points in the revised manuscript.

      (1) Results, page 6, Ln 37 “We also attempted to visualise sEV release in filopodia using CD63-pHluorin where fluorescence is only observed upon the fusion of MVBs with the plasma membrane39. Using total internal reflection fluorescence microscopy (TIRF) we observed the typical “burst”-like appearance of sEV secretion at the cell-ECM interface in full agreement with an earlier report showing MVB recruitment to invadopodia-like structures in tumor cells18 (Fig S2B and Supplementary Video S1). Although we also observed an intense CD63-pHluorin staining along filopodia-like structures we were not able to detect typical “burst”-like events to confirm sEV secretion in filopodia (Fig S2C and Supplemental Video S1).”

      (2) Discussion, page 12, Ln 19 “Curiously we observed CD63+ MVB transport toward the filopodia tips as well as inhibition of sEV-secretion with filopodia formation inhibitors suggesting that sEV secretion can be directly linked to filopodia but further studies are needed to define the contribution of this pathway to the overall sEV secretion by cells.”

      • Fig. 2E and video 4. Again, the conclusions drawn from this data are very strained. First, no co-localization quantification is presented on the proportion of CD63 vesicles with actin. Once again, the movie, if anything, convinces the reader that 95-99% of all CD63 vesicles are not associated with actin; therefore, this is an unlikely mechanism of transport.

      We thank the reviewer for this valuable comment and for highlighting the need for quantitative co-localization analysis. In response, we developed a method to systematically quantify F-actin and CD63 co-localization in fixed VSMCs, as now presented in new Figures 2D and S2B. We acknowledge that the majority of CD63+ endosomes are not associated with F-actin, consistent with the reviewer’s interpretation. However, our quantitative data now show that a specific subpopulation of MVBs appears to utilize this actin-based mechanism for transport. We believe this addresses the concern and more accurately reflects the prevalence and significance of the mechanism described.

      (1) Results, page 6, Ln 19 “To quantify CD63 overlap with the actin tail-like structures, we extracted round-shaped actin structures and calculated the thresholded Manders colocalization coefficient (Fig S2B). We observed overlap between F-actin tails and CD63 as well as close proximity of these markers in fixed VSMCs (Fig S2B). Approximately 50% of the F-actin tails were associated with 13% of all endosomes (tM1=0.44±0.23 and tM2=0.13±0.06, respectively, N=3). Addition of 3-OMS enhanced this overlap further (tM1=0.75±0.18 and tM2=0.25±0.09) suggesting that Arp2/3-driven branched F-actin tails are involved in CD63+ MVB intracellular transport in VSMCs.”

      • Are there perturbations that increase filopodia numbers? A gain of function experiment would be valuable here.

      We thank the reviewer for this important suggestion regarding the potential value of gain-of-function experiments to clarify filopodia’s contribution to sEV release. In agreement with the reviewer’s scepticism, we have removed statements linking filopodia to sEV release from both the title and abstract to avoid overinterpretation. At present, our understanding of filopodia biology and the lack of robust tools to selectively and substantially increase filopodia numbers in VSMCs prevent us from directly addressing this question through gain-of-function assays. We acknowledge that future studies using established methods—such as overexpression of filopodia-inducing proteins (e.g., mDia2 or fascin)—could provide insight into whether an increased number of filopodia affects sEV release. However, such experiments are beyond the scope of the current manuscript. We have made the following changes to clarify these points:

      (1) Results, page 6, Ln37 “We also attempted to visualise sEV release in filopodia using CD63-pHluorin where fluorescence is only observed upon the fusion of MVBs with the plasma membrane39. Using total internal reflection fluorescence microscopy (TIRF) we observed the typical “burst”-like appearance of sEV secretion at the cell-ECM interface in full agreement with an earlier report showing MVB recruitment to invadopodia-like structures in tumor cells18 (Fig S2B and Supplementary Video S1). Although we also observed an intense CD63-pHluorin staining along filopodia-like structures we were not able to detect typical “burst”-like events to confirm sEV secretion in filopodia (Fig S2C and Supplemental Video S1).”

      (2) Discussion, page 12, Ln19 “Curiously we observed CD63+ MVB transport toward the filopodia tips as well as inhibition of sEV-secretion with filopodia formation inhibitors suggesting that sEV secretion can be directly linked to filopodia but further studies are needed to define the contribution of this pathway to the overall sEV secretion by cells.”

      Figure 3

      • Fig 3A. The CD63 staining is strongly associated with the entire plasma membrane. How are the authors distinguishing between normal membrane shedding and bona fide sEVs based on this staining alone? This is insufficient, as all membrane structures are seemingly positive. Additionally, very few sEVs are visible on scrutinizing the provided images. For the "sEV secretion, fold change" graphs in previous figures, could the authors provide absolute values, or an indication of what these values are in absolute terms?

      We thank the reviewer for raising this important point regarding the specificity of CD63 staining and the need to distinguish bona fide sEVs from membrane fragments or general membrane shedding. We agree that CD63 staining alone at the plasma membrane or in the extracellular matrix is not sufficient to unequivocally identify sEVs. To address this, we employed several complementary approaches to rigorously characterize ECM-associated sEVs:

      First, using high-resolution iSIM imaging, we confirmed the association of CD63-positive particles specifically with the FN-rich matrix, and demonstrated that SMPD3 knockdown significantly reduced the number of CD63+ particles in the matrix (Fig. 3B; revised from Fig. S3A).

      Second, by incubating FN matrices with purified and fluorescently labeled sEVs, we directly observed efficient entrapment of these labeled sEVs within the matrices (Fig. 3E), confirming that sEVs can interact with and be retained by the ECM.

      Third, we developed and applied a sequential extraction protocol using mild salt buffer (0.5M NaCl) and strong denaturant (4M guanidine HCl) to selectively extract ECM-associated sEVs based on the strength of their association (see new Figs. S3A and S3B). Extracted vesicles were then characterized by ExoView analysis, which demonstrated a tetraspanin profile (CD63+/CD81+/CD9+) closely matching that of sEVs from conditioned media, providing evidence that these particles are true sEVs and not merely membrane debris. We also found that the more weakly bound (NaCl-extracted) fraction closely resembles media-derived sEVs, while the strongly bound (GuHCl-extracted) fraction is more enriched in CD63+ and CD63+/CD81+ sEVs but contains very few CD9+ vesicles, further supporting distinct extracellular vesicle subpopulations within the ECM.

      In addition, the abundance of CD63+/CD81+ sEVs in both media and ECM-derived fractions was independently validated by CD63 bead-capture assay (Fig. S3B).

      We hope these clarifications and the expanded data set address the reviewer’s concerns about sEV identification and quantification in the extracellular matrix:

      (1) Results, page 7, Ln 16. “To quantify ECM-trapped sEVs we applied a modified protocol for the sequential extraction of extracellular proteins using salt buffer (0.5M NaCl) to release sEVs which are loosely-attached to ECM via ionic interactions, followed by 4M guanidine HCl buffer (GuHCl) treatment to solubilize strongly-bound sEVs (Fig S3A) 42. We quantified total sEV and characterised the sEV tetraspanin profile in conditioned media, and the 0.5M NaCl and GuHCl fractions using ExoView. The total particle count showed that EVs are both loosely bound and strongly trapped within the ECM. sEV tetraspanin profiling showed differences between these 3 EV populations. There was close similarity between the conditioned media and the 0.5M NaCl fraction, with high abundance of CD63+/CD81+ sEVs as well as CD63+/CD81+/CD9+ in both fractions (Fig S3A). In contrast, the GuHCl fraction was particularly enriched with CD63+ and CD63+/CD81+ sEVs with very low abundance of CD9+ EVs (Fig S3A). The abundance of CD63+/CD81+ sEVs was confirmed independently by a CD63+ bead capture assay in the media and loosely bound fractions (Fig S3B).”

      • A control for Fig 3B would be helpful to parse out random uptake of extracellular debris versus targeted sEV internalization. It would be helpful if the authors added particles of similar size to that of the sEVs to test whether these structures are endocytosed/micropinocytosed at similar levels.

      We thank the reviewer for this useful suggestion regarding the need for better controls to distinguish specific sEV uptake from nonspecific internalization of extracellular debris or similarly sized particles. As a comparison, in our study we analyzed the uptake of both sEVs and serum proteins such as fibronectin and fetuin-A (Figs S3C and S3D), and observed similar patterns of intracellular trafficking. However, we acknowledge that inert nanoparticles or beads of a similar size to sEVs could serve as potential controls to assess nonspecific micropinocytosis or endocytosis.

      It is important to note, however, that the uptake of sEVs is strongly influenced by their surface protein composition and the so-called “protein corona.” Recent work from Prof. Khuloud T. Al-Jamal’s group underscores that exosome uptake mechanisms may be highly specific (Liam-Or et al., 2024), and studies from Mattias Belting’s lab have also shown the importance of heparan sulfate proteoglycans in exosome endocytosis (Cerezo-Magana et al., 2021). As a result, uptake comparisons with inert particles or beads may not fully recapitulate the specificity of sEV internalization, and distinct nanoparticle classes may rely on different uptake pathways.

      Figure 4

      • Fig. 4E,F,G. How are the authors determining the neointima and media compartments without ancillary staining for basement membrane or endothelial markers? Anatomically specific markers need to be incorporated here for the reader to evaluate the specificity of the FN and CD81 staining. It is also hard to understand the severity of the atherosclerotic lesion without a companion H&E cross section.

      We thank the reviewer for highlighting the need for more rigorous characterization of atherosclerotic lesion architecture and anatomical compartments in our study. In response, we have incorporated additional histological analyses and now provide ancillary staining and companion images to enable clear identification of the neointima and medial compartments, as well as to assess lesion severity (see new Figs S4A–S4D):

      (1) Results, page 8, Ln 28. “To test if FN associates with sEV markers in atherosclerosis, we investigated the spatial association of FN with sEV markers using the sEV-specific marker CD81. Staining of atherosclerotic plaques with haematoxylin and eosin revealed well-defined regions with the neointima as well as tunica media layers formed by phenotypically transitioned or contractile VSMCs, respectively (Fig S4A). Masson's trichrome staining of atherosclerotic plaques showed abundant haemorrhages in the neointima, and sporadic haemorrhages in the tunica media (Fig S4B). Staining of atherosclerotic plaques with orcein indicated weak connective tissue staining in the atheroma with a confluent extracellular lipid core, and strong specific staining at the tunica media containing elastic fibres which correlated well with the intact elastin fibrils in the tunica media (Figs S4C and S4D). Using this clear morphological demarcation, we found that FN accumulated both in the neointima and the tunica media where it was significantly colocalised with the sEV marker, CD81 (Fig. 4D, 4E and 4F). Notably CD81 and FN colocalization was particularly prominent in cell-free, matrix-rich plaque regions (Figs. 4E and 4F).”

      • Figs S4C, S4D: proper controls are not provided. Again, a non-FN internalization control as well as a 4 °C cold-block negative control are required to interpret these data.

      We thank the reviewer for this valuable suggestion. To enhance the rigor of our internalization assays, we have now included several additional controls using alternative treatments, fluorophore combinations, and internalization conditions:

      a) We performed FN-Alexa568 uptake assays, followed by immunostaining for CD63 with a distinct fluorophore (Alexa488), to confirm the colocalization of internalized FN with CD63+ endosomal compartments in VSMCs (new Fig. S3E).

      b) We also stained VSMCs, cultured under normal growth conditions, with an anti-FN antibody to visualize intracellular serum-derived FN and again observed colocalization with CD63 (new Figs. S3F and S3G). Notably, in cells grown to confluence, we observed a complete loss of intracellular FN staining and FN/CD63 colocalization, suggesting that FN recycling is prominent in sparse, motile cells, but not in confluent populations.

      These additional controls strengthen our conclusions regarding FN internalization pathways and the conditions under which FN trafficking to the endosomal system occurs:

      (1) Results, page 7, Ln 31. “We treated serum-deprived primary human aortic VSMCs with FN-Alexa568 and found that it was endocytosed and subsequently delivered to early and late endosomes together with fetuin A, another abundant serum protein that is a recycled sEV cargo and elevated in plaques (Figs S3C and S3D). CD63 visualisation with a different fluorophore (Alexa488) confirmed FN colocalization with CD63+ MVBs (Fig S3E). Next, we stained non-serum-deprived VSMCs cultured in normal growth media (RPMI supplemented with 20% FBS) with an anti-FN antibody and observed colocalization of CD63 and serum-derived FN. Co-localisation was reduced, likely due to competitive bulk protein uptake by non-deprived cells (Fig S3F). Notably, when we compared FN distribution in sparsely growing VSMCs versus confluent cells we found that FN intracellular spots, as well as colocalization with CD63, completely disappeared in the confluent state (Fig S3F and S3G).”

      • Can the authors please provide live and fixed imaging of FN and CD63-mediated filopodial secretion to amply support their conclusions?

      We have observed CD63 MVBs in both fixed (Fig 2E) and live VSMCs (Fig 2F) yet we agree that further studies are required to establish the contribution of filopodia to sEV secretion. Therefore, we have added the following changes:

      (1) Results, page 6, Ln37 “We also attempted to visualise sEV release in filopodia using CD63-pHluorin where fluorescence is only observed upon the fusion of MVBs with the plasma membrane39. Using total internal reflection fluorescence microscopy (TIRF) we observed the typical “burst”-like appearance of sEV secretion at the cell-ECM interface in full agreement with an earlier report showing MVB recruitment to invadopodia-like structures in tumor cells18 (Fig S2B and Supplementary Video S1). Although we also observed an intense CD63-pHluorin staining along filopodia-like structures we were not able to detect typical “burst”-like events to confirm sEV secretion in filopodia (Fig S2C and Supplemental Video S1).”

      (2) Discussion, page 12, Ln19 “Curiously we observed CD63+ MVB transport toward the filopodia tips as well as inhibition of sEV-secretion with filopodia formation inhibitors suggesting that sEV secretion can be directly linked to filopodia but further studies are needed to define the contribution of this pathway to the overall sEV secretion by cells.”

      Figure 5

      • Fig. 5A,B. The authors claim that sEV supplementation enhances VSMC migration speed and distance. The provided graphs show only a marginal increase in speed with sEV addition (A) but, concerningly, there is a four-star significant difference between the FN condition compared with FN+sEV (B) while the means appear the same. How are these conditions statistically different? The statistics seem off for these comparisons.

      We thank the reviewer for highlighting concerns regarding the statistical analysis in Figures 5A and 5B. In response, we have carefully re-examined our data and statistical approach to ensure accuracy and transparency.

      First, we have now included all individual cell migration tracks in the data representation for these figures. The statistical tests were repeated using the Kruskal–Wallis test with Dunn’s multiple comparison correction across all groups. This more stringent analysis confirmed our key findings: fibronectin (FN) stimulates VSMC migration speed, while inhibition of sEV secretion (with 3-OMS) reduces cellular speed (Fig. 5A). Addition of exogenous ECM-associated sEVs modestly restored cell speed in the presence of 3-OMS, but had no effect on baseline migration speed in 2D or 3D models (Figs. 5A, 5D).

      Regarding the four-star significance observed in the original Fig. 5B, the previous result reflected an analysis based on pooled group averages, which may have overstated marginal differences. The revised analysis, based on individual cell tracks, does not support a substantial difference between FN and FN+sEV groups. The revised p-values and comparisons are now provided directly on the figures and described in the figure legends. We also clearly report the numbers of biological replicates, technical replicates, and individual data points for every condition.
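
      The Kruskal-Wallis test followed by Dunn's pairwise comparisons, as named in the response above, can be sketched as follows. This is a minimal illustration on per-cell values, not the authors' analysis: the function name `kruskal_dunn`, the group labels, and the Bonferroni adjustment of the pairwise p-values are assumptions, since the response does not state which software or adjustment was used. The omnibus test uses `scipy.stats.kruskal`; Dunn's z statistics are computed by hand here from the pooled ranks, with the usual tie correction.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def kruskal_dunn(groups):
    """Kruskal-Wallis omnibus test plus Dunn's pairwise comparisons.

    groups: dict mapping a condition name to a 1D sequence of per-cell
            measurements (e.g. migration speed per tracked cell).
    Returns (H statistic, omnibus p-value, {(name_i, name_j): adjusted p}).
    Pairwise p-values are Bonferroni-adjusted (an assumption of this sketch).
    """
    names = list(groups)
    samples = [np.asarray(groups[n], dtype=float) for n in names]
    h_stat, h_p = stats.kruskal(*samples)

    # Rank all observations together; ties receive average ranks.
    pooled = np.concatenate(samples)
    ranks = stats.rankdata(pooled)
    n_total = pooled.size

    # Variance term for rank differences, corrected for ties.
    _, counts = np.unique(pooled, return_counts=True)
    tie_term = (counts**3 - counts).sum() / (12.0 * (n_total - 1))
    base_var = n_total * (n_total + 1) / 12.0 - tie_term

    # Mean rank of each group, in the order the groups were concatenated.
    bounds = np.cumsum([0] + [s.size for s in samples])
    mean_ranks = [ranks[bounds[i]:bounds[i + 1]].mean()
                  for i in range(len(samples))]

    pairs = list(combinations(range(len(names)), 2))
    adjusted = {}
    for i, j in pairs:
        se = np.sqrt(base_var * (1.0 / samples[i].size + 1.0 / samples[j].size))
        z = (mean_ranks[i] - mean_ranks[j]) / se
        p_raw = 2.0 * stats.norm.sf(abs(z))          # two-sided p-value
        adjusted[(names[i], names[j])] = min(1.0, p_raw * len(pairs))
    return h_stat, h_p, adjusted
```

      Because every tracked cell enters the ranking as its own observation, the result depends on how many cells are tracked per condition, which is why reporting the numbers of biological replicates, technical replicates, and individual data points alongside the p-values matters.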

      Further, the modest effect of ECM-associated sEVs on speed is consistent with our observation that sEVs influence invasion directionality rather than baseline migration velocity, in agreement with previous findings in tumor models (Sung et al., 2015).

      The manuscript has been revised accordingly, with updates in:

      (1) Figures 5A and 5B: Individual cell track data are now shown, and statistical analyses have been repeated using the Kruskal–Wallis test with Dunn’s multiple comparisons.

      (2) Figure legends and results sections: Numbers of biological and technical replicates, as well as individual data points, are now clearly stated.

      (3) Results, page 9, line 14:  “FN as a cargo in sEVs promotes FA formation in tumour cells and increases cell speed14, 15. As we found that FN is loaded into VSMC-derived sEVs we hypothesized that ECM-entrapped sEVs can enhance cell migration by increasing cell adhesion and FA formation in the context of a FN-rich ECM. Therefore, we tested the effect of sEV deposition onto the FN matrix on VSMC migration in 2D and 3D models. We found that FN coating promoted VSMC velocity and inhibition of bulk sEV secretion with 3-OMS reduced VSMC speed in a 2D single-cell migration model (Figs. 5A, 5B) in agreement with previous studies using tumour cells14, 15. However, addition of sEVs to the ECM had no effect on VSMC speed at baseline but rescued cell speed and distance in the presence of the sEV secretion inhibitor, 3-OMS suggesting the EVs are not primarily regulating cell speed (Figs 5A and 5B).”

      (4) Results, page 9, Ln 29 “Hence, ECM-associated sEVs have modest influence on VSMC speed but influence VSMC invasion directionality.”.

      We hope that these changes address the reviewer’s concerns and improve the transparency and reproducibility of our data presentation.

      • Fig. 5D-H. Generally, the magnitude of the difference between the presented conditions is biologically insignificant. Several of the graphs show a four-star difference with means that appear equivalent with overlapping error bars. Do the authors conclude that a 0.1%, or less, effect between groups is biologically meaningful?

      We thank the reviewer for drawing attention to the apparent mismatch between statistical significance and biological relevance in Figures 5d–h. In response, we have reanalyzed the data using individual cell tracks and more stringent non-parametric statistical tests, as described above. This reanalysis confirmed that the magnitude of differences in migration speed and related parameters between the groups is minimal and not biologically meaningful. Thus, we no longer claim that sEVs significantly affect VSMC migration speed under these conditions in either 2D or 3D assays. Our revised manuscript now accurately reflects this finding in both the Results and Discussion sections, and the updated figures and legends clarify the true extent of any differences observed.

      Figure 6

      • Generally, the authors' logic for looking into adhesion, focal adhesion and traction forces is hard to follow. If there are sEV-mediated migration differences, then there would inexorably be focal adhesion alterations. However, the data indicate few differences brought on by sEVs, which speaks to the lack of migration differences presented in Fig. 5. Overall, the sEV migration phenotype has so little of an effect that searching for a mechanism seems destined not to turn up anything significant.

      We thank the reviewer for highlighting the importance of connecting the observed phenotypic effects of sEVs to the investigation of adhesion and focal adhesion mechanisms. While our revised analysis confirms that sEVs have little to no effect on VSMC migration speed or distance in 2D and 3D models, we did observe a robust effect of sEVs on the directionality of cell invasion (Figs. 5G and 5H). This prompted us to look more closely at pathways involved in cell guidance rather than bulk cell motility.

      Our proteomic comparison between larger EVs (10K fraction) and sEVs (100K fraction) revealed a unique adhesion complex present specifically on the sEVs—comprising collagen VI, TGFBI, LGALS3BP, and EDIL3 (Figs. 7A–C)—each of which has previously been implicated in integrin signaling, cell adhesion, or invasion. Functional blocking and knockdown studies further identified collagen VI as a key mediator in the regulation of cell adhesion and invasion directionality influenced by sEVs (Figs. 7F and 7I).

      In response to this mechanistic insight, we have modified the graphical abstract and discussion to clarify our approach:

      We now explicitly state that our focus has shifted from analyzing baseline migration speed to mechanisms guiding invasion directionality, in line with our key phenotypic findings.

      We highlight that the unique adhesion cluster identified on sEVs, including collagen VI and its cooperative partners, provides a strong mechanistic rationale for examining focal adhesion dynamics and ECM interactions, even in the absence of changes in migration velocity.

      Discussion excerpts (pages 13–14) have been updated to reflect this rationale and to summarize the potential significance of these findings for vascular biology and disease.

      We hope this clarifies the logic underlying our approach and justifies the mechanistic studies performed in this context:

      (1) Discussion, page 13, Ln 2  “Hence, it will be interesting in future studies to investigate whether sEVs can stimulate Rho activity by presenting adhesion modulators—particularly collagen VI—on their surface, thereby guiding cell directionality during invasion.”

      (2) Discussion, page 13, Ln 30 “In addition to collagen VI the unique adhesion cluster in VSMC-derived sEVs also includes EGF-like repeat and discoidin I-like domain-containing protein (EDIL3), transforming growth factor-beta-induced protein ig-h3 (TGFBI) and the lectin galactoside-binding soluble 3 binding protein (LGALS3BP) and these proteins are also directly implicated in activation of integrin signalling and cellular invasiveness85-87. Although we found that collagen VI plays the key role in sEV-induced early formation of FAs in VSMCs, it is tempting to speculate that the high sEV efficacy in stimulating FA formation is driven by cooperative action of this unique adhesion complex on the sEV surface and targeting this novel sEV-dependent mechanism of VSMC invasion may open up new therapeutic opportunities to modulate atherosclerotic plaque development or even to prevent undesired VSMC motility in restenosis.”

      (3) Discussion, page 14, Ln 14 “In summary, cooperative activation of integrin signalling and F-actin cytoskeleton pathways results in the secretion of sEVs which associate with the ECM and play a signalling role by controlling FA formation and cell-ECM crosstalk. Further studies are needed to test these mechanisms across various cell types and ECM matrices.”

      Figure 7

      • The authors need to provide additional evidence Col IV is harbored in sEVs and not a contaminant of sEV isolation as VSMCs secrete a copious amount of this in culture. For instance, IHC of isolated sEVs stained for CD63 and Col IV as well as single cell staining of the same sort.

      We thank the reviewer for this important comment regarding the specificity of collagen VI detection in sEVs. To ensure that collagen VI is associated with bona fide sEVs—rather than being a contaminant resulting from high extracellular abundance—we performed a comparative analysis of vesicles isolated from the same conditioned media. Both proteomic mass spectrometry and western blotting revealed that collagen VI was exclusively present in the small EV (100K pellet) fraction and not in the larger EVs (10K pellet), as shown in Figs. 7B and 7C. Collagen VI was further identified in sEVs extracted from the ECM using our salt/guanidine protocol (new Fig. 7D).

      Reviewer #2 (Recommendations For The Authors):

      The authors have presented a nice collection of data with strong approaches to address their hypotheses. Nevertheless, an additional section within the Discussion would be welcome in addressing the potential limitations and important caveats to be considered alongside their study. These caveats and limitations could be reshaped by additional data supporting the ideas that: (1) small extracellular vesicles can be directly observed during their secretion from filopodia, (2) CD81 labeling in tissue can be interpreted clearly as extracellular vesicles and not the cell surface of other cell types (co-staining with an endothelial cell marker such as PECAM-1 perhaps), and (3) collagen VI within the vesicles is somehow accessed by adhesion molecules on the cell surface of migrating cells.

      We thank the reviewer for these important suggestions and we have now added further studies and modified our conclusions to reflect the data more accurately:

      (1) Results, page 6, Ln37 “We also attempted to visualise sEV release in filopodia using CD63-pHluorin where fluorescence is only observed upon the fusion of MVBs with the plasma membrane39. Using total internal reflection fluorescence microscopy (TIRF) we observed the typical “burst”-like appearance of sEV secretion at the cell-ECM interface in full agreement with an earlier report showing MVB recruitment to invadopodia-like structures in tumor cells18 (Fig S2B and Supplementary Video S1). Although we also observed an intense CD63-pHluorin staining along filopodia-like structures we were not able to detect typical “burst”-like events to confirm sEV secretion in filopodia (Fig S2C and Supplemental Video S1).”

      (2) Discussion, page 12, Ln18: “Here we report that β1 integrin activation triggers sEV release followed by sEV entrapment by the ECM. Curiously we observed CD63+ MVB transport toward the filopodia tips as well as inhibition of sEV-secretion with filopodia formation inhibitors suggesting that sEV secretion can be directly linked to filopodia but further studies are needed to define the contribution of this pathway to the overall sEV secretion by cells.”

      We quantified the colocalization of CD81 and CD31 to exclude an endothelial cell origin of the sEVs, extended the characterisation of the atherosclerotic matrix, and highlighted the limitations to interpretation regarding CD81 ECM localisation:

      (1) Results, page 8, Ln 43 “An enhanced expression of CD81 by endothelial cells in early atheroma has been previously reported so to study the contribution of CD81+ sEVs derived from endothelial cells we investigated the localisation of CD31 and CD81 (ref. 45). In agreement with a previous study, we found that the majority of CD31 colocalises with CD81 (thresholded Manders split colocalization coefficient 0.54±0.11, N=6) indicating that endothelial cells express CD81 (Fig 4G; ref. 45). However, only a minor fraction of total CD81 colocalised with CD31 (thresholded Manders split colocalization coefficient 0.24±0.06, N=6) confirming that the majority of CD81 in the neointima is originating from the most abundant VSMCs.”

      (2) Results, page 8, Ln 28: “To test if FN associates with sEV markers in atherosclerosis, we investigated the spatial association of FN with sEV markers using the sEV-specific marker CD81. Staining of atherosclerotic plaques with haematoxylin and eosin revealed well-defined regions with the neointima as well as tunica media layers formed by phenotypically transitioned or contractile VSMCs, respectively (Fig S4A). Masson's trichrome staining of atherosclerotic plaques showed abundant haemorrhages in the neointima, and sporadic haemorrhages in the tunica media (Fig S4B). Staining of atherosclerotic plaques with orcein indicated weak connective tissue staining in the atheroma with a confluent extracellular lipid core, and strong specific staining at the tunica media containing elastic fibres which correlated well with the intact elastin fibrils in the tunica media (Figs S4C and S4D). Using this clear morphological demarcation, we found that FN accumulated both in the neointima and the tunica media where it was significantly colocalised with the sEV marker, CD81 (Fig. 4D, 4E and 4F). Notably CD81 and FN colocalization was particularly prominent in cell-free, matrix-rich plaque regions (Figs. 4E and 4F).”

      We showed that collagen VI is presented on the surface of sEVs:

      (1) Results, page 10, Ln43: “Collagen VI was the most abundant protein in VSMC-derived sEVs (Fig 7B, Table S7) and was previously implicated in the interaction with the proteoglycan NG2 (ref. 53) and suppression of cell spreading on FN (ref. 54). To confirm the presence of collagen VI in ECM-associated sEVs we analysed sEVs extracted from the 3D matrix using 0.5M NaCl treatment and showed that both collagen VI and FN are present (Fig 7D). Next, we analysed the distribution of collagen VI using dot-blot. Alix staining was bright only upon permeabilization of sEVs, indicating that it is preferentially a luminal protein (Fig 7E). On the contrary, CD63 staining was similar in both conditions, showing that it is a surface protein (Fig 7E). Interestingly, collagen VI staining revealed that 40% of the protein is located on the outside surface with 60% in the sEV lumen (Fig 7E).”

    1. Reviewer #3 (Public review):

      Summary:

      The present manuscript investigates and proposes different mechanisms for the effects of two therapeutic approaches - cognitive distancing technique and use of antidepressants - on subjective ratings of happiness, confidence, and task engagement, and on the influence of such subjective experiences on choice behavior. Both approaches were found to link to changes in affective state dynamics in a choice task, specifically reduced drift (cognitive distancing) and increased baseline (antidepressant use). Results also suggest that cognitive distancing may reduce the weighting of recent expected values in the happiness model, while antidepressant use may reduce forgetting of choices and outcomes.

      Strengths:

      This is a timely topic and a significant contribution to ongoing efforts to improve our mechanistic understanding of psychopathology and devise effective novel interventions. The relevance of the manuscript's central question is clear, and the links to previous literature and the broader field of computational psychiatry are well established. The modelling approaches are thoughtful and rigorously tested, with appropriate model checks and persuasive evidence that modelling complements the theoretical argument and empirical findings.

      Weaknesses:

      Some vagueness and lack of clarity in theoretical mechanisms and interpretation of results leave outstanding questions regarding (a) the specific links drawn between affective biases, therapies aimed at mitigating them, and mental health function, and (b) the structure and assumptions of the modelling, and how they support the manuscript's central claims. Broadly, I do not fully understand the distinction between how choice behavior vs. affect are impacted separately or together by cognitive distancing. Clarification on this point is needed, possibly through a more explicit proposal of a mechanism (or several alternative mechanisms?) in the introduction and more explicit interpretation of the modelling results in the context of the cyclical choice-affect mechanism.

      (1) Theoretical framework and proposed mechanisms

      The link between affective biases and negative thinking patterns is a bit unclear. The authors seem to make a causal claim that "affective biases are precipitated and maintained by negative thinking patterns", but it is unclear what precisely these negative patterns are; earlier in the same paragraph, they state that affective biases "cause low mood" and possibly shift choices toward those that maintain low mood. So the directionality of the mechanism here is unclear - possibly explaining a bit more of the cyclic nature of this mechanism, and maybe clarifying what "negative thinking patterns" refer to will be helpful.

      More generally, this link between affect and choices, especially given the modelling results later on, should be clarified further. What is the mechanism by which these two impact each other? How do the models of choice and affect ratings in the RL task test this mechanism? I'm not quite sure the paper answers these questions clearly right now.

      The authors also seem to implicitly make the claim that symptoms of mental ill-health are at least in part related to choice behavior. I find this a persuasive claim generally; however, it is understated and undersupported in the introduction, to the point where a reader may need to rely on significant prior knowledge to understand why mitigating the impact of affective biases on choice behavior would make sense as the target of therapeutic interventions. This is a core tenet of the paper, and it would be beneficial to clarify this earlier on.

      It would be helpful to interpret a bit more clearly the findings from 3.4. on decreased drift in all three subjective assessments in the cognitive distancing group. What is the proposed mechanism for this? The discussion mentions that "attenuated declines [...] over time, [add] to our previously reported findings that this psychotherapeutic technique alters aspects of reward learning" - but this is vague and I do not understand, if an explanation for how this happens is offered, what that explanation is. Given the strong correlation of the drift with fatigue, is the explanation that cognitive distancing mitigates affect drift under fatigue? Or is this merely reporting the result without an interpretation around potential mechanisms?

      (Relatedly, aside from possibly explaining the drift parameter, do the fatigue ratings link with choice behavior in any way? Is it possible that the cognitive distancing was helping participants improve choices under fatigue?)

      (2) Task Structure and Modelling

      It is unclear what counted as a "rewarding" vs. "unrewarding" trial in the model. From my understanding of the task description, participants obtained positive or no reward (no losses), and verbal feedback, Correct/Incorrect. But given the probabilistic nature of the task, it follows that even some correct choices likely had unrewarding results. Was the verbal feedback still "Correct" in those cases, but with no points shown? I did not see any discussion on whether it is the #points earned or the verbal feedback that is considered a reward in the model. I am assuming the former, but based on previous literature, likely both play a role; so it would be interesting - and possibly necessary to strengthen the paper's argument - to see a model that assigns value to positive/negative feedback and earned points separately.

      From a theory perspective, it's interesting that the authors chose to assume separate learning rates for rewarding and non-rewarding trials. Why not, for example, separate reward sensitivity parameters? E.g., rather than a scaling parameter on the PE, a parameter modifying the r term inside the PE equation to, perhaps, assign different values to positive and zero points? (While I think overall the math works out similarly at the fitting time, this type of model should be less flexible on scaling the expected value and more flexible on scaling the actual #points / the subjective experience of the obtained verbal feedback, which seems more in line with the theoretical argument made in the introduction). The introduction explicitly states that negative biases "may cause low mood by making outcomes appear less rewarding" - which in modelling equations seems more likely to translate to different reward-perception biases, and not different learning rates. Alternatively, one might incorporate a perseveration parameter (e.g., similar to Collins et al. 2014) that would also accomplish a negative bias. Either of these two mechanisms seems perhaps worth testing out in a model - especially in a model that defines more clearly what rewarding vs. unrewarding may mean to the participant.
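
      To make the contrast in this comment concrete, the two parameterizations can be sketched as below. This is a minimal illustration, not the manuscript's model: the function and parameter names (`alpha_pos`, `alpha_neg`, `rho_pos`, `rho_zero`) are hypothetical, and treating any outcome with more than zero points as "rewarding" is an assumption.

```python
def update_dual_learning_rate(q, reward, alpha_pos, alpha_neg):
    """Value update with a separate learning rate per outcome type.

    The prediction error (PE) uses the objective reward; only the step
    size differs between rewarding and non-rewarding trials.
    """
    pe = reward - q
    alpha = alpha_pos if reward > 0 else alpha_neg  # assumption: reward > 0 means "rewarding"
    return q + alpha * pe


def update_reward_sensitivity(q, reward, alpha, rho_pos, rho_zero):
    """Value update with one learning rate and a rescaled outcome.

    rho_pos scales positive outcomes; rho_zero is the subjective value
    assigned to a zero-point outcome (both hypothetical parameters).
    """
    subjective = rho_pos * reward if reward > 0 else rho_zero
    pe = subjective - q
    return q + alpha * pe


if __name__ == "__main__":
    q0 = 0.5
    print(update_dual_learning_rate(q0, 1.0, alpha_pos=0.4, alpha_neg=0.1))
    print(update_dual_learning_rate(q0, 0.0, alpha_pos=0.4, alpha_neg=0.1))
    print(update_reward_sensitivity(q0, 1.0, alpha=0.2, rho_pos=0.8, rho_zero=-0.2))
    print(update_reward_sensitivity(q0, 0.0, alpha=0.2, rho_pos=0.8, rho_zero=-0.2))
```

      In the first variant the outcome type changes how fast the value moves toward the outcome; in the second it changes the value the outcome is perceived to have, which is closer to the idea of outcomes appearing less rewarding.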

      If I understand correctly, the affect ratings models assume that the Q-value and the PE independently impact rating (so they have different weights, w2 and w3), but there is no parameter allowing for different impact for perceived rewarding and unrewarding outcomes? (I may be misreading equations 4-5, but if not, Q-value and PE impact the model via static rather than dynamic parameters.) Given the joint RL-affect fit, this seems to carry the assumption that any perceptual processing differences leading to different subjective perceptions of reward associated with each outcome only impact choice behavior, but not affect? (whereas affect is more broadly impacted, if I'm understanding this correctly, just by the magnitude of the values and PEs?) This is an interesting assumption, and the authors seem to have tested it a bit more in the Supplementary material, as shown in Figure S4. I'm wondering why this was excluded from the main text - it seems like the more flexible model found some potentially interesting differences which may be worth including, especially as they might shed additional insight into the influence of cognitive distancing on the cyclical choice-affect mechanisms proposed.

      Minor comments:

      If fatigue ratings were strongly associated with drift in the best-fitting model (as per page 13), I wonder if it would make sense to use those fatigue ratings as a proxy rather than allow the parameter to vary freely? (This does not in any way detract from the winning model's explanatory power, but if a parameter seems to be strongly explained by a variable we have empirical data for, it's not clear what extra benefit is earned by having that parameter in the model).

    1. and that the Lord may behold us as a People offering Praise and thereby glorifying Him

      They want to receive praise from God for offering a day of peace where pilgrims and natives can feast together. I interpret this to mean that they view God as kind and forgiving and therefore think he will favor them if they behave similarly, even though they view the natives as heathens. This helps me understand how they interacted with the Native Americans, how they thought about God at the time, and what interpretations they used. It also shows how what we now celebrate as Thanksgiving has changed over time from how it originally started.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public review): 

      Summary: 

      Ferreiro et al. present a method to simulate protein sequence evolution under a birth-death model where sequence evolution is guided by structural constraints on protein stability. The authors then use this model to explore the predictability of sequence evolution in several viral proteins. In principle, this work is of great interest to molecular evolution and phylodynamics, which has struggled to couple non-neutral models of sequence evolution to phylodynamic models like birth-death processes. Unfortunately, though, the model shows little improvement over neutral models in predicting protein sequence evolution, although it can predict protein stability better than models assuming neutral evolution. It appears that more work is needed to determine exactly what aspects of protein sequence evolution are predictable under such non-neutral phylogenetic models. 

      We thank the reviewer for the positive comments about our work. We agree that further work is needed in the field of substitution models of molecular evolution to enable more accurate predictions of specific amino acid sequences in evolutionary processes.

      Major concerns: 

      (1) The authors have clarified the mapping between birth-death model parameters and fitness, but how fitness is modeled still appears somewhat problematic. The authors assume the death rate = 1 - birth rate. So a variant with a birth rate b = 1 would have a death rate d = 0 and so would be immortal and never die, which does not seem plausible. Also I'm not sure that this would "allow a constant global (birth-death) rate" as stated in line 172, as selection would still act to increase the population mean growth rate r = b - d. It seems more reasonable to assume that protein stability affects only either the birth or death rate and assume the other rate is constant, as in the Neher 2014 model. 

      The model proposed by Neher, et al. (2014), which incorporates a death rate (d) higher than 0 for any variant, was implemented and applied in the present method. In general, this model did not yield results different from those obtained using the model that assumes d = 1 – b, suggesting that this aspect may not be crucial for the study system. Next, the imposition of arbitrary death events based on an arbitrary death rate could be a point of concern. Regarding the original model, a variant with d = 0 can experience a decrease in fitness through the mutation process. In an evolutionary process, each variant is subject to mutation, and Markov models allow for the incorporation of mutations that decrease fitness (albeit with lower probability than beneficial ones, but they can still occur). All this information is included in the manuscript.

      (2) It is difficult to evaluate the predictive performance of protein sequence evolution. This is in part due to the fact that performance is compared in terms of percent divergence, which is difficult to compare across viral proteins and datasets. Some protein sequences would be expected to diverge more because they are evolving over longer time scales, under higher substitution rates or under weaker purifying selection. It might therefore help to normalize the divergence between predicted and observed sequences by the expected or empirically observed amount of divergence seen over the timescale of prediction. 

      AU: The study protein datasets showed different levels of sequence divergence over their evolutionary times, as indicated for each dataset in the manuscript. For some metrics, we evaluated the accuracy (or error) of the predictions through direct comparisons between real and predicted protein variants using percentages to facilitate interpretation: 0% indicates a perfect prediction (no error), while 100% indicates a completely incorrect prediction (total error). Regarding normalization of these evaluations, we respectfully disagree with the suggestion because diverse factors can affect divergence (not only the substitution rate, but also the sample size and structural features of the protein that may affect stability when accommodating different sequences, among others), and this complicates defining a consistent and meaningful normalization criterion. Given that the manuscript provides detailed information for each dataset, we believe that presenting the prediction accuracy through direct comparisons between real and predicted protein variants, expressed as percentages of similarity, is the clearest approach.
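
      For concreteness, the normalization proposed in comment (2) could take a form like the sketch below. It is only an illustration of the reviewer's suggestion, not a metric described in the responses quoted here; the function names and toy sequences are hypothetical, and equal-length aligned sequences are assumed.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, non-empty, and equal in length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def normalized_prediction_error(start, observed, predicted):
    """Prediction error divided by the divergence observed over the same interval.

    start:     sequence at the initial time point
    observed:  real sequence at the later time point
    predicted: forecast sequence for the later time point
    """
    observed_divergence = p_distance(start, observed)
    if observed_divergence == 0:
        raise ValueError("no observed divergence; the ratio is undefined")
    return p_distance(predicted, observed) / observed_divergence


if __name__ == "__main__":
    start = "MKTAYIAK"      # toy sequences, not real data
    observed = "MKSAYLAK"
    predicted = "MKSAYIAK"
    print(normalized_prediction_error(start, observed, predicted))
```

      A ratio near 1 would mean the forecast is about as far from the observed sequence as the starting sequence was, while a ratio below 1 would mean it is closer.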

      (3) Predictability may also vary significantly across different sites in a protein. For example, mutations at many sites may have little impact on structural stability (in which case we would expect poor predictive performance) while even conservative changes at other sites may disrupt folding. I therefore feel that there remains much work to be done here in terms of figuring out where and when sequence evolution might be predictable under these types of models, and when sequence evolution might just be fundamentally unpredictable due to the high entropy of sequence space. 

      We agree with this reflection. Mutations can have different effects on folding stability, which are accounted for by the model presented in this study. However, accurately predicting the exact sequences of protein variants with similar stability remains difficult with current structurally constrained substitution models, and therefore, further work is needed in this regard. This aspect is indicated in the manuscript.

      We want to thank the reviewer again for taking the time to revise our work and for the insightful and helpful comments.

      Reviewer #2 (Public review): 

      In this study, the authors aim to forecast the evolution of viral proteins by simulating sequence changes under a constraint of folding stability. The central idea is that proteins must retain a certain level of structural stability (quantified by folding free energy, ΔG) to remain functional, and that this constraint can shape and restrict the space of viable evolutionary trajectories. The authors integrate a birth-death population model with a structurally constrained substitution (SCS) model and apply this simulation framework to several viral proteins from HIV-1, SARS-CoV-2, and influenza.

      The motivation to incorporate biophysical constraints into evolutionary models is scientifically sound, and the general approach aligns with a growing interest in bridging molecular evolution and structural biology. The authors focus on proteins where immune pressure is limited and stability is likely to be a dominant constraint, which is conceptually appropriate. The method generates sequence variants that preserve folding stability, suggesting that stability-based filtering may capture certain evolutionary patterns. 

      Correct. We thank the reviewer for the positive comments about our study.

      However, the study does not substantiate its central claim of forecasting. The model does not predict future sequences with measurable accuracy, nor does it reproduce observed evolutionary paths. Validation is limited to endpoint comparisons in a few datasets. While KL divergence is used to compare amino acid distributions, this analysis is only applied to a single protein (HIV-1 MA), and there is no assessment of mutation-level predictive accuracy or quantification of how well simulated sequences recapitulate real evolutionary paths. No comparison is made to real intermediate variants available from extensive viral sequencing datasets which gather thousands of sequences with detailed collection date annotation (SARS-CoV-2, Influenza, RSV). 

      There are several points in this comment.

      The presented method accurately predicts the folding stability of forecasted variants, as shown through comparisons between real and predicted protein variants. However, as the reviewer correctly indicates, predicting the exact amino acid sequences remains challenging. This limitation is discussed in detail in the manuscript, where we also suggest that further improvements in substitution models of protein evolution are needed to better capture the evolutionary signatures of amino acid change at the sequence level, even between amino acids with similar physicochemical properties. Regarding the time points used for validation, the studied influenza NS1 dataset included two validation points. A key limitation in increasing the number of time points is the scarcity of datasets derived from monitoring protein evolution with sufficient molecular diversity between samples collected at consecutive time points (i.e., more than five polymorphic amino acid sites).

      As described in the manuscript, calculating Kullback-Leibler (KL) divergence requires more than one sequence per studied time point. However, most datasets in the literature include only a single sequence per time point, typically a consensus sequence derived from bulk population sequencing. Generating multiple sequences per time point is experimentally more demanding, often requiring advanced methods such as single-virus sequencing or amplification of sublineages in viral subpopulations, as was done for the first dataset used in the study (Arenas, et al. 2016), which enabled the calculation of KL divergence. The extent to which the simulated sequences resemble real evolution is evaluated in the method validation. As noted, intermediate time point validation was performed using the influenza NS1 protein dataset. Although, as the reviewer indicates, thousands of viral sequences are available, these are usually consensus sequences from bulk sequencing. Indeed, many viral variants mainly differ through synonymous mutations, where the number of accumulated nonsynonymous mutations is small. For example, from the original Wuhan strain to the Omicron variant, the SARS-CoV-2 proteins Mpro and PLpro accumulated only 10 and 22 amino acid changes, respectively.
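
      The data requirement described here can be illustrated with a small sketch of a per-site KL divergence between two sets of aligned sequences. It is an illustration under stated assumptions rather than the authors' implementation: the pseudocount, the averaging over sites, and all function names are choices made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def site_frequencies(column, pseudocount=1.0):
    """Amino acid frequencies at one alignment column.

    A positive pseudocount (an assumption of this sketch) keeps every
    frequency above zero so the logarithm in the KL divergence is defined.
    """
    counts = Counter(column)
    n_standard = sum(counts[aa] for aa in AMINO_ACIDS)  # ignore gaps and other symbols
    total = n_standard + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the 20 amino acids."""
    return sum(p[aa] * math.log(p[aa] / q[aa]) for aa in AMINO_ACIDS)


def mean_site_kl(observed, simulated, pseudocount=1.0):
    """Average per-site KL divergence between two aligned sequence samples."""
    lengths = {len(s) for s in observed} | {len(s) for s in simulated}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    obs_columns = list(zip(*observed))
    sim_columns = list(zip(*simulated))
    per_site = [
        kl_divergence(site_frequencies(o, pseudocount), site_frequencies(s, pseudocount))
        for o, s in zip(obs_columns, sim_columns)
    ]
    return sum(per_site) / len(per_site)


if __name__ == "__main__":
    observed = ["MKT", "MKS", "MRT"]    # toy samples, not real data
    simulated = ["MKT", "MKT", "MKT"]
    print(mean_site_kl(observed, simulated))
```

      With one consensus sequence per time point, each column holds a single residue, so the estimated frequencies are set largely by the pseudocount rather than by the data.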

      Analyzing intermediate variants of concern (e.g., Gamma or Delta) would reduce this number, affecting the statistics. In addition, many available viral sequences are not consecutive in evolutionary terms (one dataset does not represent the direct origin of another dataset at a subsequent time point), which further limits their applicability in this study. There is little data from monitored protein evolution with consecutive samples. The most suitable studies usually involve in vitro virus evolution, but the data from these studies often show low genetic variability between samples collected at different time points. Finally, it is important to note that the presented method can only be applied to proteins with known 3D structures, as it relies on selection based on folding stability. Proteins without a known structure cannot be analyzed using this approach. Future work could incorporate additional selection constraints, which may improve the accuracy of predictions. These considerations and limitations are indicated in the manuscript.

      The selection of proteins is narrow and the rationale for including or excluding specific proteins is not clearly justified. 

      The viral proteins included in the study were selected based on two main criteria, general interest and data availability. In particular, we included proteins from viruses that affect humans and for which data from monitored protein evolution, with sufficient molecular diversity between consecutive time points, is available. These aspects are indicated in the manuscript.

      The analyzed datasets are also under-characterized: we are not given insight into how variable the sequences are or how surprising the simulated sequences might be relative to natural diversity. Furthermore, the use of consensus sequences to represent timepoints is problematic, particularly in the context of viral evolution, where divergent subclades often coexist - a consensus sequence may not accurately reflect the underlying population structure. 

      The manuscript indicates the sequence identity among protein datasets of different time points, along with other technical details. Next, the evaluation based on comparisons between simulated and real sequences reflects how surprising the simulated sequences might be relative to natural diversity, considering that the real dataset is representative. We believe that the diverse real datasets studied are useful for evaluating the accuracy of the method in predicting different molecular patterns. Regarding the use of consensus sequences, we agree that they provide an approximation. However, as previously indicated, most of the available data from monitored protein evolution consist of consensus sequences obtained through bulk sequencing. Additionally, analyzing every individual viral sequence within a viral population, which is typically large, would be ideal but computationally intractable.

      The fitness function used in the main simulations is based on absolute ΔG and rewards increased stability without testing whether real evolutionary trajectories tend to maintain, increase, or reduce folding stability over time for the particular systems (proteins) that are studied. While a variant of the model does attempt to center selection around empirical ΔG values, this more biologically plausible version is underutilized and not well validated.

      The applied fitness function, based on absolute ΔG, is well established in the field (Sella and Hirsh 2005; Goldstein 2013). The present study independently predicts ΔG for the real and simulated protein variants at each sampling point. This ΔG prediction accounts not only for negative design, informed by empirical data, but also for positive design based on the study data (Arenas, et al. 2013; Minning, et al. 2013), thereby enabling the detection of variation in folding stability among protein variants. These aspects are indicated in the manuscript. Therefore, in our view, the study provides a proper comparison of real and predicted evolutionary trajectories in terms of folding stability.

      Ultimately, the model constrains sequence evolution to stability-compatible trajectories but does not forecast which of these trajectories are likely to occur. It is better understood as a filter of biophysically plausible outcomes than as a predictive tool. The distinction between constraint-based plausibility and sequence-level forecasting should be made clearer. Despite these limitations, the work may be of interest to researchers developing simulation frameworks or exploring the role of protein stability in viral evolution, and it raises interesting questions about how biophysical constraints shape sequence space over time. 

      The presented method estimates the fitness of each protein variant, which can reflect the relative survival capacity of the variant. Therefore, despite the error due to evolutionary constraints not considered by the method, it indicates which variants are more likely to become fixed over time. In our view, the method does not merely filter plausible variants; rather, it generates predictions of variant survival through predicted fitness based on folding stability and simulations of protein evolution under structurally constrained substitution models integrated with birth-death population genetics approaches. The use of simulation-based approaches for prediction is well established in population genetics. For example, approaches such as approximate Bayesian computation (Beaumont, et al. 2002) rely on this strategy, and it has also been applied in other studies of forecasting evolution (e.g., Neher, et al. 2014). We believe that the distinction between forecasting folding stability and amino acid sequence is clearly shown in the manuscript, including the main text and the figures.

      Reviewer #2 (Recommendations for the authors): 

      I thank the authors for addressing the question about template switching, their clarification was helpful. However, the core concerns I raised remain unresolved: the claim that the method is useful for forecasting is not substantiated.  In order to support the paper's central claims or to prove its usefulness, several key improvements could be incorporated: 

      (1) Systematic analysis of more proteins: 

      The manuscript would be significantly strengthened by a systematic evaluation of model performance across a broader set of viral proteins, beyond the examples currently shown. Many human influenza and SARS-CoV-2 proteins have well-characterized structures or high-quality homology templates, making them suitable candidates. In light of the limited success of the method, presenting the model's behavior across a more comprehensive protein set, including those with varying structural constraints and immune pressures, would help assess generalizability and clarify the specific conditions under which the model is applicable.

      Following a comment from the reviewer in a previous revision of the study, we included the analysis of an influenza NS1 protein dataset that contains two evaluation time points. Next, to validate the prediction method, it is necessary to have monitored protein sequences collected at least at two consecutive time points, with sufficient divergence between them to capture evolutionary signatures that allow for proper evaluation. Additionally, many data involve sequences that are not consecutive in evolutionary terms (one dataset is not a direct ancestor of another dataset existing at a later time point), which precludes their use in this study. Little data from monitored protein evolution with reliable consecutive (ancestor-descendant) samples exist. The most suitable studies often involve in vitro virus evolution, but they usually show low genetic variability between samples collected at different time points. Although thousands of sequences are available for some viruses, they are usually consensus sequences from bulk sequencing and often show a low number of nonsynonymous mutations at the study protein-coding gene between time points. For example, from the original Wuhan strain to the Omicron variant, the SARS-CoV-2 proteins Mpro and PLpro accumulated only 10 and 22 amino acid changes, respectively. Analyzing intermediate variants of concern (e.g., Gamma or Delta) would reduce this number, affecting the statistics. Thus, in practice, we found a scarcity of data derived from monitoring protein evolution, with reliable ancestor and corresponding descendant data at consecutive time points and with sufficient molecular diversity between them (i.e., more than five polymorphic amino acid sites). 
In all, we believe that the diverse viral protein datasets used in the present study, along with the multiple analyzed datasets collected from monitored HIV-1 populations present in different patients, provide a representative application of the method, since similar patterns were generally obtained from the analysis of the different datasets.

      (2) Present clear data statistics: For each analyzed dataset, the authors should provide basic information about the number of unique sequences, levels of variability, and evolutionary divergence between start and end sequences. This would contextualize the forecasting task and clarify whether the simulations are non-trivial. In particular, it should be shown that the consensus sequence is indeed representative of the viral population at a given time point. In viral evolution we frequently observe co-circulation of subclades and the consensus sequence is then not representative. 

      For each dataset analyzed, the manuscript provides the sequence identity between samples at the study time points (which also informs about sequence variability), sample sizes, representative protein structure, and other technical details. The study assumes that consensus sequences, typically generated by bulk sequencing, are representative of the viral population. Next, samples at different time points should involve ancestor-descendant relationships, which is a requirement and one of the limitations to find appropriate data for this study, as noted in our previous response.

      (3) Explore other metrics for population level sequence comparison: 

      In light of the possible existence of subclades, mentioned above, the currently used metrics for sequence comparison may underestimate the performance of the simulations. It would be sufficient to see some overlap between the simulated clades and the observed clades.

      We found this to be a good idea. However, in practice, we believe that the criteria used to define subclades could introduce biases into the results. For some metrics, we evaluated the accuracy of the predictions through direct comparisons between all real and predicted protein variants, using percentages to facilitate interpretation. We believe that using subclades could potentially reduce the current prediction errors, but this would complicate the interpretation of the results, as they would be influenced by the subjective criteria used to define the subclades.

      Currently, the manuscript presents a plausible filtering framework rather than a predictive model. Without these additional analyses, the main claims remain only partially supported. 

      Please see our reply to the comment of the reviewer just before the section titled “Recommendations for the authors”.

      Response to some rebuttal statements: 

      (1) "Sequence comparisons based on the KL divergence require, at the studied time point, an observed distribution of amino acid frequencies among sites and an estimated distribution of amino acid frequencies among sites. In the study datasets, this is only the case for the HIV-1 MA dataset, which belongs to a previous study from one of us and collaborators where we obtained at least 20 independent sequences at each sampling point (Arenas, et al. 2016)" 

      The available influenza and SARS-CoV-2 data gather isolates annotated with exact collection dates, providing rich datasets for such analysis.

      The available influenza and SARS-CoV-2 sequences are typically derived from bulk sequencing and, therefore, they are consensus sequences. As a result, they cannot be used to calculate KL divergence. Additionally, many of the indicated sequences from databases are not demonstrated to be consecutive in evolutionary terms (one dataset is not a direct ancestor of another dataset existing at a later time point), which precludes their use in this study. The most suitable studies often involve in vitro virus evolution, but they usually show low genetic variability between samples collected at different time points.

      (2) "Regarding extending the analysis to other time points (other variants of concern), we kindly disagree because Omicron is the variant of concern with the highest genetic distance to the Wuhan variant, and a high genetic distance is  required to properly evaluate the prediction method." 

      There have been many more variants of concern subsequent to Omicron which circulated in 2021. 

      A key aspect is the accumulation of diversity in the study proteins across different time points. The SARS-CoV-2 proteins Mpro and PLpro accumulated only 10 and 22 amino acid changes from the original Wuhan variant to Omicron, respectively.

      Analyzing intermediate variants of concern (e.g., Gamma or Delta) or those closely related to Omicron would reduce the number of accumulated mutations even further.   

      We want to thank the reviewer again for taking the time to revise our work and for the insightful and helpful comments.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      Ferreiro et al. present a method to simulate protein sequence evolution under a birth-death model where sequence evolution is constrained by structural constraints on protein stability. The authors then use this model to explore the predictability of sequence evolution in several viral structural proteins. In principle, this work is of great interest to molecular evolution and phylodynamics, which have struggled to couple non-neutral models of sequence evolution to phylodynamic models like birth-death. Unfortunately, though, the model shows little improvement over neutral models in predicting protein evolution, and this ultimately appears to be due to fundamental conceptual problems with how fitness is modeled and linked to the phylodynamic birth-death model. 

      AU: We thank the reviewer for the positive comments about our work.

      Regarding predictive power, the study showed good accuracy in predicting the real folding stability of forecasted protein variants under a selection model, but not under a neutral model. Next, predicting the exact sequences was more challenging. In this revised version, where we added additional real data, we found that the accuracy of this prediction can vary among proteins (e.g., the SCS model was more accurate than the neutral model in predicting sequences of the influenza NS1 protein at different time points). Still, we consider that further efforts are required in the field of substitution models of molecular evolution. For example, amino acids with similar physicochemical properties can result in predictions with appropriate folding stability but a different specific sequence. The development of accurate substitution models of molecular evolution is an active area of research with ongoing progress, but further efforts are still needed. Next, forecasting the folding stability of future real proteins is fundamental for properly forecasting protein evolution, given the essential role of folding stability in protein function and its variety of applications. Regarding the conceptual concerns related to fitness modeling, we clarify them in detail in our responses to the specific comments below.

      Major concerns:

      (1) Fitness model: All lineages have the same growth rate r = b-d because the authors assume b+d=1. But under a birth-death model, the growth rate r is equivalent to fitness, so this is essentially assuming all lineages have the same absolute fitness since increases in reproductive fitness (b) will simply trade off with decreases in survival (d). Thus, even if the SCS model constrains sequence evolution, the birth-death model does not really allow for non-neutral evolution such that mutations can feed back and alter the structure of the phylogeny.

      We thank the reviewer for this comment that aims to improve the realism of our model. In the model presented (but see later another model, derived from the proposal of the reviewer, that we have now implemented into the framework and applied it to the study data), the fitness predicted from a protein variant is used to obtain the corresponding birth rate of that variant. In this way, protein variants with high fitness have high birth rates leading to overall more birth events, while protein variants with low fitness have low birth rates resulting in overall more extinction events, which has biological meaning for the study system. The statement “All lineages have the same growth rate r = b-d” in our model is incorrect because, in our model, b and d can vary among lineages according to the fitness. For example, a lineage might have b=0.9, d=0.1, r=0.8, while another lineage could have b=0.6, d=0.4, r=0.2. Indeed, the statement “this is essentially assuming all lineages have the same absolute fitness” is incorrect. Clearly, assuming that all lineages have the same fitness would not make sense, in that situation the folding stability of the forecasted protein variants would be similar under any model, which is not the case as shown in the results. In our model, the fitness affects the reproductive success, where protein variants with a high fitness have higher birth rates leading to more birth events, while those with lower fitness have higher death rates leading to more extinction events. This parameterization is meaningful for protein evolution because the fitness of a protein variant can affect its survival (birth or extinction) without necessarily affecting its rate of evolution. While faster growth rate can sometimes be associated with higher fitness, a variant with high fitness does not necessarily accumulate substitutions at a faster rate. 
Regarding the phylogenetic structure, the model presented considers variable birth and death events across different lineages according to the fitness of the corresponding protein variants, and this affects the derived phylogeny (i.e., protein variants selected against can go extinct while others with high fitness can produce descendants). We are not sure about the meaning of the term “mutations can feed back” in the context of our system. Note that we use Markov models of evolution, which are well-established in the field (despite their limitations), and substitutions are fixed mutations, which could still be reverted later if selected by the substitution model (Yang 2006). Altogether, we find that the presented birth-death model is technically correct and appropriate for modeling our biological system. Its integration with structurally constrained substitution (SCS) models of protein evolution as Markov models follows general approaches of molecular evolution in population genetics (Yang 2006; Carvajal-Rodriguez 2010; Arenas 2012; Hoban, et al. 2012). We have now provided a more detailed description of the models in the manuscript.

      Apart from these clarifications about the birth-death model used, we could understand the point of the reviewer and following the suggestion we have now incorporated an additional birth-death model that accounts for variable global birth-death rate among lineages. Specifically, we followed the model proposed by Neher et al (2014), where the death rate is considered as 1 and the birth rate is modeled as 1 + fitness. In this model, the global birth-death rate can vary among lineages. We implemented this model into the computer framework and applied it to the data used for the evaluation of the models. The results indicated that, in general, this model yields similar predictive accuracy compared to the previous birth-death model. Thus, accounting for variability in the global birth-death rate does not appear to play a major role in the studied systems of protein evolution. We have now presented this additional birth-death model and its results in the manuscript.
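
      To make the arithmetic of the two parameterizations discussed above concrete, the following is a minimal sketch, not the ProteinEvolver2 implementation. It assumes that the birth rate b has already been derived from fitness and lies in [0, 1] for the constant-sum model, and it takes the second model exactly as described in the response (death rate 1, birth rate 1 + fitness). The function names `constant_sum_rates`, `neher_style_rates`, and `summarize` are hypothetical.

```python
def constant_sum_rates(b):
    """Constant-sum model from the response: d = 1 - b, so b + d = 1.

    Assumption: b was already derived from fitness and lies in [0, 1].
    """
    if not 0.0 <= b <= 1.0:
        raise ValueError("birth rate must lie in [0, 1]")
    return b, 1.0 - b


def neher_style_rates(fitness):
    """Second model as described in the response: d = 1 and b = 1 + fitness."""
    return 1.0 + fitness, 1.0


def summarize(b, d):
    """Return the total event rate (b + d) and the net growth rate (b - d)."""
    return {"b": b, "d": d, "total": b + d, "growth": b - d}


if __name__ == "__main__":
    # The two example lineages quoted in the response (b=0.9 and b=0.6).
    for b in (0.9, 0.6):
        print(summarize(*constant_sum_rates(b)))
    # Two illustrative (made-up) fitness values under the second model.
    for fitness in (0.1, 0.5):
        print(summarize(*neher_style_rates(fitness)))
```

      In this sketch the constant-sum model keeps b + d fixed at 1 while b - d = 2b - 1 changes with b, whereas under the second model both b + d and b - d change with fitness.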

      (2) Predictive performance: Similar performance in predicting amino acid frequencies is observed under both the SCS model and the neutral model. I suspect that this rather disappointing result owes to the fact that the absolute fitness of different viral variants could not actually change during the simulations (see comment #1). 

      As indicated in our previous answer, our study shows a good accuracy in predicting the real folding stability of forecasted protein variants under a selection model, but not under a neutral model. Next, predicting the exact sequences was more challenging, which was not surprising considering previous studies. In particular, inferring specific sequences is considerably challenging even for ancestral molecular reconstruction (Arenas, et al. 2017; Arenas and Bastolla 2020). Indeed, observed sequence diversity is much greater than observed structural diversity (Illergard, et al. 2009; Pascual-Garcia, et al. 2010), and substitutions between amino acids with similar physicochemical properties can yield modeled protein variants with more accurate folding stability, even when the exact amino acid sequences differ. As indicated, further work is demanded in the field of substitution models of molecular evolution. Next, in this revised version, where we included analyses of additional real datasets, we found that the accuracy of sequence prediction can vary among datasets. Notably, the analysis of an influenza NS1 protein dataset, with higher diversity than the other datasets studied, showed that the SCS model was more accurate than the neutral model in predicting sequences across different time points. Datasets with relatively high sequence diversity can contain more evolutionary information, which can improve prediction quality. In any case, as previously indicated, we believe that efforts are required in the field of substitution models of molecular evolution. Apart from that, forecasting the folding stability of future real proteins is an important advance in forecasting protein evolution, given the essential role of folding stability in protein function (Scheiblhofer, et al. 2017; Bloom and Neher 2023) and its variety of applications.

      Next, also as indicated in our previous response, the birth-death model used in this study accounts for variation in fitness among lineages producing variable reproductive success. The additional birth-death model that we have now incorporated, which considers variation of the global birth-death rate among lineages, produced similar prediction accuracy, suggesting a limited role in protein evolution modeling. Molecular evolution parameters, particularly the substitution model, appear to be more critical in this regard. We have now included these aspects in the manuscript.

      (3) Model assessment: It would be interesting to know how much the predictions were informed by the structurally constrained sequence evolution model versus the birth-death model. To explore this, the authors could consider three different models: 1) neutral, 2) SCS, and 3) SCS + BD. Simulations under the SCS model could be performed by simulating molecular evolution along just one hypothetical lineage. Seeing if the SCS + BD model improves over the SCS model alone would be another way of testing whether mutations could actually impact the evolutionary dynamics of lineages in the phylogeny. 

      In the present study, we compared the neutral model + birth-death (BD) with the SCS model + BD. Markov substitution models Q are applied over an evolutionary time (i.e., branch length, t), which allows one to determine the probability of substitution events during that time period [P(t) = exp (Qt)]. This approach is traditionally used in phylogenetics to model the incorporation of substitution events over time. Therefore, to compare the neutral and SCS models in terms of evolutionary inference, an evolutionary time is required, and in this case it is provided by the birth-death process. Thus, cases 1) and 2) cannot be compared without an underlying evolutionary history. Next, comparisons in terms of likelihood, and other aspects, between models that ignore the protein structure and the implemented SCS models are already available in previous studies based on coalescent simulations or given phylogenetic trees (Arenas, et al. 2013; Arenas, et al. 2015). There, SCS models outperformed models that ignore evolutionary constraints from the protein structure, and those findings are consistent with the results obtained in the present study where we explored the application of these models to forecasting protein evolution. We would like to emphasize that forecasting the folding stability of future real proteins is a significant finding, because folding stability is fundamental to protein function and has a variety of applications. We have now indicated these aspects in the manuscript.
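
      As an illustration of the relation P(t) = exp(Qt) mentioned above, the following is a minimal sketch that computes transition probabilities for a toy two-state rate matrix. The two-state Q is an assumption chosen so the result can be checked against a closed form; it is not the 20-state amino acid matrix of an SCS model, and the function names `mat_mul` and `transition_matrix` are hypothetical. The matrix exponential is approximated with a truncated Taylor series combined with scaling and squaring.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by scaling and squaring.

    exp(Q t) = (exp(Q t / 2**s)) ** (2**s); the inner exponential is
    evaluated with a truncated Taylor series.
    """
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # term holds A**k / k! after this step
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


if __name__ == "__main__":
    # Toy symmetric two-state generator; each row sums to zero.
    Q = [[-1.0, 1.0], [1.0, -1.0]]
    t = 0.5  # branch length
    P = transition_matrix(Q, t)
    # Closed form for this Q: P00(t) = 0.5 + 0.5 * exp(-2 t)
    print(P[0][0], 0.5 + 0.5 * math.exp(-2.0 * t))
```

      The branch length t is the quantity that, in the response, is supplied by the birth-death process; without it there is no P(t) to evaluate.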

      (4) Background fitness effects: The model ignores background genetic variation in fitness. I think this is particularly important as the fitness effects of mutations in any one protein may be overshadowed by the fitness effects of mutations elsewhere in the genome. The model also ignores background changes in fitness due to the environment, but I acknowledge that might be beyond the scope of the current work. 

      AU: This comment made us realize that more information about the features of the implemented SCS models should be included in the manuscript. In particular, the implemented SCS models consider a negative design based on the observed residue contacts in nearly all proteins available in the Protein Data Bank (Arenas, et al. 2013; Arenas, et al. 2015). This data is distributed with the framework, and it can be updated to incorporate new structures (further details are provided in the distributed framework documentation and practical examples). Therefore, the prediction of folding stability is a combination of positive design (direct analysis of the target protein) and negative design (consideration of background proteins from a database to improve the predictions), thus incorporating background molecular diversity. We have now indicated this important aspect in the manuscript. Regarding the fitness caused by the environment, we agree with the reviewer. This is a challenge for any method aiming to forecast evolution, as future environmental shifts are inherently unpredictable and may affect the accuracy of the predictions. Although one might attempt to incorporate such effects into the model, doing so risks overparameterization, especially when the additional factors are uncertain or speculative. We have now mentioned this aspect in the manuscript.

      (5) In contrast to the model explored here, recent work on multi-type birth-death processes has considered models where lineages have type-specific birth and/or death rates and therefore also type-specific growth rates and fitness (Stadler and Bonhoeffer, 2013; Kunhert et al., 2017; Barido-Sottani, 2023). Rasmussen & Stadler (eLife, 2019) even consider a multi-type birth-death model where the fitness effects of multiple mutations in a protein or viral genome collectively determine the overall fitness of a lineage. The key difference with this work presented here is that these models allow lineages to have different growth rates and fitness, so these models truly allow for non-neutral evolutionary dynamics. It would appear the authors might need to adopt a similar approach to successfully predict protein evolution. 

      We agree with the reviewer that robust birth-death models have been developed applying statistics and, in many cases, the primary aim of those studies is the development and refinement of the model itself. Regarding the study by Rasmussen and Stadler 2019, it incorporates an external evaluation of mutation events where the used fitness is specific for the proteins investigated in that study, which may pose challenges for users interested in analyzing other proteins. In contrast, our study takes a different approach. We implement a fitness function that can be predicted and evaluated for any type of structural protein (Goldstein 2013), making it broadly applicable. Actually, in this revised version we added the analysis of additional data of another protein (influenza NS1 protein) with predictions at different time points. In addition, we provide a freely available and well-documented computational framework to facilitate its use. The primary aim of our study is not the development of novel or complex birth-death models. Rather, we aim to explore the integration of a standard birth-death model with SCS models for the purpose of predicting protein evolution. In the context of protein evolution, substitution models are a critical factor (Liberles, et al. 2012; Wilke 2012; Bordner and Mittelmann 2013; Echave, et al. 2016; Arenas, et al. 2017; Echave and Wilke 2017), and the presented combination with a birth-death model constitutes a first approximation upon which next studies can build to better understand this evolutionary system. We have now indicated these considerations in the manuscript.

      Reviewer #2 (Public review): 

      Summary: 

      In this study, "Forecasting protein evolution by integrating birth-death population models with structurally constrained substitution models", David Ferreiro and coauthors present a forward-in-time evolutionary simulation framework that integrates a birth-death population model with a fitness function based on protein folding stability. By incorporating structurally constrained substitution models and estimating fitness from ΔG values using homology-modeled structures, the authors aim to capture biophysically realistic evolutionary dynamics. The approach is implemented in a new version of their open-source software, ProteinEvolver2, and is applied to four viral proteins from HIV-1 and SARS-CoV-2. 

      Overall, the study presents a compelling rationale for using folding stability as a constraint in evolutionary simulations and offers a novel framework and software to explore such dynamics. While the results are promising, particularly for predicting biophysical properties, the current analysis provides only partial evidence for true evolutionary forecasting, especially at the sequence level. The work offers a meaningful conceptual advance and a useful simulation tool, and sets the stage for more extensive validation in future studies.

      We thank the reviewer for the positive comments on our study. Regarding the predictive power, the results showed good accuracy in predicting the folding stability of the forecasted protein variants. In this revised version, where we included analyses of additional real datasets, we found that the accuracy of sequence prediction can vary among datasets. Notably, the analysis of an influenza NS1 protein dataset, with higher diversity than the other datasets studied, showed that the SCS model was more accurate than the neutral model in predicting sequences across different time points. Datasets with relatively high sequence diversity can contain more evolutionary information, which can improve prediction quality. Still, we believe that further efforts are required in the field in improving the accuracy of substitution models of molecular evolution. Altogether, accurately forecasting the folding stability of future real proteins is fundamental for predicting their protein function and enabling a variety of applications. Also, we implemented the models into a freely available computer framework, with detailed documentation and a variety of practical examples.

      Strengths: 

      The results demonstrate that fitness constraints based on protein stability can prevent the emergence of unrealistic, destabilized variants - a limitation of traditional, neutral substitution models. In particular, the predicted folding stabilities of simulated protein variants closely match those observed in real variants, suggesting that the model captures relevant biophysical constraints. 

      We agree with the reviewer and appreciate the consideration that forecasting the folding stability of future real proteins is a relevant finding. For instance, folding stability is fundamental for protein function and affects several other molecular properties.

      Weaknesses: 

      The predictive scope of the method remains limited. While the model effectively preserves folding stability, its ability to forecast specific sequence content is not well supported. 

      Our study showed a good accuracy in predicting the real folding stability of forecasted protein variants under a selection model, but not under a neutral model. Predicting the exact sequences was more challenging, which was not surprising considering previous studies. In particular, inferring specific sequences is considerably challenging even for ancestral molecular reconstruction (Arenas, et al. 2017; Arenas and Bastolla 2020). Indeed, observed sequence diversity is much greater than observed structural diversity (Illergard, et al. 2009; Pascual-Garcia, et al. 2010), and substitutions between amino acids with similar physicochemical properties can yield modeled protein variants with more accurate folding stability, even when the exact amino acid sequences differ. As indicated, further work is demanded in the field of substitution models of molecular evolution. Next, in this revised version, where we included analyses of additional real datasets, we found that the accuracy of sequence prediction can vary among datasets. Notably, the analysis of an influenza NS1 protein dataset, with higher diversity than the other datasets studied, showed that the SCS model was more accurate than the neutral model in predicting sequences across different time points. Datasets with relatively high sequence diversity can contain more evolutionary information, which can improve prediction quality. In any case, as previously indicated, we believe that efforts are required in the field of substitution models of molecular evolution. Apart from that, forecasting the folding stability of future real proteins is an important advance in forecasting protein evolution, given the essential role of folding stability in protein function (Scheiblhofer, et al. 2017; Bloom and Neher 2023) and its variety of applications. We have now expanded these aspects in the manuscript.

      Only one dataset (HIV-1 MA) is evaluated for sequence-level divergence using KL divergence; this analysis is absent for the other proteins. The authors use a consensus Omicron sequence as a representative endpoint for SARS-CoV-2, which overlooks the rich longitudinal sequence data available from GISAID. The use of just one consensus from a single time point is not fully justified, given the extensive temporal and geographical sampling available. Extending the analysis to include multiple timepoints, particularly for SARS-CoV-2, would strengthen the predictive claims. Similarly, applying the model to other well-sampled viral proteins, such as those from influenza or RSV, would broaden its relevance and test its generalizability. 

      The evaluation of forecasting evolution using real datasets is complex due to several conceptual and practical aspects. In contrast to traditional phylogenetic reconstruction of past evolutionary events and ancestral sequences, forecasting evolution often begins with a variant that is evolved forward in time and requires a rough fitness landscape to select among possible future variants (Lässig, et al. 2017). Another concern for validating the method is the need to know the initial variant that gives rise to the corresponding future (forecasted) variants, and it is not always known. Thus, we investigated systems where the initial variant, or a close approximation, is known, such as scenarios of in vitro monitored evolution. In the case of SARS-CoV-2, the Wuhan variant is commonly used as the starting variant of the pandemic. Next, since forecasting evolution is highly dependent on the used model of evolution, unexpected external factors can be dramatic for the predictions. For this reason, systems with minimal external influences provide a more controlled context for evaluating forecasting evolution. For instance, scenarios of in vitro monitored virus evolution avoid some external factors such as host immune responses. Another important aspect is the availability of data at two (i.e., present and future) or more time points along the evolutionary trajectory, with sufficient genetic diversity between them to identify clear evolutionary signatures. Additionally, using consensus sequences can help mitigate effects from unfixed mutations, which should not be modeled by a substitution model of evolution. Altogether, not all datasets are appropriate to properly evaluate or apply forecasting evolution. These aspects are indicated in the manuscript. Sequence comparisons based on the KL divergence require, at the studied time point, an observed distribution of amino acid frequencies among sites and an estimated distribution of amino acid frequencies among sites. 
In the study datasets, this is only the case for the HIV-1 MA dataset, which belongs to a previous study from one of us and collaborators where we obtained at least 20 independent sequences at each sampling point (Arenas, et al. 2016). This aspect is now more clearly indicated in the manuscript. Regarding the Omicron datasets, we used 384 curated sequences of the Omicron variant of concern to construct the study data and we believe that it is a representative sample. The sequence used for the initial time point was the Wuhan variant (Wu, et al. 2020), which is commonly assumed to be the origin of the pandemic in SARS-CoV-2 studies. As previously indicated, the use of consensus sequences is convenient to avoid variants with unfixed mutations. Regarding extending the analysis to other time points (other variants of concern), we kindly disagree because Omicron is the variant of concern with the highest genetic distance to the Wuhan variant, and a high genetic distance is required to properly evaluate the prediction method. Actually, we noted that earlier variants of concern show a small number of fixed mutations in the study proteins, despite the availability of large numbers of sequences in databases such as GISAID. Additionally, we investigated the evolutionary trajectories of HIV-1 protease (PR) in 12 intra-host viral populations with predictions for up to four different time points. Apart from those aspects, following the proposal of the reviewer, we have now incorporated the analysis of an additional dataset of influenza NS1 protein (Bao, et al. 2008), with predictions for two different time points, to further assess the generalizability of the method. We have now included details of this influenza NS1 protein dataset and the predictions derived from it in the manuscript.
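
      The requirement stated above, an observed and an estimated distribution of amino acid frequencies at each site, can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: the direction KL(observed || predicted), the pseudocount used to avoid division by zero, and the function names `site_frequencies`, `kl_divergence`, and `per_site_kl` are all assumptions, and the sequences in the usage example are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def site_frequencies(column, pseudocount=1e-3):
    """Amino acid frequencies at one alignment column.

    The pseudocount (an assumption of this sketch) keeps every frequency
    positive; characters outside the 20 amino acids (e.g. gaps) are ignored.
    """
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return [(counts[aa] + pseudocount) / total for aa in AMINO_ACIDS]


def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def per_site_kl(observed, predicted):
    """KL divergence at each site between two sets of aligned sequences."""
    if len({len(s) for s in list(observed) + list(predicted)}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    length = len(observed[0])
    return [
        kl_divergence(
            site_frequencies([s[i] for s in observed]),
            site_frequencies([s[i] for s in predicted]),
        )
        for i in range(length)
    ]


if __name__ == "__main__":
    observed = ["MKV", "MKV", "MRV"]   # made-up observed sample
    predicted = ["MKV", "MRV", "MRI"]  # made-up forecasted sample
    print(per_site_kl(observed, predicted))
```

      A single consensus sequence yields only one residue per site, which is why a sample of independent sequences at the studied time point is needed for this comparison.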

      It would also be informative to include a retrospective analysis of the evolution of protein stability along known historical trajectories. This would allow the authors to assess whether folding stability is indeed preserved in real-world evolution, as assumed in their model.

      Our present study does not aim to investigate the evolution of the folding stability over time, although it provides this information indirectly at the studied time points. Instead, the present study shows that the folding stability of the forecasted protein variants is similar to the folding stability of the corresponding real protein variants for diverse viral proteins, which provides an important evaluation of the prediction method. Next, the folding stability can indeed vary over time in both real and modeled evolutionary scenarios, and our present study is not in conflict with this. In that regard, which is not the aim of our present study, some previous phylogenetic-based studies have reported temporal fluctuations in folding stability for diverse protein data (Arenas, et al. 2017; Olabode, et al. 2017; Arenas and Bastolla 2020; Ferreiro, et al. 2022).

      Finally, a discussion on the impact of structural templates - and whether the fixed template remains valid across divergent sequences - would be valuable. Addressing the possibility of structural remodeling or template switching during evolution would improve confidence in the model's applicability to more divergent evolutionary scenarios.

      This is an important point. For the datasets that required homology modeling (in several cases it was not necessary because the sequence was present in a protein structure of the PDB), the structural templates were selected using SWISS-MODEL, and we applied the best-fitting template. We have now included in a supplementary table details about the fitting of the structural templates. Indeed, our proposal assumes that the protein structure is maintained over the studied evolutionary time, which can be generally reasonable for short timescales where the structure is conserved (Illergard, et al. 2009; Pascual-Garcia, et al. 2010). Over longer evolutionary timescales, structural changes may occur and, in such cases, modeling the evolution of the protein structure would be necessary. To our knowledge, modeling the evolution of the protein structure remains a challenging task that requires substantial methodological developments. Recent advances in artificial intelligence, particularly in protein structure prediction from sequence, may offer promising tools for addressing this challenge. However, we believe that evaluating such approaches in the context of structural evolution would be difficult, especially given the limited availability of real data with known evolutionary trajectories involving structural change. In any case, this is probably an important direction for future research. We have now included this discussion in the manuscript.

      Reviewer #1 (Recommendations for the authors): 

      (1) Abstract: "expectedly, the errors grew up in the prediction of the corresponding sequences" <- Not entirely clear what is meant by "errors grew up" or what the errors grew with.

      This sentence refers to the accuracy of sequence prediction in comparison to that of folding stability prediction. We have now clarified this aspect in the manuscript.

      (2) Lines 162-165: "Alternatively, if the fitness is determined based on the similarity in folding stability between the modeled variant and a real variant, the birth rate is assumed to be 1 minus the root mean square deviation (RMSD) in folding stability." <- What is the biological motivation for using the RMSD? It seems like a more stable variant would always have higher fitness, at least according to Equation 1.

      RMSD is commonly used in molecular biology to compare proteins in terms of structural distance, folding stability, kinetics, and other properties. It offers advantages such as minimizing the influence of small deviations while amplifying larger differences, thereby enhancing the detection of remarkable molecular changes. Additionally, RMSD would facilitate the incorporation of other biophysical parameters, such as structural divergences from a wild-type variant or entropy, which could be informative for fitness in future versions of the method. We have now included this consideration in the manuscript.

      (3) Lines 165-166: "In both cases, the death rate (d) is considered as 1-b to allow a constant global (birth-death) rate" <- This would give a constant R = b / (1-b) over the entire phylogenetic tree. For applications to pathogens like viruses with epidemic dynamics, this is extremely implausible. Is there any need to make such a restrictive assumption? 

      Regarding technical considerations of the model, we refer to our answer to the first public review comment. Next, a constant global rate of evolution was observed in numerous genes and proteins of diverse organisms, including viruses (Gojobori, et al. 1990; Leitner and Albert 1999; Shankarappa, et al. 1999; Liu, et al. 2004; Lu, et al. 2018; Zhou, et al. 2019). However, following the comment of the reviewer, and as we indicated in our answer to the first public review comment, we have now implemented and evaluated an additional birth-death model that allows for variation in the global birth-death rate among lineages. We have implemented this additional model in the framework and described it along with its results in the manuscript.

      (4) Lines 187-188: "As a consequence, since b+d=1 at each node, tn is consistent across all nodes, according to (Harmon, 2019)." <- This would also imply that all lineages have a growth rate r = b - d, which under a birth-death model is equivalent to saying all lineages have the same fitness! 

      We clarified this aspect in our answer to the first public review comment. In particular, in the model presented, protein variants with higher fitness have higher birth rates, leading to more birth events, while protein variants with lower fitness have lower birth rates, leading to more extinction events, which has biological meaning for the study system. In our model b and d can vary among lineages according to the corresponding fitness (i.e., a lineage may have b=0.9, d=0.1, r=0.8; while another one may have b=0.6, d=0.4, r=0.2). Since the reproductive success varies among lineages in our model, the statement “this is essentially assuming all lineages have the same absolute fitness” is incorrect, although it could be interpreted like that in certain models. Fitness affects reproductive success, but fitness and rate of evolution are different biological concepts (although a faster growth rate can sometimes be associated with higher fitness, a variant with high fitness does not necessarily accumulate substitutions at a higher rate). An example in molecular adaptation studies is the traditional nonsynonymous to synonymous substitution rates ratio (dN/dS), where dN/dS (which informs about selection derived from fitness) can be constant at different rates of evolution (dN and dS). In any case, we thank the reviewer for raising this point, which led us to incorporate an additional birth-death model and inspired some ideas. Thus, following the comment of the reviewer and as indicated in the answer to the first public review comment, we have now implemented and evaluated an additional birth-death model that allows for variation in the global birth-death rate among lineages. The results indicated that this model yields similar predictive accuracy compared to the previous birth-death model. We have now included these aspects, along with the results from the additional model, in the manuscript.

      (5) Line 321-322: "For the case of neutral evolution, all protein variants equally fit and are allowed, leading to only birth events," <- Why would there only be birth events? Lineages can die regardless of their fitness. 

      AU: In the neutral evolution model, all protein variants have the same fitness, resulting in a flat fitness landscape. Since variants are observed, we allowed birth events. However, we assumed the absence of death events because no information independent of fitness is available to support their inclusion and quantification, thereby avoiding the imposition of arbitrary death events based on an arbitrary death rate. We have now provided a justification of this assumption in the manuscript.

      Reviewer #2 (Recommendations for the authors): 

      (1) Clarify the purpose of the alternative fitness mode ("ΔG similarity to a target variant"): 

      The manuscript briefly introduces an alternative fitness function based on the similarity of a simulated protein's folding stability to that of a real protein variant, but does not provide a clear motivation, usage scenario, or results derived from it. 

      The presented model provides two approaches for deriving fitness from predicted folding stability. The simpler approach assumes that a more stable protein variant has higher fitness than a less stable one. The alternative approach assigns high fitness to protein variants whose stability closely matches observed stability, acknowledging that the real observed stability is derived from the real selection process, and this approach considers negative design by contrasting the prediction with real information. For the analyses of real data in this study, we used the second approach, guided by these considerations. We have now clarified this aspect in the manuscript.

      (2) Report structural template quality and modeling confidence: 

      Since folding stability (ΔG) estimates rely on structural models derived from homology templates, the accuracy of these predictions will be sensitive to the choice and quality of the template structure. I recommend that the authors report, for each protein modeled, the template's sequence identity, coverage, and modeling quality scores. This will help readers assess the confidence in the ΔG estimates and interpret how template quality might impact simulation outcomes. 

      We agree with the reviewer and we have now included additional information in a supplementary table regarding sequence identity, modeling quality and coverage of the structural templates for the proteins that required homology modeling. The selection of templates was performed using the well-established framework SWISS-MODEL and the best-fitting template was chosen. In addition, a large number of protein structures are available in the PDB for the study proteins, which favors the accuracy of the homology modeling. For some datasets, homology modeling was not required, as the modeled sequence was already present in an available protein structure. We have now included this information in the manuscript and in a supplementary table.

      (3) Clarify whether structural remodeling occurs during simulation: 

      It appears that folding stability (ΔG) for all simulated protein variants is computed by mapping them onto a single initial homology model, without remodeling the structure as sequences evolve. If correct, this should be clearly stated, as it assumes that the structural fold remains valid across all simulated variants. A discussion on the potential impact of structural drift would be welcome.

      We agree with the reviewer. As indicated in our answer to a previous comment, our method assumes that the protein structure is maintained over the studied evolutionary time, which is generally acceptable for short timescales where the structure is conserved (Illergard, et al. 2009; Pascual-Garcia, et al. 2010). At longer timescales the protein structure could change, requiring the modeling of structural evolution over the evolutionary time. To our knowledge, modeling the evolution of the protein structure remains a challenging task that requires substantial methodological developments. Recent advances in artificial intelligence, particularly in protein structure prediction from sequence, can be promising tools for addressing this challenge. However, we believe that evaluating such approaches in the context of structural evolution would be difficult, especially given the limited availability of real datasets with known evolutionary trajectories involving structural change. In any case, this is probably an important direction for future research. We have now included this discussion in the manuscript.

    1. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

      Learn more at Review Commons


      Referee #1

      Evidence, reproducibility and clarity

      In this study, the authors develop a complete integral drive system in Anopheles gambiae malaria mosquitoes. This type of gene drive is interesting, with special advantages and disadvantages compared to more common designs. Here, the authors develop the Cas9 element and combine it with a previously developed antimalaria effector element. The new element performs very well in terms of drive efficiency, but it has unintended fitness costs, and a higher than desirable rate of functional resistance allele formation. Nevertheless, this study represents a very good step forward toward developing effective gene drives and is thus of high impact.

      The format of the manuscript is a bit suboptimal for review. Please add line numbers next time for easy reference. It would also help to have spaces between paragraphs and to have figures (with legends) added to the text where they first appear.

      It might be useful to add subsections to the results, just like in the methods section. It could even be expanded a bit with some specific parts from the discussion, though this is optional.

      Abstract: The text says: "As a minimal genetic modification, nanosd does not induce widespread transcriptomic perturbations." However, it does seem to change things based on Figure 3c.

      Page 2: "drive technologies for public health and pest control applications" needs a period afterward.

      Page 2: "The fitness costs, homing efficiency, and resistance rate of the gene drive is" should be "The fitness costs, homing efficiency, and resistance rate of the gene drive are".

      Page 2: "When they target important mosquito genes, gene drives are designed to ensure that the nuclease activity window (germline) does not overlap with that of the target gene (somatic)." is not quite correct. This is, of course, sensible for suppression drives, but it's not a necessary requirement for modification drives with rescue elements in many situations.

      Page 2: "recessive somatic fitness cost phenotypes" is unclear. I think that you are trying to avoid the recessive fitness cost of null alleles becoming a dominant fitness cost from a gene drive allele (in drive-wild-type heterozygotes).

      Page 2: "This optimization approach has had only limited success, and suboptimal performance is commonly attributed to not capturing all the regulatory elements specific to the germline gene's expression9,12". I don't think this is correct. There are several examples where a new promoter helped a lot. The zpg promoter in Anopheles gambiae allowed success at the dsx site in suppression cage studies (Kyrou et al 2018), and nanos gave a big improvement to modification drives at the cardinal locus (Carballar-Lejarazú et al 2020). In flies, several promoters were tested, and one allowed success in cage experiments (Du et al 2024). In Aedes, the shu promoter allowed for high drive performance (Anderson et al 2023), though this last one hasn't been tested in more difficult situations. I think you could certainly argue in the general case that not all promoters will work the way their transcriptome says, but there are many examples where they seem to be pretty good.

      Page 2: "make it more likely that mutations that disrupt the drive components are selected against though loss of function of the host gene." I think that this needs a bit more explanation. You are referring to mutations in regulatory elements or frameshift mutations. This will make it more resistant to mutation. Also, these mutations would tend to have a minor effect, except perhaps in the cargo gene of a modification drive. By using a cargo gene in an integral drive, you could still keep it somewhat safer, but whether this is 1.2x or 10x safer is unclear.

      Page 3: "they can incur severe unintended fitness costs". This is central to integral drives and this manuscript. It's worth elaborating on.

      Page 3: "Regulatory elements from germline genes that have worked sub-optimally in traditional gene drive designs for the reasons outlined above may work well in an IDG design20." This is setting up the integral drive with nanos, but nanos DOES work well in traditional Anopheles gambiae gene drive designs. It is possible that you might end up with less somatic expression than Hammond et al 2020 (though the comparison is unclear due to batch effects in that study), but there is no direct comparison in this manuscript to that.

      Page 3: "This suggests an impact of maternal deposition on drive efficiency only in female drive carriers." This is quite strange. Usually, I would expect to see an equal reduction in efficiency between male and female progeny. Could this be due to limited sample size? Random idea: It's also possible that almost all maternal deposition was mosaic and wouldn't be enough to directly affect drive conversion. However, it could cause enough of a fitness cost TOGETHER with new drive expression in females that perhaps only tissues with randomly low expression rates properly developed and led to progeny, reducing drive inheritance? Another possibility: Could the drive/resistance males have impaired fertility, so these ones are underrepresented in the batch cross? If nanos is needed in males and a single drive copy is not quite enough for good fertility or mating competitiveness, they may be underrepresented in your crosses. They might have worse fertility than drive homozygous males, which at least have two partially working copies of nanos rather than just one (in many cells, at least). Maybe check the testis for abnormal phenotypes?

      Overall, it would be favorable if the drive allele was somewhat more fit than a nonfunctional resistance allele. This could already be achieved in this drive, but it doesn't get much mention.

      Page 3: There should be a comma after, "Interestingly, while many of the observed mutations were predicted to abolish nanos expression" and "This could indicate that in these experiments".

      Page 3 last sentence: Please improve the clarity.

      Removing the EGFP is supposed to restore the fitness, and this was helpful in some previous integral drive constructs. This could get a bit more mention (it is possible that I missed this somewhere in the manuscript).

      Page 4: The MM-CP line and its association with the integral drive strategy could get a little more introduction. Maybe even a supplemental figure showing the general idea.

      Page 5: "cassette is predicted to disrupt the CP function entirely (Fig. 5d)" also lacks a period.

      Page 5: "The subsequent stabilization of the nanosd frequency and the lack of rapid loss suggests that any associated fitness cost is primarily recessive." This is not quite correct because by this point, drive/wild-type heterozygotes are rare, and this is where you'd find a potential dominant fitness cost. It should be correct in the end stages where it is a mix of drive and functional/nonfunctional resistance alleles (though the nonfunctional resistance alleles may cause greater fitness costs when together with a drive - see above).

      Page 6: "Maternal deposition of Cas9, or Cas9;gRNA, into the zygote can lead to cutting at stages when homing is not favoured, and has been commonly observed for canonical Anopheles nanos drives9,10,35." Reference 35 (which is more suitable for referencing an example of nanos in other Anopheles) found some resistance alleles by deep sequencing, but the timing that they formed was unclear (it's not certain if it was maternal deposition). This study may be a more suitable reference: Carballar-Lejarazú R, Tushar T, Pham TB, James AA. Cas9-mediated maternal-effect and derived resistance alleles in a gene-drive strain of the African malaria vector mosquito, Anopheles gambiae. Genetics, 2022.

      Page 8: "could further reduce the likelihood of resistance allele formation by increasing the frequency of HDR events." Multiple gRNAs would mostly help by reducing functional resistance allele formation, especially since drive conversion is already very high in Anopheles.

      Page 8, last paragraph: This conclusion is perhaps a little optimistic considering the functional resistance alleles, which should get a little more attention in the summary or elsewhere in the discussion section.

      Figure 1d: The vertical text saying "Non-WT" is confusing. The circles themselves show + and -. Also, "-" isn't necessarily a knockout allele, so I'm not sure if - is the best symbol for resistance.

      Figure 2e: The vertical scale is not the most intuitive. Consider rearranging it to "Transition from larvae to pupae" starting at zero and going to 1 when all the larvae become pupae.

      Figure 2e-f: For both of these, there are clear differences between males and females. Thus, when comparing drive homozygotes to wild-type, it would probably be better to have separate statistical comparisons for males and females.

      Figure 3: Can any of these transcription results in individual genes potentially explain the observed fitness cost?

      Figure 3b: The scale here also doesn't quite make sense. It's the fraction of underdeveloped ovaries, but the graph is also perhaps trying to show whether just 1-2 ovaries are present, or maybe how many ovaries are undeveloped, but then it would say "zero"? This should be clarified. Number of ovaries and how well-developed they are is separate (it can be put on the same graph, but needs to be more clear).

      Figure 4f: The vertical axis should say "ONNV."

      Figure 5c-d: These should be labeled as the most common resistance allele. Also, I'm not sure how relevant it is, but we also found an alternate start codon here: Hou S, Chen J, Feng R, Xu X, Liang N, Champer J. A homing rescue gene drive with multiplexed gRNAs reaches high frequency in cage populations but generates functional resistance. J Genet Genomics, 2024. Maybe this is a more common problem than one would expect?

      Figure 5cd,S4,S5: They have a bit of a weird plot. Why not make four line graphs for each? Also, some alleles use the  symbol. + is wild-type, which is well understood, but - as resistance is not always clear, and seeing them together may confuse readers. Additionally, the fact that you have the most common resistance allele in Figure 5cd might mean that you know more about the genotype? If so, it would be best to separate wild-type and resistance alleles in whatever the final figure looks like.

      Some supplemental raw data files would be useful if they were available, but the figures are thorough enough that this isn't essential.

      Review by:

      Jackson Champer, with major assistance from Ruobing Feng (essentially section B) and Jie Du

      Referee cross-commenting

      We don't have any cross-comments, other than supporting the idea of slightly more comparisons to the authors' previous construct.

      Significance

      • Describe the nature and significance of the advance (e.g. conceptual, technical, clinical) for the field.

      A key innovation of the nanosd gene drive is its integral gene drive (IGD) design, which inserts the drive cassette directly into the A. gambiae nanos gene, incorporating only the minimal components necessary for drive function. The drive achieves high transmission rates, without causing widespread disruption of gene expression or increasing susceptibility to malaria parasites, and imposes an acceptable fitness cost, primarily a reduction in female fecundity when homozygous. The strong performance of nanosd can be attributed to its design: Cas9 is expressed in the correct cells and with the correct timing to induce efficient homing, effectively hijacking the nanos gene's natural expression profile. However, despite the careful design aimed at preserving nanos function, the rescue was incomplete: homozygous female drive carriers exhibited a clear reduction in ovarian function.

      In caged population trials, both the drive and a co-introduced anti-malaria effector gene reached high frequencies, even in the presence of emerging resistance alleles. Because the drive is inserted into an essential gene, nonfunctional resistance alleles are selected against and tend to be purged over time. Nonetheless, functional resistance remains a concern. The use of a single, though precisely positioned gRNA targeting the native nanos gene ATG site increases the likelihood of generating functional resistance alleles. Over the long term, if the drive imposes fitness costs, it may be outcompeted by such functional resistance alleles, potentially undermining the goal of sustained population modification.

      Overall, this study represents a notable advance in Anopheles mosquito gene drive development and can be considered high impact.

      • Place the work in the context of the existing literature (provide references, where appropriate).

      Previous IGD efforts in Drosophila, mice and mosquitoes have demonstrated nearly super‐Mendelian inheritance but often at the expense of host fitness. For example, Nash et al. built an intronic‐gRNA Cas9 drive at the D. melanogaster rcd-1r locus that propagated efficiently through cage populations (Nash et al., 2022), and Gonzalez et al. reported that a Cas9 drive inserted at the germline zpg locus in Anopheles stephensi biased inheritance by ~99.8% (Gonzalez et al., 2025). However, these strong drives disrupted essential genes: in A. gambiae, inserting Cas9 into zpg produced efficient homing but rendered females largely sterile (Ellis et al., 2022). A similar germline Cas9 knock-in in Mus musculus enabled gene conversion in both sexes, albeit with only modest efficiency and potential fitness trade-offs (Weitzel et al., 2021). The current nanosd IGD is explicitly designed to overcome this limitation by selecting a more permissive gene target and using a minimal drive cassette, so as to preserve mosquito viability while maintaining robust drive efficiency, although still with reduced fertility in female drive homozygotes.

      This nanosd gene drive, like previous homing drives in Anopheles, is capable of achieving a high level of inheritance bias. It uses the endogenous nanos regulatory elements to drive Cas9 expression, which have less leaky somatic expression than the vasa (Gantz et al., 2015; Hammond et al., 2016; Hammond et al., 2017) or zpg promoters (Hammond et al., 2021; Kyrou et al., 2018), and it thereby reduces female sterility induced by somatic expression. Even so, the incomplete rescue of nanos function still leads to reduced female fertility in drive homozygotes.

      • State what audience might be interested in and influenced by the reported findings.

      It's worth noting the broad audience that will find this work relevant. Gene drive developers and molecular geneticists will be impressed by the good drive performance and directly influenced by the design principles showcased here. The study's integral gene drive architecture that leverages the endogenous nanos regulatory elements, in-frame E2A peptide linkage for co-expression, and intronic insertion of gRNA and selectable markers addresses long-standing challenges in promoter leakage, somatic fitness costs, and resistance allele evolution. What's more, vector biologists and malaria researchers will be interested in the successful deployment of a gene drive in A. gambiae that actually carries a disease-blocking trait.

      • Define your field of expertise with a few keywords to help the authors contextualize your point of view. Indicate if there are any parts of the paper that you do not have sufficient expertise to evaluate.

      We have worked on CRISPR gene drive development in both fruit flies and Anopheles mosquitoes and have experience with modeling their spread.

      References

      Ellis, D.A., Avraam, G., Hoermann, A., Wyer, C.A.S., Ong, Y.X., Christophides, G.K., and Windbichler, N. (2022). Testing non-autonomous antimalarial gene drive effectors using self-eliminating drivers in the African mosquito vector Anopheles gambiae. PLOS Genetics 18, e1010244-e1010244.

      Gantz, V.M., Jasinskiene, N., Tatarenkova, O., Fazekas, A., Macias, V.M., Bier, E., and James, A.A. (2015). Highly efficient Cas9-mediated gene drive for population modification of the malaria vector mosquito Anopheles stephensi. Proc Natl Acad Sci U S A 112, E6736-E6743.

      Gonzalez, E., Anderson, M.A.E., Ang, J.X.D., Nevard, K., Shackleford, L., Larrosa-Godall, M., Leftwich, P.T., and Alphey, L. (2025). Optimization of SgRNA expression with RNA pol III regulatory elements in Anopheles stephensi. Scientific Reports 15, 13408.

      Hammond, A., Galizi, R., Kyrou, K., Simoni, A., Siniscalchi, C., Katsanos, D., Gribble, M., Baker, D., Marois, E., Russell, S., et al. (2016). A CRISPR-Cas9 gene drive system targeting female reproduction in the malaria mosquito vector Anopheles gambiae. Nat Biotechnol 34, 78-83.

      Hammond, A., Karlsson, X., Morianou, I., Kyrou, K., Beaghton, A., Gribble, M., Kranjc, N., Galizi, R., Burt, A., Crisanti, A., et al. (2021). Regulating the expression of gene drives is key to increasing their invasive potential and the mitigation of resistance. PLOS Genetics 17, e1009321-e1009321.

      Hammond, A.M., Kyrou, K., Bruttini, M., North, A., Galizi, R., Karlsson, X., Kranjc, N., Carpi, F.M., D'Aurizio, R., Crisanti, A., et al. (2017). The creation and selection of mutations resistant to a gene drive over multiple generations in the malaria mosquito. PLOS Genetics 13, e1007039-e1007039.

      Kyrou, K., Hammond, A.M., Galizi, R., Kranjc, N., Burt, A., Beaghton, A.K., Nolan, T., and Crisanti, A. (2018). A CRISPR-Cas9 gene drive targeting doublesex causes complete population suppression in caged Anopheles gambiae mosquitoes. Nature Biotechnology 36, 1062-1066.

      Nash, A., Capriotti, P., Hoermann, A., Papathanos, P.A., and Windbichler, N. (2022). Intronic gRNAs for the construction of minimal gene drive systems. Frontiers in Bioengineering and Biotechnology 0, 570-570.

      Weitzel, A.J., Grunwald, H.A., Ceri, W., Levina, R., Gantz, V.M., Hedrick, S.M., Bier, E., and Cooper, K.L. (2021). Meiotic Cas9 expression mediates gene conversion in the male and female mouse germline. Plos Biol 19, e3001478-e3001478.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      I applaud the authors for providing a thorough response to my comments from the first round of review. The authors have addressed the points I raised on the interpretation of the behavioral results as well as the validation of the model (fit to the data) by conducting new analyses, acknowledging the limitations where required and providing important counterpoints. As a result of this process, the manuscript has considerably improved. I have no further comments and recommend this manuscript for publication.

      We are pleased that our revisions have addressed all the concerns raised by Reviewer #1.

      Reviewer #2 (Public review):

      Summary:

      This manuscript proposes that the use of a latent cause model for assessment of memory-based tasks may provide improved early detection in Alzheimer's Disease as well as more differentiated mapping of behavior to underlying causes. To test the validity of this model, the authors use a previously described knock-in mouse model of AD and subject the mice to several behaviors to determine whether the latent cause model may provide informative predictions regarding changes in the observed behaviors. They include a well-established fear learning paradigm in which distinct memories are believed to compete for control of behavior. More specifically, it's been observed that animals undergoing fear learning and subsequent fear extinction develop two separate memories for the acquisition phase and the extinction phase, such that the extinction does not simply 'erase' the previously acquired memory. Many models of learning require the addition of a separate context or state to be added during the extinction phase and are typically modeled by assuming the existence of a new state at the time of extinction. The Niv research group, Gershman et al. 2017, have shown that the use of a latent cause model applied to this behavior can elegantly predict the formation of latent states based on a Bayesian approach, and that these latent states can facilitate the persistence of the acquisition and extinction memory independently. The authors of this manuscript leverage this approach to test whether deficits in production of the internal states, or the inference and learning of those states, may be disrupted in knock-in mice that show both a build-up of amyloid-beta plaques and a deterioration in memory as the mice age.

      Strengths:

      I think the authors' proposal to leverage the latent cause model and test whether it can lead to improved assessments in an animal model of AD is a promising approach for bridging the gap between clinical and basic research. The authors use a promising mouse model and apply this to a paradigm in which the behavior and neurobiology are relatively well understood - an ideal situation for assessing how a disease state may impact both the neurobiology and behavior. The latent cause model has the potential to better connect observed behavior to underlying causes and may pave a road for improved mapping of changes in behavior to neurobiological mechanisms in diseases such as AD.

      The authors also compare the latent cause model to the Rescorla-Wagner model and a latent state model allowing for better assessment of the latent cause model as a strong model for assessing reinstatement.

      Weaknesses:

      I have several substantial concerns which I've detailed below. These include important details on how the behavior was analyzed, how the model was used to assess the behavior, and the interpretations that have been made based on the model.

      (1) There is substantial data to suggest that during fear learning in mice separate memories develop for the acquisition and extinction phases, with the acquisition memory becoming more strongly retrieved during spontaneous recovery and reinstatement. The Gershman paper, cited by the authors, shows how the latent cause model can predict this shift in latent causes by allowing for the priors to decay over time, thereby increasing the posterior of the acquisition memory at the time of spontaneous recovery. In this manuscript, the authors suggest a similar mechanism of action for reinstatement, yet the model does not appear to return to the acquisition memory after reinstatement, at least based on the simulation and examples shown in figures 1 and 3. More specifically, in figure 1, the authors indicate that the posterior probability of the latent cause, zA (the putative acquisition memory), increases, partially leading to reinstatement. This does not appear to be the case as test 3 (day 36) appears to have similar posterior probabilities for zA as well as similar weights for the CS as compared to the last days of extinction. Rather, the model appears to mainly modify the weights in the most recent latent cause, zB - the putative 'extinction state', during reinstatement. The authors suggest that previous experimental data have indicated that spontaneous recovery or reinstatement effects are due to an interaction of the acquisition and extinction memory. These studies have shown that conditioned responding at a later time point after extinction is likely due to a balance between the acquisition memory and the extinction memory, and that this balance can shift towards the acquisition memory naturally during spontaneous recovery, or through artificial activation of the acquisition memory or inhibition of the extinction memory (see Lacagnina et al. for example). 
Here the authors show that the same latent cause learned during extinction, zB, appears to dominate during the learning phase of reinstatement, with rapid learning to the context - the weight for the context goes up substantially on day 35 - in zB. This latent cause, zB, dominates at the reinstatement test, and due to the increased associative strength between the context and shock, there is a strong CR. For the simulation shown in figure 1, it's not clear why a latent cause model is necessary for this behavior. This leads to the next point.

      We would like to first clarify that our behavioral paradigm did not last for 36 days, as the reviewer's comment implies. Our reinstatement paradigm contained 7 phases and 36 trials in total: acquisition (3 trials), test 1 (1 trial), extinction 1 (19 trials), extinction 2 (10 trials), test 2 (1 trial), unsignaled shock (1 trial), test 3 (1 trial). The day is labeled under each phase in Figure 2A.
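
      The trial numbering used in this response (test 2 as trial 34, unsignaled shock as trial 35, test 3 as trial 36) follows directly from the phase lengths listed above. The sketch below only restates that schedule; the names `PHASES` and `phase_of_trial` are hypothetical and are not taken from the authors' code.

```python
from itertools import accumulate

# Phase names and trial counts as listed in the response.
PHASES = [
    ("acquisition", 3),
    ("test 1", 1),
    ("extinction 1", 19),
    ("extinction 2", 10),
    ("test 2", 1),
    ("unsignaled shock", 1),
    ("test 3", 1),
]

TOTAL_TRIALS = sum(count for _, count in PHASES)


def phase_of_trial(trial: int) -> str:
    """Return the phase containing a 1-indexed trial number."""
    if not 1 <= trial <= TOTAL_TRIALS:
        raise ValueError("trial out of range")
    # Cumulative last-trial index of each phase: 3, 4, 23, 33, 34, 35, 36.
    last_trials = accumulate(count for _, count in PHASES)
    for (name, _), last in zip(PHASES, last_trials):
        if trial <= last:
            return name
    raise AssertionError("unreachable")
```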

      We have provided explanations on how the reinstatement is explained by the latent cause model in the first round of the review. Briefly, both acquisition and extinction latent causes contribute to the reinstatement (test 3). The former retains the acquisition fear memory, and the latter has the updated w<sub>context</sub> from unsignaled shock. Although the reviewer is correct that the zB in Figure 1D makes a great contribution during the reinstatement, we would like to argue that the elevated CR from test 2 (trial 34) to test 3 (trial 36) is the result of the interaction between zA and zB.

      We provided Author response image 1 using the same data in Figure 1D and 1E to further clarify this point. The posterior probability of zA increased after an unsignaled shock (trial 35), which may be attributed to the return of acquisition fear memory. The posterior probability of zA then decreased again after test 3 (trial 36) because there was no shock in this trial. Along with the weight change, the expected shock changes substantially in these three trials, resulting in reinstatement. Note that the mapping of expected shock to CR in the latent cause model is controlled by the parameters θ and λ. Once the expected shock exceeds the threshold θ, the CR will increase rapidly if λ is smaller.
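
      The role of θ and λ can be made concrete with a minimal sketch. It assumes the mapping is a cumulative normal (probit) function of the expected shock, centered on θ with scale λ, which is the form the reviewer refers to in comment (2B). The exact expression should be checked against Gershman et al. (2017), and the function name `conditioned_response` is hypothetical.

```python
import math


def conditioned_response(expected_shock: float, theta: float, lam: float) -> float:
    """Map expected shock V to a CR in [0, 1].

    Assumed form: CR = Phi((V - theta) / lam), where Phi is the standard
    normal CDF. theta acts as a threshold and lam as a slope parameter.
    """
    if lam <= 0.0:
        raise ValueError("lam must be positive")
    z = (expected_shock - theta) / lam
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

      Under this assumed form, the CR equals 0.5 when the expected shock equals θ, and a smaller λ makes the transition around θ steeper, so a small increase in expected shock above the threshold produces a large increase in CR.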

      Lastly, accepting the idea that separate memories are responsible for acquisition and extinction in the memory modification paradigm, the latent cause model (LCM) is a rational candidate for modeling this idea. Please see the following reply on why a simple model like the Rescorla-Wagner (RW) model is not sufficient to fully explain the behaviors observed in this study.

      Author response image 1.

      The sum posterior probability (A), the sum of associative weight of CS (B), and the sum of associative weight of context (C) of acquisition and extinction latent causes in Figure 1D and 1E.

      (2) The authors compared the latent cause model to the Rescorla-Wagner model. This is very commendable, particularly since the latent cause model builds upon the RW model, so it can serve as an ideal test for whether a more simplified model can adequately predict the behavior. The authors show that the RW model cannot successfully predict the increased CR during reinstatement (Appendix figure 1). Yet there are some issues with the way the authors have implemented this comparison:

      (2A) The RW model is a simplified version of the latent cause model and so should be treated as a nested model when testing, or at a minimum, the number of parameters should be taken into account when comparing the models using a method such as the Bayesian Information Criterion, BIC.

      We acknowledge that the number of parameters was not taken into consideration when we compared the models. We thank the reviewer for the suggestion to use the Bayesian Information Criterion (BIC). However, we did not use BIC in this study for the following reasons. We wanted a model that can explain fear conditioning, extinction and reinstatement, so our first priority was to fit the test phases. Models that simulate CRs well in non-test phases can yield lower BIC values even if they fail to capture reinstatement. When we calculated the BIC using the half-normal distribution (μ = 0, σ = 0.3) as the likelihood for the prediction error in each trial, the BIC of the 12-month-old control was -37.21 for the RW model (Appendix 1–figure 1C) and -11.60 for the LCM (Figure 3C). Based on this result, the RW model would be preferred, yet the LCM was penalized by the number of parameters, even though it fit better in trial 36. Because we did not think this aligned with our purpose to model reinstatement, we chose to rely on practical criteria to determine whether an estimated parameter set was accepted for our purpose (see Materials and Methods). The number of accepted samples can thus roughly be seen as the model's ability to explain the data in this study. These exclusion criteria then created imbalances in accepted samples across models (Appendix 1–figure 2). In the RW model, only one or two samples met the criteria, preventing meaningful statistical comparisons of BIC within each group. Overall, although we agree that BIC is a reasonable metric for model comparison, we do not think it aligns with our purpose in this study.
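
      The BIC calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it uses the standard definition BIC = k ln(n) - 2 ln(L), treats the absolute prediction error of each trial as an independent draw from a half-normal distribution with σ = 0.3, and leaves the number of free parameters `n_params` as an input because the response does not state the values used. The function names are hypothetical.

```python
import math


def half_normal_logpdf(error: float, sigma: float = 0.3) -> float:
    """Log density of a half-normal distribution at |error|."""
    x = abs(error)
    return 0.5 * math.log(2.0 / math.pi) - math.log(sigma) - x * x / (2.0 * sigma * sigma)


def bic(errors: list[float], n_params: int, sigma: float = 0.3) -> float:
    """BIC = k * ln(n) - 2 * ln(L), with n the number of trials."""
    log_likelihood = sum(half_normal_logpdf(e, sigma) for e in errors)
    return n_params * math.log(len(errors)) - 2.0 * log_likelihood
```

      Because the half-normal density exceeds 1 for small errors when σ = 0.3, the log-likelihood can be positive, which is why negative BIC values can arise with this likelihood. For identical errors, each additional parameter raises the BIC by ln(n), so a model with more parameters has to reduce the prediction errors enough to offset that penalty.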

      (2B) The RW model provides the associative strength between stimuli and does not necessarily require a linear relationship between V and the CR. This is the case in the original RW model as well as in the LCM. To allow for better comparison between the models, the authors should be modeling the CR in the same manner (using the same probit function) in both models. In fact, there are many instances in which a sigmoid has been applied to RW associative strengths to predict CRs. I would recommend modeling CRs in the RW as if there is just one latent cause. Or perhaps run the analysis for the LCM with just one latent cause - this would effectively reduce the LCM to RW and keep any other assumptions identical across the models.

      Regarding the suggestion to run the analysis using the LCM with one latent cause, we agree that this method is almost identical to the RW model, which is also mentioned in the original paper (Gershman et al., 2017). Importantly, it would also eliminate the RW model’s advantage of assigning distinct learning rates to different stimuli, highlighted in the next comment (2C).

      We thank the reviewer for suggesting that we apply the same transformation from associative strength (V) to CR as in the LCM. We examined this possibility by heuristically selecting parameter values to test how such a transformation would influence the RW model (Author response image 2A). Specifically, we set α<sub>CS</sub> = 0.5, α<sub>context</sub> = 1, β = 1, and introduced the additional parameters θ and λ, as in the LCM. This parameter set was chosen heuristically to address the reviewer’s concern about a higher learning rate for the context. The dark blue line is the plain associative strength. The remaining lines are CR curves under different combinations of θ and λ.

      Consistent with the reviewer’s comment, under certain parameter settings (θ = 0.01, λ = 0.01), the extended RW model can reproduce higher CRs at test 3, thereby approximating the discrimination index observed in the 12-month-old control group. However, this modification changes the characteristics of CRs in other phases relative to the plain RW model. In the acquisition phase, the CRs rise more sharply. In the extinction phase, the CRs remain high when θ is small. Although changing λ can modulate the steepness, the CR curve is flat on the second day of the extinction phase, which does not reproduce the pattern in the observed data (Figure 2B). These trade-offs suggest that the RW model with the sigmoid transformation does not improve fit quality and, in fact, sacrifices features that were well captured by the simpler RW simulations (Appendix 1–figure 1A to 1D). To further evaluate this extended RW model (RW*), we applied the same parameter estimation method used in the LCM for individual data (see Materials and Methods). For each animal, α<sub>CS</sub>, α<sub>context</sub>, β, θ, and λ were estimated with their lower and upper bounds set as previously described (see Appendix 1, Materials and Methods). The results showed that the number of accepted samples slightly increased compared to the RW model without sigmoidal transformation of CR (RW* vs. RW in Author response image 2B, 2C). However, this improvement did not surpass the LCM (RW* vs. LCM in Author response image 2B, Author response image 1C). Overall, these results suggest that even when the same method is used to map the expected shock to CR, the RW model does not outperform the LCM. In practice, further extensions, such as adding novel terms, might improve the fit. We would like to note that such extensions should be carefully validated as reasonable and necessary for an internal model, which is beyond the scope of this study.
We hope this addresses the reviewer's concerns about the implementation of the RW model.

      Author response image 2.

      Simulation (A) and parameter estimation (B and C) in the extended Rescorla-Wagner model.
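
      To make the role of θ and λ concrete, the mapping from associative strength to CR can be sketched as below. This is a minimal illustration, assuming the probit form CR = Φ((V − θ)/λ) with λ treated as a scale parameter. The exact parameterization in the LCM code may differ, and the function names and V values are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cr_from_v(v, theta, lam):
    """Assumed probit mapping from associative strength V to CR."""
    return normal_cdf((v - theta) / lam)

# Hypothetical associative strengths, from fully extinguished to well learned.
v_values = [0.0, 0.03, 0.2, 0.8]

# First pair matches the setting quoted in the response; second is illustrative.
for theta, lam in [(0.01, 0.01), (0.3, 0.2)]:
    crs = [round(cr_from_v(v, theta, lam), 3) for v in v_values]
    print(f"theta={theta}, lambda={lam}: {crs}")
```

      Under this assumed form, with θ = λ = 0.01 even a small residual V maps to a CR close to 1, which is in line with the observation that CRs remain high during extinction when θ is small.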

      (2C) In the paper, the model fits for the alphas in the RW model are the same across the groups. Were the alphas for the two models kept as free variables? This is an important question as it gets back to the first point raised. Because the modeling of the reinstatement behavior with the LCM appears to be mainly driven by latent cause zB, the extinction memory, it may be possible to replicate the pattern of results without requiring a latent cause model. For example, the 12-month-old App NL-G-F mice may have a deficit in learning about the context. Within the RW model, if the alpha for context is set to zero for those mice, but kept higher for the other groups, say alpha_context = 0.8, the authors could potentially observe the same pattern of discrimination indices in figure 2G and 2H at test. Because the authors don't explicitly state which parameters might be driving the change in the DI, the authors should show in some way that their results cannot simply be due to poor contextual learning in the 12-month-old App NL-G-F mice, as this can presumably be predicted by the RW model. The authors' model fits using RW don't show this, but this is because they don't consider the possibility that the alpha for context might be disrupted in the 12-month-old App NL-G-F mice. Of course, using the RW model with these alphas won't fit the behavior across acquisition, extinction, and reinstatement as well as the authors' LCM, but the number of parameters is substantially reduced in the RW model. Yet the important pattern of the DI would be replicated with the RW model (if I'm not mistaken), which is the important test for assessment of reinstatement.

      We would like to clarify that we estimated three parameters in the RW model for each individual: α<sub>CS</sub>, α<sub>context</sub>, and β. Even so, many samples did not satisfy our criteria (Appendix 1–figure 2). Please refer to “Evaluation of model fit” in Appendix 1 and the legend of Appendix 1–figure 1A to 1D, where we report the estimated parameter values.

      We do not agree that disabling contextual learning by setting α<sub>context</sub> to 0 in the RW model can explain the CR curve of 12-month-old AD mice well. Specifically, the RW model cannot capture the between-day extinction dynamics (i.e., the increase in CR at the beginning of day 2 extinction) or the higher CR at test 3 relative to test 2 (i.e., a DI between test 3 and test 2 greater than 0.5). In addition, because the context input (= 0.2) was lower than the CS input (= 1), and there is only a single unsignaled shock trial, even setting α<sub>context</sub> = 1 results in only a limited increase in CR (Appendix 1–figure 1A to 1D; see also Author response image 2). Thus, the RW model cannot replicate the reinstatement effect or the critical pattern of the discrimination index, even under conditions of stronger contextual learning.
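
      The limited effect of a single unsignaled shock can be checked with a short calculation. The sketch below assumes the update w<sub>i</sub> ← w<sub>i</sub> + α<sub>i</sub> β (r − V) x<sub>i</sub> with V = Σ w<sub>i</sub> x<sub>i</sub>, and assumes for simplicity that both weights are zero after extinction. These are illustrative assumptions rather than the code used for the reported fits; only the input values (context = 0.2, CS = 1) and α<sub>context</sub> = 1 come from the text above.

```python
def rw_step(w, x, alpha, beta, r):
    """One Rescorla-Wagner trial; returns updated weights and the prediction V."""
    v = sum(wi * xi for wi, xi in zip(w, x))   # expected shock before the update
    delta = r - v                              # prediction error
    new_w = [wi + ai * beta * delta * xi for wi, ai, xi in zip(w, alpha, x)]
    return new_w, v

alpha = [0.5, 1.0]  # [alpha_CS, alpha_context]; context learning rate at its maximum
beta = 1.0
w = [0.0, 0.0]      # assumed weights after extinction (hypothetical)

# Unsignaled shock: CS absent (input 0), context input 0.2, shock delivered (r = 1).
w, _ = rw_step(w, x=[0.0, 0.2], alpha=alpha, beta=beta, r=1.0)

# Test trial: CS (input 1) presented in the context (input 0.2), no shock.
_, v_test = rw_step(w, x=[1.0, 0.2], alpha=alpha, beta=beta, r=0.0)
print(f"context weight after shock: {w[1]:.2f}, expected shock at test: {v_test:.2f}")
```

      Under these assumptions the context weight rises to 0.2, and the expected shock at test is only 0.2 × 0.2 = 0.04, because the context enters the prediction with an input of 0.2.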

      (3) As stated by the authors in the introduction, the advantage of the fear learning approach is that the memory is modified across the acquisition-extinction-reinstatement phases. Although perhaps not explicitly stated by the authors, the post-reinstatement test (test 3) is the crucial test for whether there is reactivation of a previously stored memory, with the general argument being that the reinvigorated response to the CS can't simply be explained by relearning the CS-US pairing, because re-exposure to the US alone leads to an increased response to the CS at test. Of course, there are several explanations for why this may occur, particularly when also considering the context as a stimulus. This is what I understood to be the justification for the use of a model, such as the latent cause model, that may better capture and compare these possibilities within a single framework. As such, it is critical to look at the level of responding to both the context alone and to the CS. It appears that the authors only look at the percent freezing during the CS, and it is not clear whether this is due to the contextual-US learning during the US re-exposure or to increased responding to the CS - presumably caused by reactivation of the acquisition memory. The authors do perform a comparison between the preCS and CS period, but it is not clear whether this is taken into account in the LCM. For example, the instance of the model shown in figure 1 indicates that the 'extinction cause', or cause z6, develops a strong weight for the context during the reinstatement phase of presenting the shock alone. This state then leads to increased freezing during the final CS probe test as shown in the figure. If they haven't already, I think the authors must somehow incorporate these different phases (CS vs ITI) into their model, particularly since this type of memory retrieval that depends on assessing latent states is specifically why the authors justified using the latent cause model.
In more precise terms, it's not clear whether the authors incorporate a preCS/ITI period each day the cue is presented as a vector of just the context, in addition to the CS period in which the vector contains both the context and the CS. Based on the description, it seemed to me that they only model the CRs during the CS period on days when the CS is presented, and thereby the context is only ever modeled on its own (as just the context by itself in the vector) on extinction days when the CS is not presented. If they are modeling both timepoints each day that the CS is presented, then I would recommend explicitly stating this in the methods section.

      In this study, we did not model the preCS freezing rate, and we thank the reviewer for the suggestion to model preCS periods as separate context-only trials. In our view, however, this approach is not consistent with the assumptions of the LCM. Our rationale is that the available periods of context and the CS are different. We assume that observation of the context lasts from preCS to CS. If we simulate both preCS (context) and CS (context and tone), the weight of context would be updated twice. Instead, we follow the same method as described in the original code from Gershman et al. (2017) to consider the context effect. We agree that explicitly modeling preCS could provide additional insights, but we believe it would require modifying or extending the LCM. We consider this an important direction for future research, but it is outside the scope of this study.

      (4) The authors fit the model using all data points across acquisition and learning. As one of the other reviewers has highlighted, it appears that there is a high chance for overfitting the data with the LCM. Of course, this would result in much better fits than models with substantially fewer free parameters, such as the RW model. As mentioned above, the authors should use a method that takes into account the number of parameters, such as the BIC.

      Please refer to the reply to public review (2A) for the reason we did not take the suggestion to use BIC. In addition, we feel that we have adequately addressed the concern of overfitting in the first round of the review. 

      (5) The authors have stated that they do not think the Barnes maze task can be modeled with the LCM. Whether or not this is the case, if the authors do not model this data with the LCM, the Barnes maze data doesn't appear valuable to the main hypothesis. The authors suggest that more sophisticated models such as the LCM may be beneficial for early detection of diseases such as Alzheimer's, so the Barnes maze data is not valuable for providing evidence of this hypothesis. Rather, the authors make an argument that the memory deficits in the Barnes maze mimic the reinstatement effects providing support that memory is disrupted similarly in these mice. Although, the authors state that the deficits in memory retrieval are similar across the two tasks, the authors are not explicit as to the precise deficits in memory retrieval in the reinstatement task - it's a combination of overgeneralizing latent causes during acquisition, poor learning rate, over differentiation of the stimuli.

      We would like to clarify that we value the latent cause model not solely because it is more sophisticated and fits more data points, but because it is an internal model that speaks to the underlying cognitive process. Please also see the reply to the recommendations to authors (3) for the reason why we did not take the suggestion to remove these data.

      Reviewer #3 (Public review):

      Summary:

      This paper seeks to identify underlying mechanisms contributing to memory deficits observed in Alzheimer's disease (AD) mouse models. By understanding these mechanisms, they hope to uncover insights into subtle cognitive changes early in AD to inform interventions for early-stage decline.

      Strengths:

      The paper provides a comprehensive exploration of memory deficits in an AD mouse model, covering early and late stages of the disease. The experimental design was robust, confirming age-dependent increases in Aβ plaque accumulation in the AD model mice and using multiple behavior tasks that collectively highlighted difficulties in maintaining multiple competing memory cues, with deficits most pronounced in older mice.

      In the fear acquisition, extinction, and reinstatement task, AD model mice exhibited a significantly higher fear response after acquisition compared to controls, as well as a greater drop in fear response during reinstatement. These findings suggest that AD mice struggle to retain the fear memory associated with the conditioned stimulus, with the group differences being more pronounced in the older mice.

      In the reversal Barnes maze task, the AD model mice displayed a tendency to explore the maze perimeter rather than the two potential target holes, indicating a failure to integrate multiple memory cues into their strategy. This contrasted with the control mice, which used the more confirmatory strategy of focusing on the two target holes. Despite this, the AD mice were quicker to reach the target hole, suggesting that their impairments were specific to memory retrieval rather than basic task performance.

      The authors strengthened their findings by analyzing their data with a leading computational model, which describes how animals balance competing memories. They found that AD mice showed somewhat of a contradiction: a tendency to both treat trials as more alike than they are (lower α) and similar stimuli as more distinct than they are (lower σx) compared to controls.

      Weaknesses:

      While conceptually solid, the model struggles to fit the data and to support the key hypothesis about AD mice's inability to retain competing memories. These issues are evident in Figure 3:

      (1) The model misses trends in the data, including the gradual learning of fear in all groups during acquisition, the absence of a fear response at the start of the experiment, and the faster return of fear during reinstatement compared to the gradual learning of fear during acquisition. It also underestimates the increase in fear at the start of day 2 of extinction, particularly in controls.

      (2) The model explains the higher fear response in controls during reinstatement largely through a stronger association to the context formed during the unsignaled shock phase, rather than to any memory of the conditioned stimulus from acquisition (as seen in Figure 3C). In the experiment, however, this memory does seem to be important for explaining the higher fear response in controls during reinstatement (as seen in Author Response Figure 3). The model does show a necessary condition for memory retrieval, which is that controls rely more on the latent causes from acquisition. But this alone is not sufficient, since the associations within that cause may have been overwritten during extinction. The Rescorla-Wagner model illustrates this point: it too uses the latent cause from acquisition (as it only ever uses a single cause across phases) but does not retain the original stimulus-shock memory, updating and overwriting it continuously. Similarly, the latent cause model may reuse a cause from acquisition without preserving its original stimulus-shock association.

      These issues lead to potential overinterpretation of the model parameters. The differences in α and σx are being used to make claims about cognitive processes (e.g., overgeneralization vs. over differentiation), but the model itself does not appear to capture these processes accurately.

      The authors could benefit from a model that better matches the data and captures the retention and retrieval of fear memories across phases. While they explored alternatives, including the Rescorla-Wagner model and a latent state model, these showed no meaningful improvement in fit. This highlights a broader issue: these models are well-motivated but may not fully capture observed behavior.

      Conclusion:

      Overall, the data support the authors' hypothesis that AD model mice struggle to retain competing memories, with the effect becoming more pronounced with age. While I believe the right computational model could highlight these differences, the current models fall short in doing so.

      We thank the reviewer for the insightful comments. For the comments (1) and (2), please refer to our previous author response to comments #26 and #27. We recognize that the models tested in this study have limitations and, as noted, do not fully capture all aspects of the observed behavioral data. We see this as an important direction for future research and value the reviewer’s suggestions.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      I have maintained some of the main concerns included in the first round of reviews as I think they remain concerns with the new draft, even though the authors have included substantially more analysis of their data, which is appreciated. I particularly found the inclusion of the comparative modeling valuable, although I think the analysis comparing the models should be improved.

      (1) This relates to point 1 in the public assessment or #16 in the response to reviewers from the authors. The authors raise the point that even a low posterior can drive behavioral expression (lines 361-365 in the response to authors), and so the acquisition latent cause may partially drive reinstatement. Yet in the simulation shown in figure 1D, this does not seem to be the case. As I mentioned in the public response, in figure 1, the posteriors for zA are similar on day 34 and day 36, yet only on day 36 is there a strong CR. At least in this example, it does not appear that zA contributes to the increased responding from day 34 (test 2) to day 36 (test 3). There may be a slight increase in z1 in figure 3C, but the dominant change from day 34 to day 36 appears to be the increase in the posterior of z3 and the substantial increase in w3. The authors then cite several papers which have shown the shift in balance between what is putatively the acquisition memory and the extinction memory (i.e. Lacagnina et al.). Yet I do not see how this modeling fits with most of the previous findings. For example, in the Lacagnina et al. paper, activation of the acquisition ensemble or inhibition of the extinction ensemble drives freezing, whereas the opposite pattern reduces freezing. What appears to be the pattern in the modeling in this paper is primarily learning of context in the extinction latent cause to predict the shock. As I mention in point 2C of the public review, it's not clear why this pattern of results would require a latent cause model. Would a high alpha for context and not the CS not give a similar pattern of results in the RW model? At least for giving similar results of the DIs in figure 2?

      First, we would like to clarify that the x-axis in Figure 1D is labeled “Trial,” not “Day.” Please refer to the reply to public review (1), where we clarified the posterior probability of the latent cause from trials 34 to 36. Second, although we did not have direct neural circuit evidence in this study, we discussed the similarities between previous findings and the modeling in the first review. Briefly, our main point focuses on the interaction between acquisition and extinction memory. In other words, responses at different times arise from distinct internal states made up of competing memories. We assume that the reviewer expects a modeling result showing nearly full recovery of acquisition memory, which aligns with previous findings where optogenetic activation of the acquisition engram can partially mimic reinstatement (Zaki et al., 2022; see also the response to comment #12 in the first round of review). We acknowledge that such a modeling result cannot be achieved with the latent cause model and see it as a potential future direction for model improvement.

      Please also refer to the reply to public review (2) about how a high alpha for context in the RW model cannot explain the pattern we observed in the reinstatement paradigm.

      (2) This is related to point 3 in the public comments and #13 in the response to reviewers. I raised the question of comparing the preCS/ITI period with the CS period, but my main point was why not include these periods in the LCM itself, as mentioned in more detail in point 3 in the current public review. The inclusion of the comparisons the authors performed helped, but my main point was that the authors could have a better measure of wcontext if they included the preCS period as a stimulus each day (when only the context is included in the stimulus). This would provide better estimates of wcontext. As stated in the public review, perhaps the authors did this, but from my understanding of the methods this was not the case; rather, it seems the authors only included the CS period for CRs within the model (at least on days when the CS was present).

      Please refer to the reply to public review (3) about the reason why we did not model the preCS freezing rate.

      (3) This relates to point 4 in the public review and #15 and #24 in the response to authors. The authors have several points for why the two experiments are similar and how results may be extrapolated - lines 725-733. The first point is that associative learning is fundamental in spatial learning. I'm not sure that this broad connection between the two studies is particularly insightful for why one supports the other as associative learning is putatively involved in most behavioral tasks. In the second point about reversals, why not then use a reversal paradigm that would be easier to model with LCM? This data is certainly valuable and interesting, yet I don't think it's helpful for this paper to state qualitatively the similarities in the potential ways a latent cause framework might predict behavior on the Barnes maze. I would recommend that the authors either model the behavior with LCM, remove the experiment from the paper, or change the framing of the paper that LCM might be an ideal approach for early detection of dementia or Alzheimer's disease.

      We would like to clarify that our aim was not to present the LCM as an ideal tool for early detection of AD symptoms. Rather, our focus is on the broader idea of utilizing internal models and estimating individual internal states in early-stage AD. Regarding using a reversal paradigm that would be easier to model with LCM, the most straightforward approach is to use another type of paradigm for fear conditioning, then to examine the extent to which similar behavioral characteristics are observed between paradigms within subjects. However, re-exposing the same mice to such paradigms is constrained by strong carry-over effects, limiting the feasibility of this experiment. Other behavioral tasks relevant to AD that avoid shock generally involve action selection for subsequent observation (Webster et al., 2014), which falls outside the structure of LCM. Our rationale for including the Barnes maze task is that spatial memory deficit is implicated in the early stage of AD, making it relevant for translational research. While we acknowledge that exact modeling of Barnes maze behavior would require a more sophisticated model (as discussed in the first round of review), our intention to use the reversal Barnes maze paradigm is to suggest a presumable memory modification learning in a non-fear conditioning paradigm. We also discussed whether similar deficits in memory modification could be observed across two behavioral tasks.

      (4) Reviewer # mentioned that the change in pattern of behavior only shows up in the older mice questioning the clinical relevance of early detection. I do think this is a valid point and maybe should be addressed. There does seem to be a bit of a bump in the controls on day 23 that doesn't appear in the 6-month group. Perhaps this was initially a spontaneous recovery test indicated by the dotted vertical line? This vertical line does not appear to be defined in the figure 1 legend, nor in figures 2 and 3.

      We would like to emphasize that the App<sup>NL-G-F</sup> knock-in mouse is widely considered a model of early-stage AD, characterized by Aβ accumulation with little to no neurofibrillary tangle pathology or neuronal loss (see Introduction). By examining different ages, we can assess the contribution of both the amount and duration of Aβ accumulation as well as age-related factors. Modeling the deficit in the memory modification process in the older App<sup>NL-G-F</sup> knock-in mice, we suggested a diverged internal state in early-stage AD in older age, and this does not diminish the relevance of the model for studying early cognitive changes in AD.

      We would also like to clarify again that the x-axis in the figure is “Trial,” not “Day.” The vertical dashed lines in these figures indicate phase boundaries, and they were defined in the figure legend: in Figure 1C, “The vertical dashed lines separate the phases.”; in Figure 2B, “The dashed vertical line separates the extinction 1 and extinction 2 phases.”; in Figure 3, “The vertical dashed lines indicate the boundaries of phases.”

      (5) Are the examples in figure 3 good examples? The example for the 12-month-old control shows a substantial increase in weights for the context during test 3, but not for the CS. Yet in the bar plots in Figure 4 G and H, this pattern seems to be different. The weights for the context appear to substantially drop in the "after extinction" period as compared to the "extinction" period. It's hard to tell the change from "extinction" to "after extinction" for the CS weights (the authors change the y-axis for the CS weights but not for the context weights from panels G to H).

      We would like to clarify that in Figure 3C, the increase in weights for the context does not occur during test 3 (trial 36), as the reviewer noted; rather, it occurs during the unsignaled shock phase (trial 35).

      We assume that the reviewer may have understood the labels on the left in Figure 4, “Acquisition”, “Extinction”, and “After extinction”, as indicating time points. However, the data shown in Figure 4C to 4H are all from the same time point: test 3 (trial 36). The grouping reflects the classification of latent causes based on the trial in which they were inferred. In addition, for Figures 4G and 4H, the y‐axis limits were not set identically because the data range for “Sum of w<sub>CS</sub>” varied. This was done to ensure the visibility of all data points. In Figure 4, each dot represents one animal. Take Figure 3D as an example: the point in Figure 4G is the sum of w3 and w4 in trial 36, and the point in Figure 4H is w5 in trial 36. Note that the subscript numerals indicate the latent cause index. We hope this addresses the reviewer’s question about the difference between the two figures.
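
      The grouping can be sketched as follows. This is a hypothetical illustration: the phase boundaries, the trial at which each latent cause was first inferred, and the weight values are made up. Only the grouping logic (classify each latent cause by the phase in which it was first inferred, then sum its weights at trial 36) follows the description above.

```python
PHASES = ("Acquisition", "Extinction", "After extinction")

def classify_phase(first_trial, acquisition_end, extinction_end):
    """Assign a latent cause to the phase in which it was first inferred."""
    if first_trial <= acquisition_end:
        return "Acquisition"
    if first_trial <= extinction_end:
        return "Extinction"
    return "After extinction"

def sum_weights_by_phase(first_trials, weights, acquisition_end, extinction_end):
    """Sum the weights (taken at a single trial) of the latent causes in each phase."""
    totals = {phase: 0.0 for phase in PHASES}
    for cause, first_trial in first_trials.items():
        phase = classify_phase(first_trial, acquisition_end, extinction_end)
        totals[phase] += weights[cause]
    return totals

# Hypothetical animal: latent cause index -> trial at which it was first inferred.
first_trials = {1: 1, 2: 2, 3: 5, 4: 20, 5: 35}
# Hypothetical CS weights of each latent cause at test 3 (trial 36).
w_cs_trial36 = {1: 0.30, 2: 0.10, 3: 0.05, 4: 0.02, 5: 0.01}

# Phase boundaries are assumptions chosen for illustration only.
totals = sum_weights_by_phase(first_trials, w_cs_trial36,
                              acquisition_end=4, extinction_end=34)
print(totals)
```

      In this toy example, causes 3 and 4 are summed into the “Extinction” group and cause 5 alone forms the “After extinction” group, mirroring the Figure 3D example, while every weight is read out at the same trial.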


      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      The authors show certain memory deficits in a mouse knock-in model of Alzheimer's Disease (AD). They show that the observed memory deficits can be explained by a computational model, the latent cause model of associative memory. The memory tasks used include the fear memory task (CFC) and the 'reverse' Barnes maze. Research on AD is important given its known huge societal burden. Likewise, better characterization of the behavioral phenotypes of genetic mouse models of AD is also imperative to advance our understanding of the disease using these models. In this light, I applaud the authors' efforts.

      Strengths:

      (1) Combining computational modelling with animal behavior in genetic knock-in mouse lines is a promising approach, which will be beneficial to the field and potentially explain any discrepancies in results across studies as well as provide new predictions for future work.

      (2) The authors' usage of multiple tasks and multiple ages is also important to ensure generalization across memory tasks and 'modelling' of the progression of the disease.

      Weaknesses:

      [#1] (1) I have some concerns regarding the interpretation of the behavioral results. Since the computational model then rests on the authors' interpretation of the behavioral results, it, in turn, makes judging the model's explanatory power difficult as well. For the CFC data, why do knock-in mice have stronger memory in test 1 (Figure 2C)? Does this mean the knock-in mice have better memory at this time point? Is this explained by the latent cause model? Are there some compensatory changes in these mice leading to better memory? The authors use a discrimination index across tests to infer a deficit in re-instatement, but this indicates a relative deficit in re-instatement from memory strength in test 1. The interpretation of these differential DIs is not straightforward. This is evident when test 1 is compared with test 2, i.e., the time point after extinction, which also shows a significant difference across groups, Figure 2F, in the same direction as the re-instatement. A clarification of all these points will help strengthen the authors' case.

      We appreciate the reviewer for the critical comments. According to the latent cause framework, the strength of the memory is influenced by at least 2 parameters: associative weight between CS and US given a latent cause and posterior probability of the latent cause. The modeling results showed that a higher posterior probability of acquisition latent cause, but not higher associative weight, drove the higher test 1 CR in App<sup>NL-G-F</sup> mice (Results and Discussion; Figure 4 – figure supplement 3B, 3C). In terms of posterior, we agree that App<sup>NL-G-F</sup> mice have strong fear memory. On the other hand, this suggests that App<sup>NL-G-F</sup> mice exhibited a tendency toward overgeneralization, favoring modification of old memories, which adversely affected the ability to retain competing memories. The strong memory in test 1 would be a compensatory effect of overgeneralization.    

      To estimate the magnitude of reinstatement, one would have to compare, at a minimum, CRs between test 2 (extinction) and test 3 (reinstatement), as well as those between test 1 (acquisition) and test 3. These comparisons represent the extent to which the memory at reinstatement is far from that at extinction and close to that at acquisition. Since the discrimination index (DI) has been widely used as a normalized measure of the extent to which a system can distinguish between two conditions, we applied DI consistently to the behavioral and simulated data in the reinstatement experiment and to the behavioral data in the reversal Barnes maze experiment, allowing us to evaluate the discriminability of an agent in these experiments. In addition, we used DI to examine its correlation with estimated parameters, enabling us to explore how individual discriminability may relate to the internal state. We have already discussed the differences in DI between test 3 and test 1, as well as CR in test 1 between control and App<sup>NL-G-F</sup> mice, in the manuscript, and have further elaborated on this point in Lines 232 and 745-748.
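
      As an illustration, a discrimination index between two tests can be computed as below. The formula DI = CR<sub>a</sub> / (CR<sub>a</sub> + CR<sub>b</sub>) is an assumption, inferred from the statement that a DI above 0.5 between test 3 and test 2 corresponds to a higher CR at test 3. The manuscript's own definition should be taken as authoritative, and the freezing values here are hypothetical.

```python
def discrimination_index(cr_a, cr_b):
    """Assumed normalized index: 0.5 means no difference, > 0.5 means cr_a > cr_b."""
    total = cr_a + cr_b
    if total == 0:
        return float("nan")  # undefined when there is no response in either test
    return cr_a / total

# Hypothetical percent freezing for one animal.
cr_test1, cr_test2, cr_test3 = 60.0, 20.0, 60.0

di_test3_vs_test2 = discrimination_index(cr_test3, cr_test2)  # distance from extinction
di_test3_vs_test1 = discrimination_index(cr_test3, cr_test1)  # closeness to acquisition
print(di_test3_vs_test2, di_test3_vs_test1)
```

      Read this way, strong reinstatement corresponds to a DI well above 0.5 for test 3 versus test 2 together with a DI near 0.5 for test 3 versus test 1.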

      [#2] (2) I have some concerns regarding the interpretation of the Barnes maze data as well, where there already seems to be a deficit in the memory at probe test 1 (Figure 6C). Given that there is already a deficit in memory, would not a more parsimonious explanation of the data be that general memory function in this task is impacted in these mice, rather than the authors' preferred interpretation? How does this memory weakening fit with the CFC data showing stronger memories at test 1? While I applaud the authors for using multiple memory tasks, I am left wondering if the authors tried fitting the latent cause model to the Barnes maze data as well.

      While we agree that the deficits shown in probe test 1 may imply impaired memory function in App<sup>NL-G-F</sup> mice in this task, it would be difficult to explain this solely in terms of impairments in general memory function. The learning curve and the daily strategy changes suggested that App<sup>NL-G-F</sup> mice would have virtually intact learning ability in the initial training phase (Figure 6B, 6F, Figure 6 – figure supplement 1 and 3). For the correspondence relationship between the reinstatement and the reversal Barnes maze learning from the aspect of memory modification process, please also see our reply to comment #24. We have explained why we did not fit the latent cause model to the Barnes maze data in the provisional response.

      [#3] (3) Since the authors use the behavioral data for each animal to fit the model, it is important to validate that the fits for the control vs. experimental groups are similar to the model (i.e., no significant differences in residuals). If that is the case, one can compare the differences in model results across groups (Figures 4 and 5). Some further estimates of the performance of the model across groups would help.

      We have added the residual (i.e., observed CR minus simulated CR) in Figure 3 – figure supplement 1D and 1E. The fit was similar between control and App<sup>NL-G-F</sup> mice groups in the test trials, except test 3 in the 12-month-old group. The residual was significantly higher in the 12-month-old control mice than App<sup>NL-G-F</sup> mice, suggesting the model underestimated the reinstatement in the control, yet the DI calculated from the simulated CR replicates the behavioral data (Figure 3 – figure supplement 1A to 1C). These results suggest that the latent cause model fits our data with little systematic bias such as an overestimation of CR for the control group in the reinstatement, supporting the validity of the comparisons in estimated parameters between groups. These results and discussion have been added in the manuscript Line 269-276.
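
      The residual check described above can be sketched as follows. This is an illustrative Python example, not the analysis code used in the study; the function name `residuals` and all numeric values are hypothetical. It only shows the sign convention stated in the reply (observed CR minus simulated CR), under which a positive residual means the model underestimated the response.

```python
def residuals(observed, simulated):
    """Per-trial residual: observed CR minus simulated CR."""
    if len(observed) != len(simulated):
        raise ValueError("observed and simulated must have the same length")
    return [o - s for o, s in zip(observed, simulated)]


# Hypothetical freezing rates (fractions in [0, 1]) for one mouse at tests 1 to 3.
observed_cr = [0.80, 0.20, 0.60]
simulated_cr = [0.75, 0.25, 0.45]

res = residuals(observed_cr, simulated_cr)

# A positive residual means the model underestimated the observed CR on that trial.
underestimated = [r > 0 for r in res]
```

      In a group comparison, the per-mouse residuals at each test trial would then be compared between genotypes with whichever test the study specifies.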

      One may notice that the latent cause model overestimated the CR in acquisition trials in all groups in Figure 3 – figure supplement 1D and 1E. We have discussed this point in the reply to comment #26, 34 questioned by reviewer 3.

      [#4] (4) Is there an alternative model the authors considered, which was outweighed in terms of prediction by this model? 

      Yes, we have further evaluated two alternative models: the Rescorla-Wagner (RW; Rescorla & Wagner, 1972) model and the latent state model (LSM; Cochran & Cisler, 2019). The RW model serves as a baseline, given its known limitations in explaining fear return after extinction. The LSM is another contemporary model that shares several concepts with the latent cause model (LCM) such as building upon the RW model, assuming a latent variable inferred by Bayes’ rule, and involving a ruminative update for memory modification. We evaluated the three models in terms of the prediction accuracy and reproducibility of key behavioral features. Please refer to the Appendix 1 for detailed methods and results for these two models.

      As expected, the RW model fit the data well until the end of extinction but failed to reproduce reinstatement (Appendix 1 – figure 1A to 1D). Due to a large prediction error in test 3, few samples met the acceptance criteria we set (Appendix 1 – figure 2 and 3A). Conversely, the LSM reproduced reinstatement, as well as gradual learning in the acquisition and extinction phases, particularly in the 12-month-old control (Appendix 1 – figure 1G). The number of accepted samples in the LSM was higher than in the RW model but generally lower than in the LCM (Appendix 1 – figure 2). The sum of prediction errors over all trials in the LSM was comparable to that in the LCM in the 6-month-old group (Appendix 1 – figure 4A), whereas it was significantly lower in the 12-month-old group (Appendix 1 – figure 4B). In particular, the LSM generated smaller prediction errors during the acquisition trials than the LCM, suggesting that the LSM might be better at explaining the behaviors of acquisition (Appendix 1 – figure 4A and 4B; but see the reply for comment #34). While the LSM generated smaller prediction errors than the LCM in test 2 of the control group, it failed to replicate the observed DIs, a critical behavioral phenotype difference between control and App<sup>NL-G-F</sup> mice (Appendix 1 – figure 6A to 6C; cf. Figure 2F to 2H, Figure 3 – figure supplement 1A to 1C).

      Thus, although each model could capture different aspects of reinstatement, adopting the LCM to explain reinstatement aligns better with our purpose. It should also be noted that we did not explore the full parameter space of the LSM; hence, we cannot rule out the possibility that alternative parameter sets could provide a better fit and explain the memory modification process well. A more comprehensive parameter search in the LSM may be a valuable direction for future research.

      [#5] One concern here is also parameter overfitting. Did the authors try leaving out some data (trials/mice) and predicting their responses based on the fit derived from the training data?

      Following the reviewer’s suggestion, we examined whether overfitting occurred when all trials were used to estimate parameters. Estimating parameters while actually leaving out trials would disrupt the time lapse across trials, and thereby the prior of latent causes in each trial. Instead, we removed the constraint of prediction error by setting the error threshold to 1 for certain trials to virtually leave these trials out. We treated these trials as a virtual “test” dataset, while the rest of the trials formed the “training” dataset. For the median CR data of each group (Figure 3), we estimated parameters under 6 conditions with unique training and test trials, then evaluated the prediction error for the training and test trials. Note that training and test trials were arbitrarily decided. Also, the error threshold for the acquisition trials was set to 1 as described in Materials and Methods (we discuss the reason in the reply to comment #34), so we treated acquisition trials separately from the test trials. We expect that the contribution of the data from the acquisition and test trials to parameter estimation would be discounted compared to that from the training trials with the constraint, and that if overfitting occurred, the prediction error in the test trials would be worse than that in the training trials.
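
      The idea of virtually leaving out a trial by relaxing its error threshold can be sketched as follows. This is an illustrative Python example under stated assumptions, not the estimation code used in the study: it assumes a parameter sample is accepted when the absolute difference between observed and simulated CR stays within a per-trial threshold, and the helper names (`is_accepted`, `virtual_leave_out`), the default threshold of 0.3, and all data values are hypothetical.

```python
def is_accepted(observed, simulated, thresholds):
    """Accept a parameter sample if every trial's absolute error is within its threshold."""
    return all(abs(o - s) <= t for o, s, t in zip(observed, simulated, thresholds))


def virtual_leave_out(thresholds, left_out_trials, relaxed=1.0):
    """Return a copy of thresholds with the left-out trials relaxed to `relaxed`.

    Because CR lies in [0, 1], an absolute error can never exceed 1, so a
    threshold of 1 removes the constraint without deleting the trial. The
    trial order, and hence the timing between trials, is left untouched.
    """
    left_out = set(left_out_trials)
    return [relaxed if i in left_out else t for i, t in enumerate(thresholds)]


# Hypothetical values: four trials with an assumed default threshold of 0.3.
observed = [0.9, 0.5, 0.2, 0.7]
simulated = [0.8, 0.4, 0.3, 0.1]  # the last trial is fit poorly
thresholds = [0.3, 0.3, 0.3, 0.3]

strict = is_accepted(observed, simulated, thresholds)
relaxed = is_accepted(observed, simulated, virtual_leave_out(thresholds, [3]))
```

      With the strict thresholds the sample is rejected because of the last trial; once that trial is relaxed, it no longer constrains the fit, and its prediction error can be inspected afterwards as an out-of-sample check.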

      Author response image 1A to 1F showed the simulated and observed CR under each condition, where acquisition trials were in light-shaded areas, test trials were in dark-shaded areas, and the rest of the trials were training trials. Author response image 1G showed mean squared prediction error across the acquisition, training and test trials under each condition. The dashed gray line showed the mean squared prediction error of training trials in Figure 3 as a baseline.

      In conditions i and ii, where two or four trials in the extinction were used for training (Author response image 1A and 1B), the prediction error was generally higher in test trials than in training trials. In conditions iii and iv, where ten trials in the extinction were used for training (Author response image 1C and 1D), the difference in prediction error between test and training trials became smaller. These results suggest that providing more extinction trial data would reduce overfitting. In condition v (Author response image 1E), the results showed that using trials until extinction can predict reinstatement in control mice but not in App<sup>NL-G-F</sup> mice. Similarly, in condition vi (Author response image 1F), where test phase trials were left out, the prediction error differences were greater in App<sup>NL-G-F</sup> mice. These results suggest that the test trials should be used for the parameter estimation to minimize prediction error for all groups. Overall, this analysis suggests that using all trials would reduce prediction error with little overfitting.

      Author response image 1.

      Leaving trials out in parameter estimation in the latent cause model. (A – F) The observed CR (colored line) is the median freezing rate during the CS presentation over the mice within each group, which is the same as that in Figure 3. The colors indicate different groups: orange represents 6-month-old control, light blue represents 6-month-old App<sup>NL-G-F</sup> mice, pink represents 12-month-old control, and dark blue represents 12-month-old App<sup>NL-G-F</sup> mice. Under six different leave-out conditions (i – vi), parameters were estimated and used for generating simulated CR (gray line). In each condition, trials were categorized as acquisition (light-shaded area), training data (white area), and test data (dark-shaded area) based on the error threshold during parameter estimation. Only the error threshold of the test data trials was different from the original method (see Materials and Methods) and set to 1. In conditions i to iv, the numbers of test data trials in the extinction phases are 27, 25, 19, and 19, respectively. In condition v, the number of test data trials is 2 (trials 35 and 36). In condition vi, test data trials were the 3 test phases (trials 4, 34, and 36). (G) Each subplot shows the mean squared prediction error for the test data trials (gray circles), training data trials (white squares), and acquisition trials (gray triangles) in each group. The left y-axis corresponds to data from test and training trials, and the right y-axis corresponds to data from acquisition trials. The dashed line indicates the results calculated from Figure 3 as a baseline.

      Reviewer #1 (Recommendations for the authors):

      Minor:

      [#6] (1) I would like the authors to further clarify why 'explaining' the reinstatement deficit in the AD mouse model is important in working towards the understanding of AD i.e., which aspect of AD this could explain etc.

      In this study, we utilized the reinstatement paradigm with the latent cause model as an internal model to illustrate how estimating internal states can improve understanding of cognitive alteration associated with extensive Aβ accumulation in the brain. Our findings suggest that misclassification in the memory modification process, manifesting as overgeneralization and overdifferentiation, underlies the memory deficit in the App<sup>NL-G-F</sup> knock-in model mice. 

      The parameters in the internal model associated with AD pathology (e.g., α and σ<sub>x</sub><sup>2</sup> in this study) can be viewed as computational phenotypes, filling the explanatory gap between neurobiological abnormalities and cognitive dysfunction in AD. This would advance the understanding of cognitive symptoms in the early stages of AD beyond conventional behavioral endpoints alone.

      We further propose that altered internal states in App<sup>NL-G-F</sup> knock-in mice may underlie a wide range of memory-related symptoms in AD as we observed that App<sup>NL-G-F</sup> knock-in mice failed to retain competing memories in the reversal Barnes maze task. We speculate on how overgeneralization and overdifferentiation may explain some AD symptoms in the manuscript:

      - Line 565-569: overgeneralization may explain deficits in discriminating highly similar visual stimuli reported in early-stage AD patients as they misclassify the lure as previously learned object

      - Line 576-579: overdifferentiation may explain impaired ability to transfer previously learned association rules in early-stage AD patients as they misclassify them as separated knowledge. 

      - Line 579-582: overdifferentiation may explain delusions in AD patients as an extended latent cause model could simulate the emergence of delusional thinking

      We provide one more example here that overgeneralization may explain that early-stage AD patients are more susceptible to proactive interference than cognitively normal elders in semantic memory tests (Curiel Cid et al., 2024; Loewenstein et al., 2015, 2016; Valles-Salgado et al., 2024), as they are more likely to infer previously learned material. Lastly, we expect that explaining memory-related symptoms within a unified framework may facilitate future hypothesis generation and contribute to the development of strategies for detecting the earliest cognitive alteration in AD.  

      [#7] (2) The authors state in the abstract/introduction that such computational modelling could be most beneficial for the early detection of memory disorders. The deficits observed here are pronounced in the older animals. It will help to further clarify if these older animals model the early stages of the disease. Do the authors expect severe deficits in this mouse model at even later time points?

      The early stage of the disease is marked by abnormal biomarkers associated with Aβ accumulation and neuroinflammation, while cognitive symptoms are mild or absent. This stage can persist for several years, during which the level of Aβ may reach a plateau. As the disease progresses, tau pathology and neurodegeneration emerge and drive the transition into the late stage and the onset of dementia. The App<sup>NL-G-F</sup> knock-in mice recapitulate the features present in the early stage (Saito et al., 2014), where extensive Aβ accumulation and neuroinflammation worsen with age (Figure 2 – figure supplement 1). Since App<sup>NL-G-F</sup> knock-in mice primarily model Aβ pathology without tauopathy and neurodegeneration, it should be noted that they do not represent the full spectrum of the disease even at advanced ages. Therefore, older animals still model the early stages of the disease and are suitable for studying the long-term effects of Aβ accumulation and neuroinflammation.

      The ages tested in previous reports using App<sup>NL-G-F</sup> mice spanned a wide range, from 2 months old to 24 months old. Different behavioral tasks have varied sensitivity, but overall they suggest that the dysfunction worsens with aging (Bellio et al., 2024; Mehla et al., 2019; Sakakibara et al., 2018). We have previously run the reinstatement experiment with 17-month-old App<sup>NL-G-F</sup> mice (Author response image 2). They showed more advanced deficits with the same trends observed in 12-month-old App<sup>NL-G-F</sup> mice, but their freezing rates were overall at a lower level. Because possible hearing loss may affect the results and their interpretation, we decided to focus on the 12-month-old data.

      Author response image 2.

      Freezing rate across reinstatement paradigm in the 17-month-old App<sup>NL-G-F</sup> mice. Dashed and solid lines indicate the median freezing rate over 34 mice before (preCS) and during (CS) tone presentation, respectively. Red, blue, and yellow backgrounds represent acquisition, extinction, and unsignaled shock in Figure 2A. The dashed vertical line separates the extinction 1 and extinction 2 phases.

      [#8] (3) There are quite a few 'marginal' p-values in the paper at p>0.05 but near it. Should we accept them all as statistically significant? The authors need to clarify if all the experimental groups are sufficiently powered.

      For our study, we decided a priori that p < 0.05 would be considered statistically significant, as described in the Materials and Methods. Therefore, in our Results, we did not consider these marginal values as statistically significant but reported the trend, as they may indicate substantive significance.

      We described our power analysis method in the manuscript Line 897-898 and have provided the results in Tables S21 and S22.

      [#9] (4) The authors emphasize here that such computational modelling enables us to study the underlying 'reasoning' of the patient (in the abstract and introduction), I do not see how this is the case. The model states that there is a latent i.e. another underlying variable that was not previously considered.

      Our use of the term “reasoning” was to distinguish the internal model, which describes how an agent makes sense of the world, from other generative models implemented for biomarker and disease progression prediction. However, we agree that using “reasoning” may be misleading and imprecise, so to reduce ambiguity we have removed this word in our manuscript Line 27: Nonetheless, internal models of the patient remain underexplored in AD; Line 85: However, previous approaches did not suppose an internal model of the world to predict future from current observation given prior knowledge.   

      [#10] (5) The authors combine knock-in mice with controls to compute correlations of parameters of the model with behavior of animals (e.g. Figure 4B and Figure 5B). They run the risk of spurious correlations due to differences across groups, which they have indeed shown to exist (Figure 4A and 5A). It would help to show within-group correlations between DI and parameter fit, at least for the control group (which has a large spread of data).

      We agree that genotype (control, App<sup>NL-G-F</sup>) could be a confounder between the estimated parameters and DI, thereby generating spurious correlations. To address this concern, we have provided within-group correlations in Figure 4 – figure supplement 2 for the 12-month-old group and Figure 5 – figure supplement 2 for the 6-month-old group.

      In the 12-month-old group, the significant positive correlation between σ<sub>x</sub><sup>2</sup> and DI remained in both control and App<sup>NL-G-F</sup> mice even after we adjusted for the genotype effect, suggesting that it is very unlikely that the correlations in Figure 4B are due to genotype-related confounding. On the other hand, the positive correlation between α and DI was found to be significant in the control mice but not in the App<sup>NL-G-F</sup> mice. Most α values were distributed around the lower bound in App<sup>NL-G-F</sup> mice, which possibly reduced the variance and the correlation coefficient. These results support our original conclusion that α and σ<sub>x</sub><sup>2</sup> are parameters associated with a lower magnitude of reinstatement in aged App<sup>NL-G-F</sup> mice.

      In the 6-month-old group, the correlations shown in Figure 5B were not preserved within subgroups, suggesting genotype would be a confounder for α, σ<sub>x</sub><sup>2</sup>, and DI. We recognized that significant correlations in Figure 5B may arise from group differences, increased sample size, or greater variance after combining control and App<sup>NL-G-F</sup> mice. 

      Therefore, we concluded that α and σ<sub>x</sub><sup>2</sup> are associated with the magnitude of reinstatement but modulated by the genotype effect depending on the age. 

      We have added interpretations of within-group correlation in the manuscript Line 307-308, 375-378.
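
      The confounding concern raised in comment #10 can be illustrated with a small sketch. This is a Python example with entirely hypothetical numbers, not data from the study: within each group the parameter and DI are uncorrelated, yet pooling two groups that differ in their means produces a strong positive correlation. The helper `pearson_r` is written out by hand so that the example has no dependencies.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: no within-group relationship, but the groups differ in both means.
control = {"param": [1.0, 2.0, 3.0, 2.0], "di": [0.6, 0.5, 0.6, 0.7]}
knock_in = {"param": [0.0, 1.0, 0.0, 1.0], "di": [0.2, 0.1, 0.1, 0.2]}

r_control = pearson_r(control["param"], control["di"])
r_knock_in = pearson_r(knock_in["param"], knock_in["di"])

# Pooling the groups mixes the between-group difference into the correlation.
r_pooled = pearson_r(
    control["param"] + knock_in["param"],
    control["di"] + knock_in["di"],
)
```

      Here the pooled coefficient is large even though both within-group coefficients are essentially zero, which is the pattern that within-group analyses are meant to detect.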

      [#11] (6) It is unclear to me why overgeneralization of internal states will lead to the animals having trouble recalling a memory. Would this not lead to overgeneralization of memory recall instead?

      We assume that the reviewer is referring to “overgeneralization of internal states,” a case in which the animal’s internal state remained the same regardless of the observation, thereby leading to “overgeneralization of memory recall.” We agree that this could be one possible situation and appears less problematic than the case in which this memory is no longer retrievable. 

      However, in our manuscript, we did not deal with the case of “overgeneralization of internal states”. Rather, our findings illustrated how the memory modification process falls into overgeneralization or overdifferentiation and how it adversely affects the retention of competing memories, thereby causing App<sup>NL-G-F</sup> mice to have trouble recalling the same memory as the control mice. 

      According to the latent cause model, retrieval failure is explained by a mismatch of internal states, namely when an agent perceives that the current cue does not match a previously experienced one, the old latent cause is less likely to be inferred due to its low likelihood (Gershman et al., 2017). For example, if a mouse exhibited higher CR in test 2, it would be interpreted as a successful fear memory retrieval due to overgeneralization of the fear memory. However, it reflects a failure of extinction memory retrieval due to the mismatch between the internal states at extinction and test 2. This is an example that overgeneralization of memory induces the failure of memory retrieval. 

      On the other hand, App<sup>NL-G-F</sup> mice exhibited higher CR in test 1, which is conventionally interpreted as a successful fear memory retrieval. When estimating their internal states, they would infer that their observation in test 1 closely matches those under the acquisition latent causes; that is, the fear memory is overgeneralized, as shown by a higher posterior probability of the acquisition latent causes in test 1 (Figure 4 – figure supplement 3). This is an example showing that overgeneralization of memory does not always induce retrieval failure, as we explained in the reply to comment #1.

      Reviewer #2 (Public review):

      Summary:

      This manuscript proposes that the use of a latent cause model for the assessment of memory-based tasks may provide improved early detection of Alzheimer's Disease as well as more differentiated mapping of behavior to underlying causes. To test the validity of this model, the authors use a previously described knock-in mouse model of AD and subject the mice to several behaviors to determine whether the latent cause model may provide informative predictions regarding changes in the observed behaviors. They include a well-established fear learning paradigm in which distinct memories are believed to compete for control of behavior. More specifically, it's been observed that animals undergoing fear learning and subsequent fear extinction develop two separate memories for the acquisition phase and the extinction phase, such that the extinction does not simply 'erase' the previously acquired memory. Many models of learning require the addition of a separate context or state to be added during the extinction phase and are typically modeled by assuming the existence of a new state at the time of extinction. The Niv research group, Gershman et al. 2017, have shown that the use of a latent cause model applied to this behavior can elegantly predict the formation of latent states based on a Bayesian approach, and that these latent states can facilitate the persistence of the acquisition and extinction memory independently. The authors of this manuscript leverage this approach to test whether deficits in the production of the internal states, or the inference and learning of those states, may be disrupted in knock-in mice that show both a build-up of amyloid-beta plaques and a deterioration in memory as the mice age.

      Strengths:

      I think the authors' proposal to leverage the latent cause model and test whether it can lead to improved assessments in an animal model of AD is a promising approach for bridging the gap between clinical and basic research. The authors use a promising mouse model and apply this to a paradigm in which the behavior and neurobiology are relatively well understood - an ideal situation for assessing how a disease state may impact both the neurobiology and behavior. The latent cause model has the potential to better connect observed behavior to underlying causes and may pave a road for improved mapping of changes in behavior to neurobiological mechanisms in diseases such as AD.

      Weaknesses:

      I have several substantial concerns which I've detailed below. These include important details on how the behavior was analyzed, how the model was used to assess the behavior, and the interpretations that have been made based on the model.

      [#12] (1) There is substantial data to suggest that during fear learning in mice separate memories develop for the acquisition and extinction phases, with the acquisition memory becoming more strongly retrieved during spontaneous recovery and reinstatement. The Gershman paper, cited by the authors, shows how the latent causal model can predict this shift in latent states by allowing for the priors to decay over time, thereby increasing the posterior of the acquisition memory at the time of spontaneous recovery. In this manuscript, the authors suggest a similar mechanism of action for reinstatement, yet the model does not appear to return to the acquisition memory state after reinstatement, at least based on the examples shown in Figures 1 and 3. Rather, the model appears to mainly modify the weights in the most recent state, putatively the 'extinction state', during reinstatement. Of course, the authors must rely on how the model fits the data, but this seems problematic based on prior research indicating that reinstatement is most likely due to the reactivation of the acquisition memory. This may call into question whether the model is successfully modeling the underlying processes or states that lead to behavior and whether this is a valid approach for AD.

      We thank the reviewer for insightful comments. 

      We agree that, as demonstrated in Gershman et al. (2017), the latent cause model accounts for spontaneous recovery via the inference of new latent causes during extinction and the temporal compression property provided by the prior. Moreover, it was also demonstrated that even a relatively low posterior can drive behavioral expression if the weight in the acquisition latent cause is preserved. For example, when the interval between retrieval and extinction was long enough that acquisition latent cause was not dominant during extinction, spontaneous recovery was observed despite the posterior probability of acquisition latent cause (C1) remaining below 0.1 in Figure 11D of Gershman et al. (2017). 

      In our study, a high response in test 3 (reinstatement) is explained by both the acquisition and extinction latent causes. The former preserves the associative weight of the initial fear memory, while the latter has w<sub>context</sub> learned in the unsignaled shock phase. These positive w were weighted by their posterior probabilities and together contributed to the increased expected shock in test 3. Though the posterior probability of the acquisition latent cause was lower than that of the extinction latent cause in test 3 due to the passage of time, this parallels the instance mentioned above. To clarify their contributions to reinstatement, we conducted additional simulations, which we discuss in the reply to the reviewer’s next comment (comment #13).
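
      The posterior weighting described in this reply can be sketched numerically. This is an illustrative Python example, not output from the fitted model: it assumes the US prediction is the posterior-weighted sum of each latent cause's linear prediction, and the function name `expected_us`, the two-feature stimulus coding (tone, context), and all numeric values are hypothetical.

```python
def expected_us(posterior, weights, stimulus):
    """Posterior-weighted US prediction: sum_k q_k * sum_d w[k][d] * x[d]."""
    return sum(
        q * sum(w_d * x_d for w_d, x_d in zip(w_k, stimulus))
        for q, w_k in zip(posterior, weights)
    )


# Hypothetical values; stimulus features are [tone (CS), context].
stimulus = [1.0, 1.0]
weights = [
    [0.8, 0.0],  # acquisition latent cause: tone weight preserved
    [0.0, 0.5],  # extinction latent cause: context weight learned from unsignaled shock
]
posterior = [0.2, 0.8]  # acquisition cause is less probable but still contributes

# Contribution of each latent cause to the total prediction.
per_cause = [expected_us([q], [w_k], stimulus) for q, w_k in zip(posterior, weights)]
total = expected_us(posterior, weights, stimulus)
```

      In this toy case both causes add to the prediction, so a latent cause with a low posterior can still raise the expected shock when its associative weight is preserved.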

      We recognize that our results might appear to deviate from the notion that reinstatement results from the strong reactivation of acquisition memory, where one would expect a high posterior probability of the acquisition latent cause. However, we would like to emphasize that the return of fear emerges from the interplay of competing memories. Previous studies have shown that contextual or cued fear reinstatement involves a neural activity switch back to fear state in the medial prefrontal cortex (mPFC), including the prelimbic cortex and infralimbic cortex, and the amygdala, including ventral intercalated amygdala neurons (ITCv), medial subdivision of central nucleus of the amygdala (CeM), and the basolateral amygdala (BLA) (Giustino et al., 2019; Hitora-Imamura et al., 2015; Zaki et al., 2022). We speculate that such transition is parallel to the internal states change in the latent cause model in terms of posterior probability and associative weight change.

      Optogenetic manipulation experiments have further revealed how fear and extinction engrams contribute to extinction retrieval and reinstatement. For instance, Gu et al. (2022) used a cued fear conditioning paradigm and found that inhibition of extinction engrams in the BLA, ventral hippocampus (vHPC), and mPFC after extinction learning artificially increased freezing to the tone cue. Similar results were observed in contextual fear conditioning, where silencing extinction engrams in the hippocampal dentate gyrus (DG) impaired extinction retrieval (Lacagnina et al., 2019). These results suggest that weakening the extinction memory can induce a return of the fear response even without a reminder shock. On the other hand, Zaki et al. (2022) showed that inhibition of fear engrams in the BLA, DG, or hippocampal CA1 attenuated contextual fear reinstatement. However, they also reported that stimulation of these fear engrams was not sufficient to induce reinstatement, suggesting that these fear engrams only partially account for reinstatement.

      In summary, reinstatement likely results from bidirectional changes in the fear and extinction circuits, supporting our interpretation that both acquisition and extinction latent causes contribute to the reinstatement. Although it remains unclear whether these memory engrams represent latent causes, one possible interpretation is that w<sub>context</sub> update in extinction latent causes during unsignaled shock indicates weakening of the extinction memory, while preservation of w in acquisition latent causes and their posterior probability suggests reactivation of previous fear memory. 

      [#13] (2) As stated by the authors in the introduction, the advantage of the fear learning approach is that the memory is modified across the acquisition-extinction-reinstatement phases. Although perhaps not explicitly stated by the authors, the post-reinstatement test (test 3) is the crucial test for whether there is reactivation of a previously stored memory, with the general argument being that the reinvigorated response to the CS can't simply be explained by relearning the CS-US pairing, because re-exposure the US alone leads to increase response to the CS at test. Of course there are several explanations for why this may occur, particularly when also considering the context as a stimulus. This is what I understood to be the justification for the use of a model, such as the latent cause model, that may better capture and compare these possibilities within a single framework. As such, it is critical to look at the level of responding to both the context alone and to the CS. It appears that the authors only look at the percent freezing during the CS, and it is not clear whether this is due to the contextual US learning during the US re-exposure or to increased response to the CS - presumably caused by reactivation of the acquisition memory. For example, the instance of the model shown in Figure 1 indicates that the 'extinction state', or state z6, develops a strong weight for the context during the reinstatement phase of presenting the shock alone. This state then leads to increased freezing during the final CS probe test as shown in the figure. By not comparing the difference in the evoked freezing CR at the test (ITI vs CS period), the purpose of the reinstatement test is lost in the sense of whether a previous memory was reactivated - was the response to the CS restored above and beyond the freezing to the context? 
I think the authors must somehow incorporate these different phases (CS vs ITI) into their model, particularly since this type of memory retrieval that depends on assessing latent states is specifically why the authors justified using the latent causal model.

      To clarify the contribution of context, we have provided the preCS freezing rate across trials in Figure 2 – figure supplement 2. As the reviewer pointed out, the preCS freezing rate did not remain at the same level across trials, especially within the 12-month-old control and App<sup>NL-G-F</sup> groups (Figure 2 – figure supplement 2A and 2B), suggesting an effect of context. A paired-samples t-test comparing preCS freezing (Figure 2 – figure supplement 2E) and CS freezing (Figure 2E) in test 3 revealed significant differences in all groups: 6-month-old control, t(23) = -6.344, p < 0.001, d = -1.295; 6-month-old App<sup>NL-G-F</sup>, t(24) = -4.679, p < 0.001, d = -0.936; 12-month-old control, t(23) = -4.512, p < 0.001, d = -0.921; 12-month-old App<sup>NL-G-F</sup>, t(24) = -2.408, p = 0.024, d = -0.482. These results indicate that the response to the CS was above and beyond the response to the context alone. We also compared the change in freezing rate (CS freezing rate minus preCS freezing rate) between test 2 and test 3 to examine the net response to the tone. A significant difference was found in the control groups, but not in the App<sup>NL-G-F</sup> groups (Author response image 3). The increased net response to the tone in the control groups suggests that the reinstatement was partially driven by reactivation of the acquisition memory, not solely by the contextual US learning during the unsignaled shock phase. We have added these results and discussion in the manuscript Line 220-231.

      Author response image 3.

      Net freezing rate in test 2 and test 3. Net freezing rate is defined as the CS freezing rate (i.e., freezing rate during 1 min CS presentation) minus the preCS freezing rate (i.e., 1 min before CS presentation). The dashed horizontal line indicates no freezing rate change from the preCS period to the CS presentation. *p < 0.05 by paired-sample Student’s t-test, and the alternative hypothesis specifies that test 2 freezing rate change is less than test 3. Colors indicate different groups: orange represents 6-month-old control (n = 24), light blue represents 6-month-old App<sup>NL-G-F</sup> mice (n = 25), pink represents 12-month-old control (n = 24), and dark blue represents 12-month-old App<sup>NL-G-F</sup> mice (n = 25). Each black dot represents one animal. Statistical results were as follows: t(23) = -1.927, p = 0.033, Cohen’s d = -0.393 in 6-month-old control; t(24) = -1.534, p = 0.069, Cohen’s d = -0.307 in 6-month-old App<sup>NL-G-F</sup>; t(23) = -1.775, p = 0.045, Cohen’s d = -0.362 in 12-month-old control; t(24) = 0.86, p = 0.801, Cohen’s d = 0.172 in 12-month-old App<sup>NL-G-F</sup>.
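
The net freezing rate and the paired comparison described in this caption can be sketched as follows. This is a minimal illustration rather than the authors' analysis code: the function names and the freezing values are hypothetical, and Cohen's d is assumed to be the mean of the paired differences divided by their standard deviation, which implies d = t/√n.

```python
import math
import statistics

def net_freezing(cs, pre_cs):
    """Net response to the tone: CS freezing minus preCS freezing, per animal."""
    return [c - p for c, p in zip(cs, pre_cs)]

def paired_t_and_d(x, y):
    """Paired-samples t statistic, Cohen's d (assumed d = mean/sd of differences), and df."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)      # sample standard deviation
    t = mean_diff / (sd_diff / math.sqrt(n))
    d = mean_diff / sd_diff
    return t, d, n - 1

# Hypothetical freezing percentages for five animals (not data from the study).
test2_cs, test2_pre = [40.0, 55.0, 30.0, 62.0, 48.0], [20.0, 30.0, 25.0, 35.0, 28.0]
test3_cs, test3_pre = [60.0, 70.0, 45.0, 66.0, 71.0], [25.0, 32.0, 30.0, 40.0, 30.0]

net2 = net_freezing(test2_cs, test2_pre)
net3 = net_freezing(test3_cs, test3_pre)
t, d, df = paired_t_and_d(net2, net3)      # negative t: test 2 change < test 3 change
print(f"t({df}) = {t:.3f}, d = {d:.3f}")
```

Under this assumption the reported statistics are internally consistent: for example, t(23) = -1.927 with n = 24 gives d ≈ -0.393.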

      According to the latent cause model, if the reinstatement is merely induced by an association between the context and the US in the unsignaled shock phase, the CR given context only and that given context and CS in test 3 should be equal. However, the simulation conducted for each mouse using their estimated parameters confirmed that this was not the case in this study. The results showed that simulated CR was significantly higher in the context+CS condition than in the context only condition (Author response image 4). This trend is consistent with the behavioral results we mentioned above.

      Author response image 4.

      Simulation of context effect in test 3. The estimated parameter set of each sample was used to run simulations in which either the context alone or the context with the CS was present in test 3 (trial 36). The data are shown as median with interquartile range, where white bars with colored lines represent CR for context only and colored bars represent CR for context with CS. Colors indicate different groups: orange represents 6-month-old control (n = 15), light blue represents 6-month-old App<sup>NL-G-F</sup> mice (n = 12), pink represents 12-month-old control (n = 20), and dark blue represents 12-month-old App<sup>NL-G-F</sup> mice (n = 18). Each black dot represents one animal. **p < 0.01, and ***p < 0.001 by Wilcoxon signed-rank test comparing context only and context + CS in each group, and the alternative hypothesis specifies that CR in context is not equal to CR in context with CS. Statistical results were as follows: W = 15, p = 0.008, effect size r = -0.66 in 6-month-old control; W = 0, p < 0.001, effect size r = -0.88 in 6-month-old App<sup>NL-G-F</sup>; W = 25, p = 0.002, effect size r = -0.67 in 12-month-old control; W = 9, p = 0.002, effect size r = -0.75 in 12-month-old App<sup>NL-G-F</sup>.

      [#14] (3) This is related to the second point above. If the question is about the memory processes underlying memory retrieval at the test following reinstatement, then I would argue that the model parameters that are not involved in testing this hypothesis be fixed prior to the test. Unlike the Gershman paper that the authors cited, the authors fit all parameters for each animal. Perhaps the authors should fit certain parameters on the acquisition and extinction phase, and then leave those parameters fixed for the reinstatement phase. To give a more concrete example, if the hypothesis is that AD mice have deficits in differentiating or retrieving latent states during reinstatement which results in the low response to the CS following reinstatement, then perhaps parameters such as the learning rate should be fixed at this point. The authors state that the 12-month-old AD mice have substantially lower learning rate measures (almost a 20-fold reduction!), which can be clearly seen in the very low weights attributed to the AD mouse in Figure 3D. Based on the example in Figure 3D, it seems that the reduced learning rate in these mice is most likely caused by the failure to respond at test. This is based on comparing the behavior in Figures 3C to 3D. The acquisition and extinction curves appear extremely similar across the two groups. It seems that this lower learning rate may indirectly be causing most of the other effects that the authors highlight, such as the low σx, and the changes to the parameters for the CR. It may even explain the extremely high K. Because the weights are so low, this would presumably lead to extremely low likelihoods in the posterior estimation, which I guess would lead to more latent states being considered as the posterior would be more influenced by the prior.

      We thank the reviewer for the suggestion about fitting and fixing certain parameters in different phases.

      However, this strategy may not be optimal for our study for the following scientific reasons.

      Our primary purpose is to explore internal states in the memory modification process that are associated with the deficit found in App<sup>NL-G-F</sup> mice in the reinstatement paradigm. We did not restrict the question to memory retrieval, nor did we have a particular hypothesis such that only a few parameters of interest account for the impaired associative learning or structure learning in App<sup>NL-G-F</sup> mice while all other parameters are comparable between groups. We are concerned that restricting questions to memory retrieval at the test is too parsimonious and might lead to misinterpretation of the results. As we explain in reply to comment #5, removing trials in extinction during parameter estimation reduces the model fit performance and runs the risk of overfitting within the individual. Therefore, we estimated all parameters for each animal, with the assumption that the estimated parameter set represents individual internal state (i.e., learning and memory characteristics) and should be fixed within the animal across all trials.  

      Figure 3 shows the parameter estimation and simulation results using the median data of each group as an individual. The estimated parameter values are one possible case in that group, used to demonstrate how a typical learning curve fits the latent cause model. The “20-fold reduction in learning rate” mentioned by the reviewer is a comparison of two data points, not an actual comparison between groups. The comparison between control and App<sup>NL-G-F</sup> mice in the 12-month-old group for all parameters was provided in Table S7. The Mann-Whitney U test did not reveal a significant difference in learning rate (η): 12-month-old control (Mdn = 0.09, IQR = 0.23) vs. 12-month-old App<sup>NL-G-F</sup> (Mdn = 0.12, IQR = 0.23), U = 199, p = 0.587.

      We agree that a lower learning rate could bias learning toward inferring a new latent cause. However, this tendency may depend on the values of other parameters and vary across phases in the reinstatement paradigm. Here, we use ⍺ as an example and demonstrate the interaction in Appendix 2 – table 2 with relatively extreme values, ⍺ = {1, 3} and η = {0.01, 0.5}, while the rest of the parameters were fixed at their initial guess values.

      When ⍺ = 1, the number of latent causes across phases (K<sub>acq</sub>, K<sub>ext</sub>, K<sub>rem</sub>) remained unchanged and their posterior probabilities in test 3 were comparable even when η increased from 0.01 to 0.5. This is an example in which a lower η does not lead to inferring new latent causes because ⍺ is low. The effect of a low learning rate manifests in the test 3 CR due to low w<sub>context, acq</sub> and w<sub>context, ext</sub>.

      When ⍺ = 3, the number of acquisition latent causes (K<sub>acq</sub>) was higher for η = 0.01 than for η = 0.5, showing the effect mentioned by the reviewer. However, the test 1 CR is much lower when η = 0.01, indicating unsuccessful learning even after inferring a new latent cause. This case was not observed in this study. During the extinction phases, the effect of η is surpassed by the effect of high ⍺, where the number of extinction latent causes (K<sub>ext</sub>) is high and not affected by η. After the extinction phases, the effect of K kicks in as the total number of latent causes reaches its value (K = 33 in this example), especially in the case of η = 0.01. A new latent cause is inferred after extinction in the condition of η = 0.5, but the test 3 CR is still high as w<sub>context, acq</sub> and w<sub>context, ext</sub> are high. This is an example in which a new latent cause is inferred in spite of a higher η.

      Overall, the learning rate would not have a prominent effect alone throughout the reinstatement paradigm, and it has a joint effect with other parameters. Note that the example here did not cover our estimated results, as the estimated learning rate was not significantly different between control and App<sup>NL-G-F</sup> mice (see above). Please refer to the reply to comment #31 for more discussion about the interaction among parameters when the learning rate is fixed. We hope this clarifies the reviewer’s concern.

      [#15] (4) Why didn't the authors use the latent causal model on the Barnes maze task? The authors mention in the discussion that different cognitive processes may be at play across the two tasks, yet reversal tasks have been suggested to be solved using latent states to be able to flip between the two different task states. In this way, it seems very fitting to use the latent cause model. Indeed, it may even be a better way to assess changes in σx as there are presumably 12 observable stimuli/locations.

      Please refer to our provisional response about the application of the latent cause model to the reversal Barnes maze task. Briefly, it would be difficult to directly apply the latent cause model to the Barnes maze data because this task involves operant learning, and thereby almost all conditions in the latent cause model are not satisfied. Please also see our reply to comment #24 for the discussion of the link between the latent cause model and Barnes maze task. 

      Reviewer #2 (Recommendations for the authors):

      [#16] (1) I had a bit of difficulty finding all the details of the model. First, I had to mainly rely on the Gershman 2017 paper to understand the model. Even then, there were certain aspects of the model that were not clear. For instance, it's not quite clear to me when the new internal states are created and how the maximum number of states is determined. After reading the authors' methods and the Gershman paper, it seems that a new internal state is generated at each time point, aka zt, and that the prior for that state decays onwards from alpha. Yet because most 'new' internal states don't ever take on much of a portion of the posterior, most of these states can be ignored. Is that a correct understanding? To state this another way, I interpret the equation on line 129 to indicate that the prior is determined by the power law for all existing internal states and that each new state starts with a value of alpha, yet I don't see the rule for creating a new state, or for iterating k other than that k iterates at each timestep. Yet this seems to not be consistent with the fact that the max number of states K is also a parameter fit. Please clarify this, or point me to where this is better defined.

      I find this to be an important question for the current paper as it is unclear to me when the states were created. Most notably, in Figure 3, it's important to understand why there's an increase in the posterior of z<sup>5</sup> in the AD 12-month mice at test. Is state z<sup>5</sup> generated at trial 5? If so, the prior would be extremely small by trial 36, making it even more perplexing why z<sup>5</sup> has such a high posterior. If its weights are similar to z<sup>3</sup> and z<sup>4</sup>, and they have been much more active recently, why would z<sup>5</sup> come into play?

      We assume that the “new internal state" the reviewer is referring to is the “new latent cause." We would like to clarify that “internal state" in our study refers to all the latent causes at a given time point and observation. As this manuscript is submitted as a Research Advance article in eLife, we did not rephrase all the model details. Here, we explain when a new latent cause is created (i.e., the prior probability of a new latent cause is greater than 0) with the example of the 12-month-old group (Figure 3C and 3D). 

      Suppose that before the start of each trial, an agent inferred the most likely latent cause with maximum posterior, and it inferred k latent causes so far. A new latent cause can be inferred at the computation of the prior of latent causes at the beginning of each trial.  

      In the latent cause model, the prior follows a distance-dependent Chinese Restaurant Process (CRP; Blei and Frazier, 2011). The prior of each old latent cause is its posterior probability, which is the final count of the EM update before the current trial. In addition, the prior of old latent causes is sensitive to the time passage so that it exponentially decreases as a forgetting function modulated by g (see Figure 2 in Gershman et al., 2017). Simultaneously, the prior of a new cause is assigned ⍺. The new latent cause is inferred at this moment. Hence, the prior of latent causes is jointly determined by ⍺, g and its posterior probability. The maximum number of latent causes K is set a priori and does not affect the prior while k < K (see also reply to comment #30 for the discussion of boundary set for K and comment #31 for the discussion of the interaction between ⍺ and K). Note that only one new latent cause can be inferred in each trial, and the (k+1)<sup>th</sup> latent cause, which has never been inferred so far, is chosen as the new latent cause.
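
The construction of the prior described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' or Gershman et al.'s implementation: the function name `latent_cause_prior` is hypothetical, time is measured in trials, and the forgetting kernel is assumed to be a power law in elapsed time with exponent g (the response only states that the decay is modulated by g).

```python
def latent_cause_prior(past_posteriors, alpha, g, K):
    """Normalized prior over latent causes at the start of the current trial.

    past_posteriors: one list per past trial; entry j is the posterior of cause j
                     on that trial (later trials may contain more causes).
    alpha:           unnormalized prior mass given to a single new cause (> 0).
    g:               exponent of the assumed power-law forgetting kernel.
    K:               maximum number of latent causes, set a priori.
    """
    t = len(past_posteriors)                       # index of the current trial
    k = max((len(q) for q in past_posteriors), default=0)
    weights = [0.0] * k
    for t_past, q in enumerate(past_posteriors):
        decay = (t - t_past) ** (-g)               # older trials contribute less
        for j, p in enumerate(q):
            weights[j] += decay * p
    if k < K:                                      # at most one new cause per trial
        weights.append(alpha)
    total = sum(weights)
    return [w / total for w in weights]

# Three past trials with two inferred causes; the last entry is the new cause.
history = [[1.0], [1.0], [0.2, 0.8]]
print(latent_cause_prior(history, alpha=1.0, g=1.0, K=10))
# With K equal to the number of existing causes, no new cause is offered.
print(latent_cause_prior(history, alpha=1.0, g=1.0, K=2))
```

In this sketch K only acts as a cap: while fewer than K causes exist, it has no influence on the prior, in line with the description above.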

      In our manuscript, the subscript number in zₖ denotes the order in which they were inferred, not the trial number. In Figures 3C and 3D, z<sub>3</sub> and z<sub>4</sub> were inferred in trials 5 and 6 during extinction; z<sub>5</sub> is a new latent cause inferred in trial 36. Therefore, the prior of z<sub>5</sub> is not extremely small compared to z<sub>4</sub> and z<sub>3</sub>.

      In both control and App<sup>NL-G-F</sup> mice in the 12-month-old group (Figures 3C and 3D), z<sup>3</sup> is dominant until trial 35. The unsignaled shock at trial 35 generates a large prediction error as only context is presented and followed by the US. This prediction error reduces the posterior of z<sup>3</sup>, while increasing the posterior of z<sup>4</sup> and w<sub>context</sub> in z<sup>3</sup> and z<sup>4</sup>. This decrease in the posterior of z<sup>3</sup> is more obvious in the App<sup>NL-G-F</sup> group than in the control group, prompting them to infer a new latent cause z<sup>5</sup> (Figure 3C and 3D). Although Figure 3C and 3D are illustrative examples, as we explained in the reply to comment #14, this interpretation would be plausible as the App<sup>NL-G-F</sup> group inferred a significantly larger number of latent causes after the extinction, with slightly higher posteriors, than the control group (Figure 4E).

      [#17] (2) Related to the above, Are the states zA and zB defined by the authors to help the reader group the states into acquisition and extinction states, or are they somehow grouped by the model? If the latter is true, I don't understand how this would occur based on the model. If the former, could the authors state that these states were grouped together by the author?

      We used zA and zB annotations to assist with the explanation, so this is not grouped by the model. We have stated this in the manuscript Line 181-182.

      [#18] (3) This expands on the third point above. In Figure 3D, internal states z<sup>3</sup>, z<sup>4</sup>, and z<sup>5</sup> appear to be pretty much identical in weights in the App group. It's not clear to me why then the posterior of z<sup>5</sup> would all of a sudden jump up. If I understand correctly, the posterior is the likelihood of the observations given the internal state (presumably this should be similar across z<sup>3</sup>,z<sup>4</sup>, and z<sup>5</sup>), multiplied by the prior of the state. Z3 and Z4 are the dominant inferred states up to state 36. Why would z<sup>5</sup> become more likely if there doesn't appear to be any error? I'm inferring no error because there are little or no changes in weights on trial 36, most prominently no changes in z<sup>3</sup> which is the dominant internal state in step 36. If there's little change in weights, or no errors, shouldn't the prior dominate the calculation of the posterior which would lead to z<sup>3</sup> and z<sup>4</sup> being most prominent at trial 36?

      We have explained how z<sup>5</sup> of the 12-month-old App<sup>NL-G-F</sup> was inferred in the reply to comment #16. Here, we explain the process underlying the rapid changes of the posterior of z<sup>3</sup>, z<sup>4</sup>, and z<sup>5</sup> from trial 35 to 36.

      During the extinction, the mice inferred z<sup>3</sup> given the CS and the context in the absence of US. In trial 35, they observed the context and the unsignaled shock in the absence of the CS. This reduced the likelihood for the CS under z<sup>3</sup> and thereby the posterior of z<sup>3</sup>, while relatively increasing the posterior of z<sup>4</sup>. The associative weight between the context and the US, w<sub>context</sub>, indeed increased in both z<sup>3</sup> and z<sup>4</sup>, but w<sub>context</sub> of z<sup>4</sup> was updated more than that of z<sup>3</sup> due to its higher posterior probability. At the beginning of trial 36, a new latent cause z<sup>5</sup> was inferred with a certain prior (see also the reply to comment #16), and w<sub>5</sub> = w<sub>0</sub>, where w<sub>0</sub> is the initial value of weight. After normalizing the prior over latent causes, the emergence of z<sup>5</sup> reduced the prior probability of other latent causes compared to the case where the prior of z<sup>5</sup> is 0. Since the CS was presented while the US was absent in trial 36, the likelihood of the CS and that of the US under z<sup>3</sup>, and especially z<sup>4</sup>, given the cues and w became lower than the case in which z<sup>5</sup> has not been inferred yet. Consequently, the posterior of z<sup>5</sup> became salient (Figure 3D).
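
The argument in this paragraph rests on the posterior being proportional to the prior times the likelihood. A minimal numerical sketch, using made-up prior and likelihood values rather than any quantity estimated in the study, shows how a newly inferred cause with a small prior can still end up with the largest posterior when the older causes explain the current observation poorly.

```python
def posterior(prior, likelihood):
    """Bayes' rule over latent causes: normalize prior * likelihood."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical values for three causes (two old, one new) on one trial.
prior = [0.6, 0.3, 0.1]           # the new cause takes prior mass from the old ones
likelihood = [0.05, 0.02, 0.5]    # old causes predict the observation poorly
post = posterior(prior, likelihood)
print([round(p, 3) for p in post])
```

With these illustrative numbers the third cause dominates the posterior even though its prior is the smallest of the three.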

      To maintain consistency across panels, we used a uniform y-axis range. However, we acknowledge that this may make it harder to notice the changes of associative weights in Figure 3D. We have provided the subpanel in Figure 3D with a smaller y-axis limit to reveal the weight changes at trial 35 in Author response image 5.

      Author response image 5.

      Magnified view of w<sub>context</sub> and w<sub>CS</sub> in the last 3 trials in Figure 3D. The graph format is the same as in Figure 3D. The weight for CS (w<sub>CS</sub>) and that for context (w<sub>context</sub>) in each latent cause across trial 34 (test 2), 35 (unsignaled shock), and 36 (test 3) in 12-month-old App<sup>NL-G-F</sup> in Figure 3D was magnified in the upper and lower magenta box, respectively.

      [#19] (8) In Figure 4B - The figure legend didn't appear to indicate at which time points the DIs are plotted.

      We have amended the figure legend to indicate that DI between test 3 and test 1 is plotted.

      [#20] (9) Lines 301-303 state that the posterior probabilities of the acquisition internal states in the 12-month AD mice were much higher at test 1 and that this resulted in different levels of CR across the control and 12-month App group. This is shown in the Figure 4A supplement, but this is not apparent in Figure 3 panels C and D. Is the example shown in panel D not representative of the group? The CRs across the two examples in Figure 3 C and D look extremely similar at test 1. Furthermore, the posteriors of the internal states look pretty similar across the two groups for the first 4 trials. Both the App and control have substantial posterior probabilities for the acquisition period, I don't see any additional states at test 1. The pattern of states during acquisition looks strikingly similar across the two groups, whereas the weights of the stimuli are considerably different. I think it would help the authors to use an example that better represents what the authors are referring to, or provide data to illustrate the difference. Figure 4C partly shows this, but it's not very clear how strong the posteriors are for the 3rd state in the controls.

      Figure 3 serves as an example to explain the internal states in each group (see also the third paragraph in the reply to comment #14). Figure 4C to H showed the results from each sample for between-group comparison in selected features. Therefore, the results of direct comparisons of the parameter values and internal states between genotypes in Figure 3 are not necessarily the same as those in Figure 4. Both examples in Figure 3C and 3D inferred 2 latent causes during the acquisition. In terms of posterior till test 1 (trial 4), the two could be the same. However, such examples were not rare, as the proportion of the mice that inferred 2 latent causes during the acquisition was slightly lower than 50% in the control, and around 90% in the App<sup>NL-G-F</sup> mice (Figure 4C). The posterior probability of acquisition latent cause in test 1 showed a similar pattern (Figure 4 – figure supplement 3), with values near 1 in around 50% of the control mice and around 90% of the App<sup>NL-G-F</sup> mice.  

      [#21] (10) Line 320: This is a confusing sentence. I think the authors are saying that because the App group inferred a new state during test 3, this would protect the weights of the 'extinction' state as compared to the controls since the strength of the weight updates depends on the probability of the posterior.

      In order to address this, we have revised this sentence to “Such internal states in App<sup>NL-G-F</sup> mice would diverge the associative weight update from those in the control mice after extinction.” in the manuscript Line 349-351.

      [#22] (11) In lines 517-519 the authors address the difference in generalizing the occurrence of stimuli across the App and control groups. It states that App mice with lower alpha generalized observations to an old cause rather than attributing it as a new state. Going back to statement 3 above, I think it's important to show that the model fit of a reduction in alpha does not go hand-in-hand with a reduction in the learning rates and hence the weights. Again, if the likelihoods are diminished due to the low weights, then the fit of alpha might be reduced as well. To reiterate my point above, if the observations in changes in generalization and differentiation occur because of a reduction in the learning rate, the modeling may not be providing a particularly insightful understanding of AD, other than that poor learning leads to ineffectual generalization and differentiation. Do these findings hold up if the learning rates are more comparable across the control and App group?

      These findings were explained on the basis of comparable learning rates between control and App<sup>NL-G-F</sup> mice in the 12-month-old group (see the reply to comment #14). In addition, we have conducted simulations for different ⍺ and σ<sub>x</sub><sup>2</sup> values under the condition of a fixed learning rate, where overgeneralization and overdifferentiation still occurred (see the reply to comment #26).

      [#23] (12) Lines 391 - 393. This is a confusing sentence. "These results suggest that App NL-G-F mice could successfully form a spatial memory of the target hole, while the memory was less likely to be retrieved by a novel observation such as the absence of the escape box under the target hole at the probe test 1." The App mice show improved behavior across days of approaching the correct hole. Is this statement suggesting that once they've approached the target hole, the lack of the escape box leads to a reduction in the retention of that memory?

      We speculated that when the mice observed the absence of the escape box, a certain prediction error would be generated, which may have driven the memory modification. In App<sup>NL-G-F</sup> mice, such modification, either overgeneralization or overdifferentiation, could render the memory of the target hole vulnerable; if overgeneralization occurred, the memory would be quickly overwritten as the goal no longer exists in this position in this maze, while if overdifferentiation occurred, a novel memory such that the goal does not exist in the maze different from previous one would be formed. In either case of misclassification, the probability of retrieving the goal position would be reduced. To reduce ambiguity in this sentence, we have revised the description in the manuscript Line 432-434 as follows: “These results suggest that App<sup>NL-G-F</sup> mice could successfully form a spatial memory of the target hole, while they did not retrieve the spatial memory of the target hole as strongly as control mice when they observed the absence of the escape box during the probe test.”

      [#24] (13) The connection between the results of Barnes maze and the fear learning paradigm is weak. How can changes in overgeneralization due to a reduction in the creation of inferred states and differentiation due to a reduced σx lead to the observations in the Barnes maze experiment?

      We extrapolated our interpretation in the reinstatement modeling to behaviors in a different behavioral task, to explore the explanatory power of the latent cause framework formalizing mechanisms of associative learning and memory modification. Here, we explain the results of the reversal Barnes maze paradigm in terms of the latent cause model, while referring to the reinstatement paradigm.

      Whilst we acknowledge that fear conditioning and spatial learning are not fully comparable, the reversal Barnes maze paradigm used in our study shares several key learning components with the reinstatement paradigm. 

      First, associative learning is fundamental in spatial learning (Leising & Blaisdell, 2009; Pearce, 2009). Although we did not make any specific assumptions of what kind of associations were learned in the Barnes maze, performance improvements in learning phases likely reflect trial-and-error updates of these associations involving sensory preconditioning or secondary conditioning. Second, the reversal training phases could resemble the extinction phase in the reinstatement paradigm, challenging previously established memory. In terms of the latent cause model, both the reversal learning phase in the reversal Barnes maze paradigm and the extinction phase in the reinstatement paradigm induce a mismatch of the internal state. This process likely introduces large prediction errors, triggering memory modification to reconcile competing memories.

      Under the latent cause framework, we posit that the mice would either infer new memories or modify existing memories for the unexpected observations in the Barnes maze (e.g., changed location or absence of escape box) as in the reinstatement paradigm, but learn a larger number of association rules between stimuli in the maze compared to those in the reinstatement. In the reversal Barnes maze paradigm, the animals would infer that a latent cause generates the stimuli in the maze at certain associative weights in each trial, and would adjust behavior by retaining competing memories.

      Both overgeneralization and overdifferentiation could explain the lower exploration time of the target hole in the App<sup>NL-G-F</sup> mice in probe test 1. In the case of overgeneralization, the mice would overwrite the existing spatial memory of the target hole with a memory that the escape box is absent. In the case of overdifferentiation, the mice would infer a new memory such that the goal does not exist in the novel field, in addition to the old memory where the goal exists in the previous field. In both cases, the App<sup>NL-G-F</sup> mice would not infer that the location of the goal is fixed at a particular point and would fail to retain competing spatial memories of the goal, leading them to rely on a less precise, non-spatial strategy to solve the task.

      Since there is no established way to formalize the Barnes maze learning in the latent cause model, we did not directly apply the latent cause model to the Barnes maze data. Instead, we used the view above to explore common processes in memory modification between the reinstatement and the Barnes maze paradigm. 

      The above description was added to the manuscript on page 13 (Line 410-414) and page 19-20 (Line 600-602, 626-639).

      [#25] (14) In the fear conditioning task, it may be valuable to separate responding to the context and the cue at the time of the final test. The mice can learn about the context during the reinstatement, but there must be an inference to the cue as it's not present during the reinstatement phase. This would provide an opportunity for the model to perhaps access a prior state that was formed during acquisition. This would be more in line with the original proposal by Gershman et al. 2017 with spontaneous recovery.

      Please refer to the reply to comment #13 regarding separating the response to context in test 3.  

      Reviewer #3 (Public review):

      Summary:

      This paper seeks to identify underlying mechanisms contributing to memory deficits observed in Alzheimer's disease (AD) mouse models. By understanding these mechanisms, they hope to uncover insights into subtle cognitive changes early in AD to inform interventions for early-stage decline.

      Strengths:

      The paper provides a comprehensive exploration of memory deficits in an AD mouse model, covering the early and late stages of the disease. The experimental design was robust, confirming age-dependent increases in Aβ plaque accumulation in the AD model mice and using multiple behavior tasks that collectively highlighted difficulties in maintaining multiple competing memory cues, with deficits most pronounced in older mice.

      In the fear acquisition, extinction, and reinstatement task, AD model mice exhibited a significantly higher fear response after acquisition compared to controls, as well as a greater drop in fear response during reinstatement. These findings suggest that AD mice struggle to retain the fear memory associated with the conditioned stimulus, with the group differences being more pronounced in the older mice.

      In the reversal Barnes maze task, the AD model mice displayed a tendency to explore the maze perimeter rather than the two potential target holes, indicating a failure to integrate multiple memory cues into their strategy. This contrasted with the control mice, which used the more confirmatory strategy of focusing on the two target holes. Despite this, the AD mice were quicker to reach the target hole, suggesting that their impairments were specific to memory retrieval rather than basic task performance.

      The authors strengthened their findings by analyzing their data with a leading computational model, which describes how animals balance competing memories. They found that AD mice showed somewhat of a contradiction: a tendency to both treat trials as more alike than they are (lower α) and similar stimuli as more distinct than they are (lower σx) compared to controls.

      Weaknesses:

      While conceptually solid, the model struggles to fit the data and to support the key hypothesis about AD mice's ability to retain competing memories. These issues are evident in Figure 3:

      [#26] (1) The model misses key trends in the data, including the gradual learning of fear in all groups during acquisition, the absence of a fear response at the start of the experiment, the increase in fear at the start of day 2 of extinction (especially in controls), and the more rapid reinstatement of fear observed in older controls compared to acquisition.

      We acknowledge these limitations and explain why they arise in the latent cause model as follows.

      a. Absence of a fear response at the start of the experiment and the gradual learning of fear during acquisition 

      In the latent cause model, the CR is derived from a sigmoidal transformation from the predicted outcome with the assumption that its mapping to behavioral response may be nonlinear (see Equation 10 and section “Conditioned responding” in Gershman et al., 2017). 

      The magnitude of the unconditioned response (trial 1) is determined by w<sub>0</sub>, θ, and λ. An example was given in Appendix 2 – table 3. In general, a higher w<sub>0</sub> and a lower θ produce a higher trial 1 CR when other parameters are fixed. During the acquisition phase, once the expected shock exceeds θ, CR rapidly approaches 1, and further increases in expected shock produce few changes in CR. This rapid increase was also evident in the spontaneous recovery simulation (Figure 11) in Gershman et al. (2017). The steepness of this rapid increase is modulated by λ such that a higher value produces a shallower slope. This is a characteristic of the latent cause model, assuming CR follows a sigmoid function of expected shock, while the ordinal relationship over CRs is maintained with or without the sigmoid function, as Gershman et al. (2017) mentioned. If one assumes that the CR should be proportional to the expected shock, the model can reproduce the gradual response as a linear combination of w and posteriors of latent causes while omitting the sigmoid transformation (Figure 3). 
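The threshold and slope behavior described above can be illustrated with a minimal sketch. It assumes the CR is a Gaussian-CDF sigmoid of the expected shock V, CR = Φ((V − θ)/λ), which is our reading of Equation 10 in Gershman et al. (2017). The function name and all numerical values are hypothetical and are not the fitted parameters.

```python
import math

def conditioned_response(v, theta, lam):
    """Map expected shock v to a CR in [0, 1] with a Gaussian-CDF sigmoid.

    Assumed form (illustrative): CR = Phi((v - theta) / lam), where theta is
    the response threshold and lam controls the slope.
    """
    return 0.5 * (1.0 + math.erf((v - theta) / (lam * math.sqrt(2.0))))

# Trial 1: a higher initial expected shock (driven by w0) gives a higher CR
# when theta and lam are fixed (hypothetical values).
low_w0 = conditioned_response(v=0.0, theta=0.3, lam=0.1)
high_w0 = conditioned_response(v=0.2, theta=0.3, lam=0.1)

# Once v is well above theta the CR saturates, so a further increase in
# expected shock changes the CR very little.
saturation_gain = (conditioned_response(1.0, 0.3, 0.1)
                   - conditioned_response(0.6, 0.3, 0.1))

# A larger lam gives a shallower transition around theta.
steep = conditioned_response(0.35, 0.3, 0.05) - conditioned_response(0.25, 0.3, 0.05)
shallow = conditioned_response(0.35, 0.3, 0.5) - conditioned_response(0.25, 0.3, 0.5)
```

Under these assumptions, the step-like rise during acquisition follows from the sigmoid alone; replacing the function with the identity would give a CR proportional to the expected shock.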

      b. Increase in fear at the start of day 2 extinction

      This point is partially reproduced by the latent cause model. As shown in Figure 3, trial 24 (the first trial of day 2 extinction) showed an increase in both the posterior probability of the latent cause retaining fear memory and the simulated CRs in all groups except the 6-month-old control group, though the increase in CR was small due to the sigmoid transformation (see above). The latent cause model explains this as follows: the 24 h interval between extinction 1 and 2 decreases the prior of the previously inferred latent cause, which increases the priors of the other latent causes. 

      Unlike other groups, the 6-month-old control did not exhibit increased observed CR at trial 24 but at trial 25 (Figure 3A). The latent cause model failed to reproduce it, as there was no increase in posterior probability in trial 24 (Figure 3A). This could be partially explained by the low value of g, which counteracts the effect of the time interval between days: a lower g keeps the prior of the latent causes at the same level as in the previous trial. Despite some failures in capturing this effect, our fitting policy was set to optimize prediction among the test trials given our primary purpose of explaining reinstatement.

      c. More rapid reinstatement of fear observed in older controls compared to acquisition

      We would like to point out that this was replicated by the latent cause model as shown in Figure 3 – figure supplement 1C. The DI between test 3 and test 1 calculated from the simulated CR was significantly higher in 12-month-old control than in App<sup>NL-G-F</sup> mice (cf. Figure 2C to E).  

      [#27] (2) The model attributes the higher fear response in controls during reinstatement to a stronger association with the context from the unsignaled shock phase, rather than to any memory of the conditioned stimulus from acquisition. These issues lead to potential overinterpretation of the model parameters. The differences in α and σx are being used to make claims about cognitive processes (e.g., overgeneralization vs. overdifferentiation), but the model itself does not appear to capture these processes accurately. The authors could benefit from a model that better matches the data and that can capture the retention and recollection of a fear memory across phases.

      First, we would like to clarify that the latent cause model explains the reinstatement not only by the extinction latent cause with increased w<sub>context</sub> but also by the acquisition latent cause with preserved w<sub>CS</sub> and w<sub>context</sub> (see also reply to comment #13). Second, the latent cause model primarily attributes the higher fear reinstatement in control to a lower number of latent causes inferred after extinction (Figure 4E) and higher w<sub>context</sub> in the extinction latent cause (Figure 4G). We noted that there was a trend toward significance in the posterior probability of latent causes inferred after extinction (Figure 4E), which in turn influences those of acquisition latent causes. Although the posterior probability of the acquisition latent cause appeared trivial and no significant difference was detected between control and App<sup>NL-G-F</sup> mice (Figure 4C), it was suppressed by new latent causes in App<sup>NL-G-F</sup> mice (Author response image 6).

      This indicates that App<sup>NL-G-F</sup> mice retrieved acquisition memory less strongly than control mice. Therefore, we argue that the latent cause model attributed a higher fear response in control during reinstatement not solely to the stronger association with the context but also to CS fear memory from acquisition. Although we tested whether additional models fit the reinstatement data in individual mice, these models did not satisfy our fitting criteria for many mice compared to the latent cause model (see also reply to comment #4 and #28).

      Author response image 6.

      Posterior probability of acquisition, extinction, and after extinction latent causes in test 3. The values within each bar indicate the mean posterior probability of acquisition latent cause (darkest shade), extinction latent cause (medium shade), and latent causes inferred after extinction (lightest shade) in test 3 over mice within genotype. Source data are the same as those used in Figure 4C–E (posterior of z).

      Conclusion:

      Overall, the data support the authors' hypothesis that AD model mice struggle to retain competing memories, with the effect becoming more pronounced with age. While I believe the right computational model could highlight these differences, the current model falls short in doing so.

      Reviewer #3 (Recommendations for the authors):

      [#28] Other computational models may better capture the data. Ideally, I'd look for a model that can capture the gradual learning during acquisition, and, in some mice, the inferring of a new latent cause during extinction, allowing the fear memory to be retained and referenced at the start of day 2 extinction and during later tests.

      We have further evaluated another computational model, the latent state model, and compared it with the latent cause model. The simulation of reinstatement and parameter estimation method of the latent state model were described in the Appendix.

      The latent state model proposed by Cochran and Cisler (2019) shares several concepts with the latent cause model and replicates empirical data well under certain conditions. We expected that it could also explain the reinstatement. 

      Following the same analysis flow for the latent cause model, we estimated the parameters and simulated reinstatement in the latent state model from individual CRs and median of them. In the median freezing rate data of the 12-month-old control mice, the simulated CR replicated the observed CR well and exhibited the ideal features that the reviewer looked for: gradual learning during acquisition and an increased fear at the start of the second-day extinction (Appendix 1 – figure 1G). However, a lot of samples did not fit well to the latent state model. The number of anomalies was generally higher than that in the latent cause model (Appendix 1 – figure 2). Within the accepted samples, the sum of squared prediction error in all trials was significantly lower in the latent state model, which resulted from lower prediction error in the acquisition trials (Appendix 1 – figure 4A and 4B). In the three test trials, the squared prediction error was comparable between the latent state model and the latent cause model except for the test 2 trials in the control group (Appendix 1 – figure 4A and 4B, rightmost panel). On the other hand, almost all accepted samples continued to infer the acquisition latent states during extinction without inferring new states (Appendix 1 – figure 5B and 5E, left panel), which differed from the ideal internal states the reviewer expected. While the latent state model fit performance seems to be better than the latent cause model, the accepted samples cannot reproduce the lower DI between test 3 and test 1 in aged App<sup>NL-G-F</sup> mice (Appendix 1 – figure 6C). These results make the latent state model less suitable for our purpose and therefore we decided to stay with the latent cause model. It should also be noted that we did not explore all parameter spaces of the latent state model hence we cannot rule out the possibility that alternative parameter sets could provide a better fit and explain the memory modification process well. 
      A more comprehensive parameter search in the latent state model may be a valuable direction for future research.

      If you decide not to go with a new model, my preference would be to drop the current modeling. However, if you wish to stay with the current model, I'd like to see justification or acknowledgment of the following:

      [#29] (1) Lower bound on alpha of 1: This forces the model to infer new latent causes, but it seems that some mice, especially younger AD mice, might rely more on classical associative learning (e.g., Rescorla-Wagner) rather than inferring new causes.

      We acknowledge that the default value set in Gershman et al. (2017) is 0.1, and the constraint we set is a much higher value. However, ⍺ = 1 does not always force the model to infer new latent causes.

      In the standard form of the Chinese restaurant process (CRP), the prior that the n<sup>th</sup> observation is assigned to a new cluster is given by ⍺ / (n - 1 + ⍺) (Blei & Gershman, 2012). When ⍺ = 1, the prior of the new cluster for the 2nd observation will be 0.5; when ⍺ = 3, this prior increases to 0.75. Thus, when ⍺ > 1, the prior of the new cluster is above chance early in the sequence, which may relate to the reviewer’s concern. However, this effect diminishes as the number of observations increases. For instance, the prior of the new cluster drops to 0.1 and 0.25 for the 10th observation when ⍺ = 1 and 3, respectively. Furthermore, the prior in the latent cause model is governed by not only α but also g, a scaling parameter for the temporal difference between successive observations (see Results in the manuscript) following “distance-dependent” CRP, then normalized over all latent causes including a new latent cause. Thus, it does not necessarily imply that ⍺ greater than 1 forces agents to infer a new latent cause. As shown in Appendix 2 – table 4, the number of latent causes does not inflate in each trial when α = 1. On the other hand, the high number of latent causes due to α = 2 can be suppressed when g = 0.01. More importantly, the driving force is the prediction error generated in each trial (see also comment #31 about the interaction between ⍺ and σ<sub>x</sub><sup>2</sup>). Raising the value of ⍺ per se can be viewed as increasing the probability to infer a new latent cause, not forcing the model to do so by higher α alone. 
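The numbers quoted above for the standard CRP can be checked with a short sketch. It covers only the standard form, ⍺ / (n - 1 + ⍺), and deliberately omits the g-dependent temporal kernel and the normalization over latent causes used in the latent cause model; the function name is illustrative.

```python
def crp_new_cluster_prior(n, alpha):
    """Prior that the n-th observation (1-indexed) opens a new cluster
    under the standard CRP: alpha / (n - 1 + alpha)."""
    if n < 1 or alpha <= 0:
        raise ValueError("n must be >= 1 and alpha must be positive")
    return alpha / (n - 1 + alpha)

# Reproduce the four values quoted in the text.
for alpha in (1, 3):
    for n in (2, 10):
        print(f"alpha={alpha}, n={n}: {crp_new_cluster_prior(n, alpha):.2f}")
```

For any fixed ⍺ the prior decreases monotonically with n, which is the sense in which the early-sequence effect of a large ⍺ diminishes.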

      During parameter exploration using the median behavioral data under a wider range of ⍺ with a lower boundary at 0.1, the estimated value eventually exceeded 1. Therefore, we set the lower bound of ⍺ to 1 to reduce inefficient sampling. 

      [#30] (2) Number of latent causes: Some mice infer nearly as many latent causes as trials, which seems unrealistic.

      We set the upper boundary for the maximum number of latent causes (K) to be 36 to align with the infinite features of CRP. This allowed some mice to infer more than 20 latent causes in total. When we checked the learning curves in these mice, we found that they largely fluctuated or did not show clear decreases during the extinction (Author response image 7, colored lines). The simulated learning curves were almost flat in these trials (Author response image 7, gray lines). It might be difficult to estimate the internal states of such atypical mice if the sampling process tried to fit them by increasing the number of latent causes. Nevertheless, most of the samples have a reasonable total number of latent causes: 12-month-old control mice, Mdn = 5, IQR = 4; 12-month-old App<sup>NL-G-F</sup> mice, Mdn = 5, IQR = 1.75; 6-month-old control mice, Mdn = 7, IQR = 12.5; 6-month-old App<sup>NL-G-F</sup> mice, Mdn = 5, IQR = 5.25. These data were provided in Tables S9 and S12.  

      Author response image 7.

      Samples with a high number of latent causes. Observed CR (colored line) and simulated CR (gray line) for individual samples with a total number of inferred latent causes exceeding 20. 

      [#31] (3) Parameter estimation: With 10 parameters fitting one-dimensional curves, many parameters (e.g., α and σx) are likely highly correlated and poorly identified. Consider presenting scatter plots of the parameters (e.g., α vs σx) in the Supplement.

      We have provided the scatter plots with a correlation matrix in Figure 4 – figure supplement 1 for the 12-month-old group and Figure 5 – figure supplement 1 for the 6-month-old group. As pointed out by the reviewer, there are significant rank correlations between parameters including ⍺ and σ<sub>x</sub><sup>2</sup> in both the 6 and 12-month-old groups. However, we also noted that there are no obvious linear relationships between the parameters.

      The correlation above raises a potential problem of non-identifiability among parameters. First, we computed the variance inflation factor (VIF) for all parameters to examine the risk of multicollinearity, though we did not consider a linear regression between parameters and DI in this study. All VIF values were below the conventional threshold of 10 (Appendix 2 – table 5), suggesting that severe multicollinearity is unlikely to bias our conclusions. Second, we have conducted the simulation with different combinations of ⍺, σ<sub>x</sub><sup>2</sup>, and K to clarify their contribution to overgeneralization and overdifferentiation observed in the 12-month-old group. 
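For readers unfamiliar with the VIF, the following is a minimal sketch of the textbook definition, in which the VIF of variable j equals 1 / (1 - R<sub>j</sub><sup>2</sup>), or equivalently the j-th diagonal element of the inverse correlation matrix. It is not a restatement of the exact procedure behind Appendix 2 – table 5; the helper names and the example correlation matrix are hypothetical.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif_from_correlation(corr):
    """VIF of each variable: the diagonal of the inverse correlation matrix."""
    inv = invert(corr)
    return [inv[j][j] for j in range(len(corr))]

# Hypothetical correlation matrix for three parameters (not the fitted values).
corr = [[1.0, 0.6, -0.3],
        [0.6, 1.0, -0.2],
        [-0.3, -0.2, 1.0]]
vifs = vif_from_correlation(corr)
exceeds_threshold = [v >= 10 for v in vifs]
```

With only two variables this reduces to 1 / (1 - r<sup>2</sup>), so a pairwise correlation of about 0.95 or higher would be needed to reach the threshold of 10.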

      In Appendix 2 – table 6, the values of ⍺ and σ<sub>x</sub><sup>2</sup> were either their upper or lower boundary set in parameter estimation, while the value of K was selected heuristically to demonstrate its effect. Given the observed positive correlation between ⍺ and σ<sub>x</sub><sup>2</sup>, and their negative correlation with K (Figure 4 - figure supplement 1), we consider the product of K = {4, 35}, ⍺ = {1, 3} and σ<sub>x</sub><sup>2</sup> = {0.01, 3}. Among these combinations, the representative condition for the control group is α = 3, σ<sub>x</sub><sup>2</sup> = 3, and that for the App<sup>NL-G-F</sup> group is α = 1, σ<sub>x</sub><sup>2</sup> = 0.01. In the latter condition, overgeneralization and overdifferentiation, reflected in higher test 1 CR, a lower number of acquisition latent causes (K<sub>acq</sub>), lower test 3 CR, lower DI between test 3 and test 1, and a higher number of latent causes after extinction (K<sub>rem</sub>), were strongly induced. 

      We found that conditions falling outside the empirical correlation, such as ⍺ = 3, σ<sub>x</sub><sup>2</sup> = 0.01, also reproduced overgeneralization and overdifferentiation. Similarly, the combination ⍺ = 1, σ<sub>x</sub><sup>2</sup> = 3 exhibited control-like behavior when K = 4 but shifted toward App<sup>NL-G-F</sup>-like behavior when K = 36. The effect of K was also evident when ⍺ = 3 and σ<sub>x</sub><sup>2</sup> = 3, where K = 36 led to overdifferentiation. We note that these conditions were artificially set and are likely not biologically plausible. These results underscore the non-identifiability concern raised by the reviewer. Therefore, we acknowledge that merely attributing overgeneralization to lower ⍺ or overdifferentiation to lower σ<sub>x</sub><sup>2</sup> may be overly reductive. Instead, these patterns likely arise from the joint effect of ⍺, σ<sub>x</sub><sup>2</sup>, and K. We have revised the manuscript accordingly in Results and Discussion (page 11-13, 18-19).

      [#32] (4) Data normalization: Normalizing the data between 0 and 1 removes the interpretability of % freezing, making mice with large changes in freezing indistinguishable from mice with small changes.

      As we describe in our reply to comment #26, the conditioned response in the latent cause model was scaled between 0 and 1, and we assume 0 and 1 mean the minimal and maximal CR within each mouse, respectively. Furthermore, although we initially tried to fit simulated CRs to raw CRs, we found that the fitting level was low due to the individual difference in the degree of behavioral expression: some mice exhibited a larger range of CR, while others showed a narrower one. Thus, we decided to normalize the data. We agree that this processing will make the mice with high changes in freezing% indistinguishable from those with low changes. However, the freezing% changes within the mouse were preserved and did not affect the discrimination index.
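The trade-off described above can be made concrete with a minimal sketch of within-mouse min-max scaling, in which 0 and 1 correspond to the minimal and maximal CR of each mouse. The function name and the freezing values are hypothetical and are not taken from the dataset.

```python
def normalize_within_mouse(freezing):
    """Min-max scale one mouse's freezing rates to the range [0, 1]."""
    lo, hi = min(freezing), max(freezing)
    if hi == lo:
        raise ValueError("freezing rate is constant; scaling is undefined")
    return [(x - lo) / (hi - lo) for x in freezing]

# Two hypothetical mice: one with a wide and one with a narrow freezing range.
wide = [5.0, 40.0, 90.0, 60.0]
narrow = [10.0, 17.0, 27.0, 21.0]

wide_scaled = normalize_within_mouse(wide)
narrow_scaled = normalize_within_mouse(narrow)
```

In this example the two mice become indistinguishable after scaling, which is the loss of information the reviewer points out, while the ordering and relative spacing of trials within each mouse are unchanged.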

      [#33] (5) Overlooking parameter differences: Differences in parameters, like w<sub>0</sub>, that didn't fit the hypothesis may have been ignored.

      Our initial hypothesis is that internal states were altered in App<sup>NL-G-F</sup> mice, and we did not have a specific hypothesis on which parameter would contribute to such a state. We mainly focus on the parameters (1) that are significantly different between control and App<sup>NL-G-F</sup> mice and (2) that are significantly correlated to the empirical behavioral data, DI between test 3 and test 1. 

      In the 12-month-old group, besides ⍺ and σ<sub>x</sub><sup>2</sup>, w<sub>0</sub> and K showed marginal p-values in the Mann-Whitney U test (Table S7) and moderate correlations with the DI (Table S8). While differences in K were already discussed in the manuscript, in the previous manuscript we did miss the point that w<sub>0</sub> could contribute to the differences in w between control and App<sup>NL-G-F</sup> (Figure 4G). We explain the contribution of w<sub>0</sub> to the reinstatement results here. When other parameters are fixed, higher w<sub>0</sub> would lead to higher CR in test 3, because higher w<sub>0</sub> would allow increasing w<sub>context</sub> by the unsignaled shock, leading to reinstatement (Appendix 2 – table 7). It is likely that higher w<sub>0</sub> would be sampled through the parameter estimation in the 12-month-old control but not App<sup>NL-G-F</sup>. On the other hand, the number of latent causes is not sensitive to w<sub>0</sub> when other parameters were fixed at the initial guess value (Appendix 2 – table 1), suggesting w<sub>0</sub> has a small contribution to the memory modification process. 

      Thus, we speculate that although the difference in w<sub>0</sub> between control and App<sup>NL-G-F</sup> mice may arise from the sampling process, resulting in a positive correlation with DI between test 3 and test 1, its contribution to diverged internal states would be smaller relative to α or σ<sub>x</sub><sup>2</sup> as a wide range of w<sub>0</sub> has no effect on the number of latent causes (Appendix 2 – table 7). We have added the discussion of differences in w<sub>0</sub> in the 12-month-old group in manuscript Line 357-359.

      In the 6-month-old group, besides ⍺ and σ<sub>x</sub><sup>2</sup>, 𝜃 is significantly higher in the AD mice group (Table S10) but not correlated with the DI (Table S11). We have already discussed this point in the manuscript.  

      [#34] (6) Initial response: Higher initial responses in the model at the start of the experiment may reflect poor model fit.

      Please refer to our reply to comment #26 for our explanation of what contributes to high initial responses in the latent cause model.

      In addition, achieving a good fit for the acquisition CRs was not our primary purpose, as the response measured in the acquisition phase includes not only a conditioned response to the CS and context but also an unconditioned response to the novel stimuli (CS and US). This mixed response presumably increased the variance of the measured freezing rate over individuals, therefore we did not cover the results in the discussion.

      Rather, we favor models that at least replicate the establishment of conditioning, extinction and reinstatement of fear memory in order to explain the memory modification process. As we mentioned in the reply to comment #4, alternative models, the latent state model and the Rescorla-Wagner model, failed to replicate the observation (cf. Figure 3 – figure supplement 1A-1C). Thus, we chose to stay with the latent cause model as it aligns better with the purpose of this study. 

      [#35] In addition, please be transparent if data is excluded, either during the fitting procedure or when performing one-way ANCOVA. Avoid discarding data when possible, but if necessary, provide clarity on the nature of excluded data (e.g., how many, why were they excluded, which group, etc?).

      We clarify the information of excluded data as follows. We had 25 mice for the 6-month-old control group, 26 mice for the 6-month-old App<sup>NL-G-F</sup> group, 29 mice for the 12-month-old control group, and 26 mice for the 12-month-old App<sup>NL-G-F</sup> group (Table S1). 

      Our first exclusion procedure was applied to the freezing rate data in the test phase. If the mouse had a freezing rate outside of the 1.5 IQR in any of the test phases, it is regarded as an outlier and removed from the analysis (see Statistical analysis in Materials and Methods). One mouse in the 6-month-old control group, one mouse in the 6-month-old App<sup>NL-G-F</sup> group, five mice in the 12-month-old control group, and two mice in the 12-month-old App<sup>NL-G-F</sup> group were excluded.
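The first exclusion procedure can be sketched as follows. The sketch assumes the usual fences, Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, computed separately for each test phase, and that a mouse is excluded if it falls outside the fences in any test. The quantile method, the function names, and the freezing values are illustrative assumptions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # The "inclusive" quantile method is an assumption; other methods give
    # slightly different fences.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [not (lower <= v <= upper) for v in values]

def excluded_mice(rates_by_test, k=1.5):
    """Return indices of mice flagged as outliers in at least one test."""
    n_mice = len(next(iter(rates_by_test.values())))
    flagged = [False] * n_mice
    for rates in rates_by_test.values():
        for i, is_outlier in enumerate(iqr_outliers(rates, k)):
            flagged[i] = flagged[i] or is_outlier
    return [i for i, f in enumerate(flagged) if f]

# Hypothetical freezing rates (%) for six mice of one group in three tests.
rates = {
    "test1": [20, 25, 30, 35, 40, 95],
    "test2": [30, 32, 31, 29, 33, 30],
    "test3": [1, 50, 52, 51, 49, 50],
}
removed = excluded_mice(rates)
```

In this example, mouse 5 is flagged in test 1 and mouse 0 in test 3, so both would be removed from all analyses.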

      Our second exclusion procedure was applied during the fitting and parameter estimation (see parameter estimation in Materials and Methods). We have provided the number of anomaly samples during parameter estimation in Appendix 1 – figure 2.   

      Lastly, we would like to state that all the sample sizes written in the figure legends do not include outliers detected through the exclusion procedure mentioned above.

      [#36] Finally, since several statistical tests were used and the differences are small, I suggest noting that multiple comparisons were not controlled for, so p-values should be interpreted cautiously.

      We have provided power analyses in Tables S21 and S22 with methods described in the manuscript (Line 897-898) and added a note that not all of the multiple comparisons were corrected for in the manuscript (Line 898-899).

      References cited in the response letter only 

      Bellio, T. A., Laguna-Torres, J. Y., Campion, M. S., Chou, J., Yee, S., Blusztajn, J. K., & Mellott, T. J. (2024). Perinatal choline supplementation prevents learning and memory deficits and reduces brain amyloid Aβ42 deposition in App<sup>NL-G-F</sup> Alzheimer’s disease model mice. PLOS ONE, 19(2), e0297289. https://doi.org/10.1371/journal.pone.0297289

      Blei, D. M., & Frazier, P. I. (2011). Distance Dependent Chinese Restaurant Processes. Journal of Machine Learning Research, 12(74), 2461–2488.

      Cochran, A. L., & Cisler, J. M. (2019). A flexible and generalizable model of online latent-state learning. PLOS Computational Biology, 15(9), e1007331. https://doi.org/10.1371/journal.pcbi.1007331

      Curiel Cid, R. E., Crocco, E. A., Duara, R., Vaillancourt, D., Asken, B., Armstrong, M. J., Adjouadi, M., Georgiou, M., Marsiske, M., Wang, W., Rosselli, M., Barker, W. W., Ortega, A., Hincapie, D., Gallardo, L., Alkharboush, F., DeKosky, S., Smith, G., & Loewenstein, D. A. (2024). Different aspects of failing to recover from proactive semantic interference predicts rate of progression from amnestic mild cognitive impairment to dementia. Frontiers in Aging Neuroscience, 16. https://doi.org/10.3389/fnagi.2024.1336008

      Giustino, T. F., Fitzgerald, P. J., Ressler, R. L., & Maren, S. (2019). Locus coeruleus toggles reciprocal prefrontal firing to reinstate fear. Proceedings of the National Academy of Sciences, 116(17), 8570–8575. https://doi.org/10.1073/pnas.1814278116

      Gu, X., Wu, Y.-J., Zhang, Z., Zhu, J.-J., Wu, X.-R., Wang, Q., Yi, X., Lin, Z.-J., Jiao, Z.-H., Xu, M., Jiang, Q., Li, Y., Xu, N.-J., Zhu, M. X., Wang, L.-Y., Jiang, F., Xu, T.-L., & Li, W.-G. (2022). Dynamic tripartite construct of interregional engram circuits underlies forgetting of extinction memory. Molecular Psychiatry, 27(10), 4077–4091. https://doi.org/10.1038/s41380-022-01684-7

      Lacagnina, A. F., Brockway, E. T., Crovetti, C. R., Shue, F., McCarty, M. J., Sattler, K. P., Lim, S. C., Santos, S. L., Denny, C. A., & Drew, M. R. (2019). Distinct hippocampal engrams control extinction and relapse of fear memory. Nature Neuroscience, 22(5), 753–761. https://doi.org/10.1038/s41593-019-0361-z

      Loewenstein, D. A., Curiel, R. E., Greig, M. T., Bauer, R. M., Rosado, M., Bowers, D., Wicklund, M., Crocco, E., Pontecorvo, M., Joshi, A. D., Rodriguez, R., Barker, W. W., Hidalgo, J., & Duara, R. (2016). A Novel Cognitive Stress Test for the Detection of Preclinical Alzheimer’s Disease: Discriminative Properties and Relation to Amyloid Load. The American Journal of Geriatric Psychiatry : Official Journal of the American Association for Geriatric Psychiatry, 24(10), 804–813. https://doi.org/10.1016/j.jagp.2016.02.056

      Loewenstein, D. A., Greig, M. T., Curiel, R., Rodriguez, R., Wicklund, M., Barker, W. W., Hidalgo, J., Rosado, M., & Duara, R. (2015). Proactive Semantic Interference Is Associated With Total and Regional Abnormal Amyloid Load in Non-Demented Community-Dwelling Elders: A Preliminary Study. The American Journal of Geriatric Psychiatry : Official Journal of the American Association for Geriatric Psychiatry, 23(12), 1276–1279. https://doi.org/10.1016/j.jagp.2015.07.009

      Valles-Salgado, M., Gil-Moreno, M. J., Curiel Cid, R. E., Delgado-Álvarez, A., Ortega-Madueño, I., Delgado-Alonso, C., Palacios-Sarmiento, M., López-Carbonero, J. I., Cárdenas, M. C., Matías-Guiu, J., Díez-Cirarda, M., Loewenstein, D. A., & Matias-Guiu, J. A. (2024). Detection of cerebrospinal fluid biomarkers changes of Alzheimer’s disease using a cognitive stress test in persons with subjective cognitive decline and mild cognitive impairment. Frontiers in Psychology, 15. https://doi.org/10.3389/fpsyg.2024.1373541

      Zaki, Y., Mau, W., Cincotta, C., Monasterio, A., Odom, E., Doucette, E., Grella, S. L., Merfeld, E., Shpokayte, M., & Ramirez, S. (2022). Hippocampus and amygdala fear memory engrams reemerge after contextual fear relapse. Neuropsychopharmacology, 47(11), 1992–2001. https://doi.org/10.1038/s41386-022-01407-0

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Manuscript number: RC-2025-03064

      Corresponding author(s): Massimo, Hilliard; Sean, Coakley

      1. General Statements

      We are grateful to the reviewers for taking time to review our manuscript and for providing such clear, insightful and actionable suggestions. The consensus among four independent reviewers that this story is of general interest to cell biologists, neurobiologists and clinical researchers is remarkable. In addition to our mechanistic insights into the regulation of GTPase activity, we think that the experimental systems we have developed will be of great value to study how GTPases and their associated GAPs and GEFs function to maintain the nervous system, especially due to the demonstrated conservation of these molecules. We believe that our data provide a powerful and tractable model to study such molecules in a physiological context.

      We agree with the reviewers' concerns and propose the following plan to address them.

      2. Description of the planned revisions

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):


      Summary

      Stability of the PLM axon in C. elegans is maintained through interactions with the epidermis. Previous studies by this group found that loss of the tbc-10 Rab GTPase Activating Protein strongly enhanced the PLM axon break phenotype of unc-70/beta-spectrin mutants. TBC-10 is a GAP for RAB-35 and thus loss of rab-35 suppresses the tbc-10 phenotype. Of the two RAB-35 GEFs, loss of RME-4 partially suppressed the tbc-10 phenotype and FLCN-1 was not involved suggesting that there may be an additional GEF involved. Here Bonacossa-Pereira et al identify a point mutation in agef-1a (vd92) as a suppressor of tbc-10 PLM axon break phenotype (all experiments also have a dominant allele of unc-70) and confirm that point mutation is causative by replicating the mutation via genome editing (vd123). Rescue experiments demonstrate that AGEF-1a is required in the epidermis and not PLM as previously demonstrated with tbc-10 and unc-70. Rescue is dependent on a functional SEC7/GEF activity. AGEF-1a is a functional ortholog to human BIG2/ArfGEF2 as its expression fully rescues tbc-10. AGEF-1a functions upstream of RAB-35 as expression of activated RAB-35 can suppress loss of agef-1. AGEF-1a functions in parallel to RME-4 as the double has stronger suppression of tbc-10. AGEF-1a is an ARF GEF, however it functions independently of ARF-1.2 as loss of arf-1.2 does not suppress tbc-10. They demonstrate that AGEF-1a interacts with RAB-35 through colocalization experiments suggesting that AGEF-1a could directly activate RAB-35. Finally, they demonstrate that AGEF-1a regulates the localization of the LET-805 epidermal attached complex component as it restores localization in a tbc-10 mutant.

      Major comments

      The manuscript is well written and easy to understand.

      The experiments are well done and controlled.

      I enjoyed reading this paper. However...

      Some of the claims are not supported by the data.

      __1) The claim that AGEF-1a directly interacts with RAB-35 was not demonstrated. The evidence provided to support a direct interaction are colocalization experiments in Figure 3. AGEF-1a does partially colocalize with RAB-35 in the epidermis. However, colocalization does not indicate a physical interaction direct or indirect. A simple fix would be to change the claim to that they partially colocalize. Optional, a physical interaction could be done with the split-GFP since they already have the AGEF-1 strain or they could perform co-IP experiments, though neither of those are proof of direct interactions.

      __

      We agree that the biochemical co-IP experiment could provide some answers, however, using a full length AGEF-1a would not only represent a significant technical challenge but will also not prove a direct interaction in a physiological context. To overcome this limitation, and to directly test their interaction in vivo, we propose to use a split-GFP approach as suggested by the reviewer. In this experiment, we will generate an endogenously tagged GFP1-10::rab-35 allele and combine it with the previously generated and available tagged agef-1a::GFP11x7. If AGEF-1 and RAB-35 closely interact, we should observe the reconstitution of full length GFP. It is possible that the endogenously tagged versions only provide a very weak GFP signal that will be difficult to detect. As an alternative approach, we will generate the same tagged molecules as overexpressed transgenes under epidermal-specific promoters (such as Pdpy-7). If the results are still negative, we agree to temper our claim that these molecules physically interact and rephrase the manuscript to reflect the new data.

      2) The claim that AGEF-1a facilitates RAB-35 activation is not supported. While it is likely that AGEF-1a facilitates RAB-35 activation based on the epistasis experiments as well as studies in mammalian cells there were no experiments to demonstrate that modulating AGEF-1a activity resulted in a change in RAB-35 activity. I would suggest tempering this claim to something along the line that the data are consistent with AGEF-1a regulating RAB-35 activity as shown in mammalian cells. An optional experiment would be to look at the colocalization of RAB-35 with a known effector in wild type and agef-1(vd92) with the expectation that there would be a higher level of colocalization in agef-1 mutants. Effector pull-down experiments or perhaps a cell based GEF assay could be used (PMID: 35196081).


      We welcome this suggestion and acknowledge the limitations of these experiments. While we might be able to determine if AGEF-1 and RAB-35 physically interact in vivo with the experiments proposed above, screening for the relevant rab-35 effector in this context and/or doing effector pull-down/cell based GEF assays would be a significant technical challenge. We propose to temper our claim as suggested.

      3) The claim that AGEF-1a functions independently of ARF-1.2 is not well supported. The fact that the ARF-1.2 mutant does not suppress tbc-10 suggests that ARF-1.2 may not be involved but does not eliminate the possibility that ARF-1.2 functions redundantly with ARF-5 or WARF-1/ARF-1.1. This can be resolved by toning down the claim. Alternatively, this can be tested by RNAi of arf-5 and warf-1 in tbc-10 and arf-1.2; tbc-10 mutants.

      We agree that warf-1 and arf-5 could be functioning redundantly with arf-1.2. We have attempted to generate an AID::arf-5 allele to test the effect of cell-specific degradation, but homozygous AID::arf-5 animals were lethal. We have not yet examined warf-1. We believe the best way to test these two molecules is through RNAi knockdown, and we propose to do this experiment and adjust our interpretation and discussion according to the new data.

      Minor comments

      Figure 1C the CRISPR generated allele (vd123) is referred to as [S784L] and then in 1E vd92 is referred to as [S784L]. Perhaps it would be clearer if the allele name was used instead of the amino acid change.

      We will reformat the manuscript to include the allele names instead of amino acid change.

      Page 6 "We reasoned that if the S784L mutation we isolated causes a similar loss of the GTPase activation function, then SKIN::AGEF-1a[E608K] would not have the capacity to restore the rate of PLM axon breaks to background levels in agef-1[S784L]; tbc-10; vdSi2 animals." It was unclear to me whether you were testing if the S784L mutation could be disrupting a GEF independent function or might disrupt the nucleotide exchange activity as might be tested in a biochemical assay. There are many reasons this change could cause a loss of function phenotype (ie. Improper folding, mislocalization, etc.). The most clear explanation would be that you were testing if GEF function was required for rescue rather than testing if the S784L mutation disrupted GEF activity.

      Indeed, this experiment reveals that reducing the activation of the AGEF-1 target phenocopies the effect of S784L and does not further enhance the effect of S784L. However, it does not answer if, specifically, the GEF function is affected by S784L. We propose to rewrite the quoted sentence as follows: "We asked whether the GEF function is required for axonal damage. If that is the case, then SKIN::AGEF-1a[E608K] overexpression should phenocopy the effect of AGEF-1a[S784L]."

      Page 13. It was unclear how testing if AGEF-1, RME-4, ARF-5 and RAB-35 form complexes in vivo (I assume you are suggesting colocalize based on figure 3 interpretation) would resolve how AGEF-1 was regulating RAB-35.


      We apologize that our phrasing was not clear. We will rewrite this section to better reflect the following idea. Given literature data showing an allosteric interaction between RME-4/DENND1 and ARF-5/Arf5, and our own data showing that AGEF-1 regulates RAB-35, we believe these molecules could form a complex. Considering that we do not have data to support this notion, mostly due to the inability to test the effect of ARF-5, we will present this possibility in the discussion section.


      __**Cross-commenting**

      I agree with the comments made by the other reviewers and I stand by my own as well. I will echo that it is important to know the nature of their agef-1 allele.

      Reviewer #1 (Significance (Required)):

      The identification by Bonacossa-Pereira et al. of AGEF-1 as a regulator of axon integrity that functions in a pathway with RAB-35 in the epidermis is an exciting finding. As pointed out in the discussion, mutations in the human ortholog cause neurodevelopmental defects, which has led to the obvious characterization of BIG2/ArfGEF2 in neurons, while this study indicates that this protein can have cell non-autonomous roles in regulating neurons. These findings could have important implications for understanding the etiology of these defects that would be of interest to neurobiologists and clinical researchers.

      The finding of this paper would also be of interest to cell biologists and particularly those studying the roles of Rab and Arf GTPases in membrane trafficking, such as myself. The idea that AGEF-1 might function as a Rab35 GEF is provocative and would generate a lot of interest and skepticism from the field. However, there is no data to support that AGEF-1 would be a direct regulator of Rab35 over the previously demonstrated cross regulation of Rab35 by Arf GTPases. Therefore, it would be fine to speculate in the discussion a direct interaction, but I would refrain from suggesting this as a model and elsewhere in the manuscript.

      __

      Although we agree that current evidence is not sufficient to support a model in which AGEF-1 is a direct regulator of RAB-35, our data point to an important genetic relationship between these molecules in a physiological context in a living animal, with a defined phenotype relevant to nervous system maintenance. We think that the proposed revision experiments will provide a better understanding of how AGEF-1 functions with RAB-35 and we agree with the suggestion to rephrase our manuscript to reflect the limitations of our results.


      __Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      This interesting manuscript reports the outcome of a fruitful C. elegans genetic screen with a complex but clever design. Through it, the authors identify AGEF-1 as a GEF that likely regulates the active state of the GTPase RAB-35 in the skin to protect touch receptor axons from mechanical breakage.

      Major points: 1. Based on localization experiments, the authors claim "AGEF-1a interacts with RAB-35 in the epidermis" (Results heading) and state "these data demonstrate that AGEF-1a interacts with a subset of RAB-35 molecules in the epidermis." In general, localization studies cannot be used to conclude physical interaction (with some exceptions such as single-molecule kinetics). In this case, the data in my view do not even make a compelling argument for co-localization. There is a lot of AGEF-1 and RAB-35 signal everywhere and it may not be meaningful that the signals sometimes overlap. A more quantitative approach with controls would be needed to conclude meaningful co-localization. Importantly, this would still not demonstrate interaction.__

      We thank the reviewer for the comment. Indeed, co-localization does not prove a physical interaction, and we appreciate the concern that our imaging data do not make a compelling argument. To address this concern, we plan to perform an experiment using a more robust, quantitative and physiologically relevant strategy. We will generate an endogenously tagged mScarlet3::rab-35 allele for precise endogenous localization. In addition, as a positive control, we will generate an endogenous rme-4::GFP11x7 allele to cell-specifically demonstrate the level of colocalization of RME-4 with mScarlet3::RAB-35 within the epidermis. To address the possible interaction between AGEF-1a and RAB-35 we will leverage a split-GFP approach to assess their interaction in vivo, in the context relevant to the phenotypes we observed (see reply to reviewer #1 point 1).

      __2. The effect of the AGEF-1(S784L) mutation is not clear to me. Naively, as the S784L mutation lies in the auto-inhibitory domain, I would have expected AGEF-1 to become constitutively active, not inactive as the authors seem to suggest. Is the idea that it is constitutively auto-inhibited? The main evidence for a loss of function effect seems to be that a putative dominant negative mutation AGEF-1(E608K) does not further suppress axon breakage when co-expressed in trans to AGEF-1(S784L), but in my view this only shows that, once the defect is suppressed, it cannot be suppressed any further. Defining the nature of the S784L allele is important. Some suggestions, although the authors may come up with different approaches: use of an inducible or cell-specific depletion system like AID/TIR1, Cre/lox, or FLP/FRT to circumvent the lethality of agef-1(0) and reveal what a true loss-of-function looks like; testing if deletion of the auto-inhibitory domain phenocopies S784L to test if this mutation impairs autoinhibition.

      __

      This is a very insightful comment. To address this point, we will follow the reviewer's suggestion and deplete AGEF-1 cell-specifically in the epidermis using the auxin-inducible degron system. Specifically, we will generate an agef-1::AID allele to degrade this molecule in a spatially and temporally controlled fashion, which will allow us to circumvent the lethality of agef-1(0) and determine whether the S784L allele mimics the depletion of AGEF-1.

      Although it would be interesting to further dissect the effect of this mutation on AGEF-1 activity, we believe that this falls outside the scope of this manuscript. As an alternative, we propose to elaborate in the discussion on the possible roles of the S784L mutation to clarify our model of its function. Our data support a model in which this mutation reduces AGEF-1 function, leading to a reduction in the activity of its downstream target GTPases. It is possible that this is due to AGEF-1 becoming constitutively autoinhibited, or that this mutation affects the structure of the molecule in a way that reduces its affinity towards its downstream effectors.

      Minor points: 1. I am not able to see the "vesicle-like structures with a clear luminal space" or RAB-35 being "notably enriched at the membrane near the epidermal furrow" in Fig. 3. The "3D surface rendering" in Fig. 3e is grossly oversampled and should not be included.

      We will rectify this section and include new super-resolved images using Airyscan confocal microscopy. We hope these will yield a better-quality representation of these concepts.

      __2. As the agef-1a isoform is specifically referenced throughout, please describe the different agef-1 isoforms somewhere to save readers from having to look this up.__

      Yes, we will include a description of the isoforms. In C. elegans there are two: AGEF-1a which has been confirmed by cDNA and AGEF-1b which is predicted and partially confirmed by cDNA. The mutation we isolated exclusively affects AGEF-1a.

      3. The authors include an interesting speculation in the Discussion: "Future investigations of BIG2-associated neurological disorders should consider... hyper-activity of BIG2 as a driver of neuropathology." If the authors have the tools to test the effect of hyperactive BIG2 in this system, it could be an exciting addition.


      This is an exciting idea that we would like to keep in the Discussion. The biology of BIG2 activity regulation is a nascent field of research and we believe that to accurately generate and characterise a hyperactive BIG2 would be beyond the scope of this manuscript.

      __ On a personal note, since GEFs act oppositely to GTPase Activating Proteins (GAPs), I had to stop and re-read carefully whenever the authors referred to a GEF "activating" a GTPase. I understand their meaning (i.e., putting the GTPase in its active GTP-bound state, not activating its GTPase function) but I wanted to point out this potential confusion in case there is a way to better define terms in the Introduction or change word choice. I realize this may be a standard jargon in the field.__

      Indeed, this is confusing nomenclature and a difficult concept to deliver in an accurate and succinct manner. We propose to include a clearer, more didactic explanation of their function. In a simple explanation, GTPases perform cellular functions when bound to GTP. GAPs terminate GTPase activity by catalysing GTP hydrolysis, generating GDP. GEFs initiate GTPase activity by catalysing the release of GDP and allowing GTP binding.

      __ Please check the correct nomenclature for CRISPR/Cas9.__


      We will rectify where appropriate.

      __6. p.7 "these molecules act in synergy", consider replacing with "redundantly".

      __

      We will rectify where appropriate.

      __Reviewer #2 (Significance (Required)):

      The significance of this story is to show that GEF-GTPases pairing can be highly context-dependent. Previous studies have identified GEFs that pair with RAB-35 and GTPases that pair with AGEF-1, but the authors find that these factors have at best a modest role in the context of skin-axon interactions. Instead, the authors suggest a novel GTPase-GEF pairing of RAB-35 with AGEF-1 and provide evidence that this relationship is conserved in the human homolog of AGEF-1. These results suggest that GTPase-GEF pairings depend not only on chemical affinity but also cellular context.

      The main strength of the study is its clever genetics. For the screen, the authors looked for suppressors of a synthetic defect in axon integrity caused in part by elevated activity of RAB-35 due to loss of its GAP TBC-10. It is satisfying that this screen isolated a mutation in a GEF that in principle could counterbalance the loss of a GAP.

      The main weakness of the study is the lack of direct evidence for an AGEF-1/RAB-35 interaction. While not necessary for publication, the inclusion of biochemical data to support the role of AGEF-1 as a GEF for RAB-35 and the effect of the S784L mutation on this activity would strongly elevate the study. The genetic data for this interaction are consistent with the model but not conclusive, and in my view the colocalization data are not compelling. Nevertheless this is a solid genetic story with a clever screen.__

      We appreciate the feedback and are grateful for the positive comments on the significance of our study. As explained in the significance section related to Reviewer 1, if we find evidence of a direct interaction between AGEF-1 and RAB-35 in the proposed new experiments, we will include it in the manuscript; alternatively, we will present it as a possibility in the discussion section, as suggested. We agree that a more nuanced understanding of the effect of the S784L mutation is interesting and that our colocalization data can be improved, and we have proposed experiments to address these concerns.

      __Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      This paper investigates the mechanism by which molecular pathways in the skin protect the processes of nerves that innervate them from damage. The authors previously showed that spectrin and the small GTPase RAB-35 act in the epidermis of C. elegans to protect mechanosensory axons from breaking. In this paper they used a suppression screen to identify another gene involved in this process, an ARF-GEF called AGEF-1. Partial loss-of-function mutations in agef-1 suppress the axon-breakage phenotype of spectrin mutations, and genetic experiments by the authors are consistent with the possibility that AGEF-1 could act directly as an exchange factor for RAB-35. Consistent with this model, they show that AGEF-1 and RAB-35 colocalise in the skin.

      Major comments: The experiments in this paper are well-designed and well-controlled, and the interpretations of the results are all reasonable. On the other hand, I don't think the authors' hypothesis that AGEF-1 acts directly as an exchange factor for RAB-35, or that these two proteins directly interact, is definitively proven. This is not an issue of the authors overinterpreting their data--the paper is very carefully and thoughtfully written. However, the most interesting and counterintuitive finding--that an ARF-GEF could also be a RAB-GEF--might be strengthened with more experiments (for example, could they more directly show protein-protein interaction through co-IP or mass spec?).__

      We thank the reviewer for the suggestion. We propose to further investigate the notion that AGEF-1a might be a direct interactor of RAB-35 using a split-GFP approach to assess whether these molecules closely interact, in vivo, in the physiological context that is relevant for the maintenance of the touch sensing neurons (please see reply to reviewer #1 major point 1 and reviewer #2 major point 1 for more details).

      Minor comments: There are also two places where the fact that null mutations are lethal (for agef-1 and arf-5) prevented the authors from addressing the effect of agef-1 loss of function in the skin, and addressing whether ARF-5 could be an AGEF-1 target, respectively. In principle, they could have tried to make a CRISPR line in which these genes could be cell-specifically deleted in the skin (using a dpy-7-driven recombinase). I don't think either of these experiments are essential, but if it is feasible to make these lines it would tie up a couple of loose ends.

      We agree to explore the roles of agef-1 and arf-5 loss-of-function. We propose to tissue-specifically degrade agef-1 using an auxin-inducible degradation strategy (please see reviewer #2 major point 2 reply for more details). For arf-5, we propose knocking-down its function using RNAi to overcome lethality (please see reviewer #1 major point 3 reply for more details).

      __Reviewer #3 (Significance (Required)):

      Overall I think this is an interesting paper on a topic of general interest. The most interesting finding is that an exchange factor for an ARF (a small GTPase involved in vesicle coating/uncoating) could also be an exchange factor for a RAB (a small GTPase involved in vesicle tethering). The evidence presented is suggestive and intriguing, though as noted above not completely definitive. In summary, I think it is an interesting paper in its current form, and anything it could do to more firmly establish a direct interaction between AGEF-1 and RAB-35 would increase its impact and importance.

      __

      We thank the reviewer for the positive evaluation of the significance of our study.

      __ Reviewer #4 (Evidence, reproducibility and clarity (Required)):

      Summary: In this study Bonacossa-Pereira et al. identify AGEF-1a, an Arf-GEF, as a factor that functions in the epidermis through RAB-35 to regulate axonal integrity of the PLM mechanosensory neurons in C. elegans. Specifically, epidermal attachment sites are regulated by these genes from the epidermis, and compromising these attachment sites results in axonal degeneration. The study provides some evidence that RAB-35 and AGEF-1 at least partially colocalize in the skin. Finally, the authors provide evidence that the human orthologue BIG2 is capable of functionally replacing AGEF-1a in C. elegans. Overall, the experiments are well designed and the paper is clear and succinct. The conclusions are supported by the findings and provide an important extension of the authors' findings from a few years back, when they identified the role of rab-35 in mediating the epidermal-neuronal attachment sites.

      Major comments: 1. AGEF-1/BIG2 are known to regulate other GTPases such as ARF-5 or ARF-2. The authors exclude a non-redundant function for ARF-2, but are unable to establish a role for ARF-5 because of the lethality associated with the mutation. Alternative approaches, such as cell-specific knockout or knockdown experiments, could be used. In addition, studies to test potentially physical interaction such as pull-down assays, co-IP experiments and FRET could be used to test whether AGEF-1 can bind RAB-35 or ARF-5.__

      We thank the reviewer for this suggestion. We propose addressing these concerns using tissue-specific degradation of AGEF-1a (please see reviewer #2 major point 2 for details). To establish a role for ARF-5 we propose to do an RNAi-mediated knockdown to overcome lethality (please see reviewer #1 major point 3 for details). Finally, we plan to use a split-GFP approach to test the physical interaction between AGEF-1a and RAB-35 in vivo (please see reviewer #1 major point 1 for details).

      __ Phenotypic readout has been limited to only axon breaks. It may be interesting to also test other aspects such as axonal deformities including swellings and vesiculation in other parts of the nervous system. Moreover, behavioral or functional experiments such as response to gentle touch or synaptic integrity could be informative.__

      We have not observed any obvious touch receptor neuron axonal phenotypes other than axonal breaks in these mutants, and we will include a statement to this effect. Regarding behavior, we have not tested it because the results would be difficult to interpret for two reasons: first, the breaks are not always bilateral and one neuron is sufficient to provide a mechanical response; second, the mixed identity of the PLM neurite allows it to retain some function despite being severed. However, if deemed essential, we will perform these experiments.

      __ Overexpression constructs such as SKIN::RAB-35[Q69L], SKIN::BIG2, SKIN::AGEF-1a[E608K] in extrachromosomal transgenes could lead to non-physiological localization or effects. Single copy expression using MosSCI or CRISPR insertions are generally considered better approaches (other than endogenous reporters) to provide accurate insights at the physiological level. While the authors tacitly acknowledge this by conducting the experiments in a rab-35 mutant background and very low transgene concentration, at the very least this caveat regarding the localization should be discussed.__

      This is an important remark, and we appreciate the comment. We acknowledge that experiments using extrachromosomal arrays have inherent caveats, especially for localization studies. To address the RAB-35 localization concern we plan to repeat the localization studies using an endogenously tagged RAB-35 using CRISPR to overcome the possible artifacts caused by extrachromosomal array driven expression (please see reviewer #1 point 1 for more details). For the cell-specific rescues or dominant-negative constructs expression, we believe that using extrachromosomal arrays is sufficient, since this allows us to compare genetically identical transgenic vs non-transgenic siblings of independent lines. Moreover, given these constructs are already driven by a tissue-specific promoter that is inherently stronger than their respective endogenous promoters, even a single-copy insertion would have the same caveats.

      __4. The study does not address clearly whether AGEF-1a acts in parallel to spectrin or upstream/ downstream to it. Epistasis experiments could help to figure out the signaling pathway involved.

      __

      Indeed, this is a concept that we need to communicate more clearly. We have data showing that a mutation in agef-1 does not cause axonal damage on its own, and that it has no effect on the axonal damage caused by unc-70 dominant negative mutation alone. We only detect an effect of agef-1 when tbc-10 is mutated together with unc-70 (Fig. 1a of manuscript). Together, these data indicate that agef-1 functions upstream of rab-35, thus acting in parallel to unc-70 (see schematic below) to ensure the mechanical stability of neuron epidermal attachment. We plan to include this data and the following schematic as a supplement to better convey the idea and discuss the results appropriately.

      __ The finding that BIG2 rescues the mutant defect is an important finding and rightfully finds its place in the abstract. I wonder whether a reference to the human diseases caused by loss of BIG2 in the abstract and introduction would not increase interest/impact for the study, rather than burying this potentially interesting connection in the discussion.

      __

      We appreciate the reviewer's comment, and welcome the suggestion. We propose to include relevant background about BIG2-related human diseases in the abstract and introduction as suggested and expand the discussion regarding BIG2 mutations.

      __Minor comments:

      1. Some explanation about how mutating the autoinhibitory domain could impact the catalytic activity of a GEF might be helpful.__

      We acknowledge that this notion was not well communicated. We propose to elaborate more about why we think a mutation in the autoinhibitory domain might be affecting the GEF activity and we plan to do further experiments to dissect how this might be happening. Please see reviewer #2 major point 2 for a more detailed explanation.

      __ The paper refers to rme-4(b1001) as a null allele while wormbase refers to the same as a missense allele. It would be more accurate to refer rme-4(b1001) as a strong loss of function or putative null.__

      We agree and will refer to b1001 as a strong loss-of-function.

      __ The paper does not clearly discuss limitations of the hypomorphic agef-1[S784L] and that the observed phenotypes in this hypomorph might underestimate the complete role of AGEF-1a.__

      We thank the reviewer for this suggestion. We propose to elaborate more on these limitations, especially considering the possible new results from the experiments suggested in reply to reviewer #2 major comment point 2.

      __ In figure 1, was there really only one extrachromosomal transgenic line for some of the constructs tested? __

      For the Pdpy-7::AGEF-1a lines we have scored 3 transgenic lines (data not included) and only one yielded a full rescue. For all extrachromosomal lines presented, we tested 3 independent transgenic lines. For brevity, we only included the result for the positive rescues (1 for BIG2 and 1 for AGEF-1a), except for the Pmec-4 lines, of which none rescued the phenotype (data included in Table S2). We will update Table S2 to include all the lines tested.

      __ The concentrations of transgenes vary in different transgenes. Is there a rationale behind this? __

      Yes, we have attempted multiple concentrations of injections for each transgene and there was some variability for each construct injected, thus we only included the ones where we observed an effect. As mentioned in point 4 above, we will update Table S2 to include details of all lines tested.

      __ In Fig. 1e: It may be useful to also show the "WT" phenotype, i.e. the strong defects, to get a visual comparison for the degree of rescue. __

      We think this suggestion will help the readers. We will include this as a representative dashed line showing the WT phenotype.

      __Reviewer #4 (Significance (Required)):

      The study has identified AGEF-1a as a regulator of axonal maintenance, functioning to protect neurons against mechanical stress by acting through RAB-35. Additionally, this epidermal GEF, AGEF-1a, is functionally conserved, as its human orthologue BIG2 can replace AGEF-1a in C. elegans for axonal protection. An important point here is that the findings extend prior work by the authors on a non-autonomous mechanism that regulates epidermal-neuronal attachment. In my humble opinion, the human disease connection, in particular with regard to the unexplained neuronal phenotypes in patients, could be better developed in the manuscript. It may also increase the impact of, and interest in, a wonderful story that right now reads a bit 'wormy'.__


      This is an important remark and we are grateful for the positive comments. The fact that the function of human BIG2 is conserved in C. elegans points to a fundamental role of this molecule in multicellular life, and it provides a tractable model to investigate the function of this molecule in a physiological context. We welcome the suggestion to elaborate more on the connection with the unexplained neuronal phenotypes in patients and to use more accessible language to convey our findings to a wider audience.


      3. Description of the revisions that have already been incorporated in the transferred manuscript

      N/A

      4. Description of analyses that authors prefer not to carry out

      __Reviewer #1 __


      "...studies to test potentially physical interaction such as pull-down assays, co-IP experiments and FRET could be used to test whether AGEF-1 can bind RAB-35 or ARF-5."


      While pull-down assays, co-IP and FRET would reveal whether AGEF-1a can form a complex with RAB-35, we believe that using a full length AGEF-1a would not only represent a significant technical challenge but will also not prove a direct interaction in a physiological context.


      "...An optional experiment would be to look at the colocalization of RAB-35 with a known effector in wild type and agef-1(vd92) with the expectation that there would be a higher level of colocalization in agef-1 mutants. Effector pull-down experiments or perhaps a cell based GEF assay could be used (PMID: 35196081)."


      We think that screening for the relevant rab-35 effector in this context and/or doing effector pull-down/cell based GEF assays would be a significant technical challenge. We propose to address this concern by tempering our claim as suggested by the reviewer.


      "...It may be interesting to also test other aspects such as axonal deformities including swellings and vesiculation in other parts of the nervous system. Moreover, behavioral or functional experiments such as response to gentle touch or synaptic integrity could be informative."

      As indicated above in major point 2 of reviewer 4, these are interesting ideas that might answer how the function of these neurons might be affected. However, in addition to the challenges indicated above, they will not provide further insights into how their integrity is maintained. We believe these will fall outside the scope of the manuscript, but if deemed essential we will perform behavioral analysis.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      McDougal et al. aimed to characterize the antiviral activity of mammalian IFIT1 orthologs. They first performed three different evolutionary selection analyses within each major mammalian clade and identified some overlapping positive selection sites in IFIT1. They found that one site that is positively selected in primates is in the RNA-binding exit tunnel of IFIT1 and is tolerant of mutations to amino acids with similar biochemical properties. They then tested 9 diverse mammalian IFIT1 proteins against VEEV, VSV, PIV3, and SINV and found that each ortholog has distinct antiviral activities. Lastly, they compared human and chimpanzee IFIT1 and found that the determinant of their differential anti-VEEV activity may be partly attributed to their ability to bind Cap0 RNA. 

      Strengths: 

      The study is one of the first to test the antiviral activity of IFIT1 from diverse mammalian clades against VEEV, VSV, PIV3, and SINV. Cloning and expressing these 39 IFIT1 orthologs in addition to single and combinatorial mutants is not a trivial task. The positive connection between anti-VEEV activity and Cap0 RNA binding is interesting, suggesting that differences in RNA binding may explain differences in antiviral activity. 

      Weaknesses: 

      The evolutionary selection analyses yielded interesting results, but were not used to inform follow-up studies except for a positively selected site identified in primates. Since positive selection is one of the two major angles the authors proposed to investigate mammalian IFIT1 orthologs with, they should integrate the positive selection results with the rest of the paper more seamlessly, such as discussing the positive selection results and their implications, rather than just pointing out that positively selected sites were identified. The paper should elaborate on how the positive selection analyses PAML, FUBAR, and MEME complement one another to explain why the tests gave them different results. Interestingly, MEME which usually provides more sites did not identify site 193 in primates that was identified by both PAML and FUBAR. The authors should also provide the rationale for choosing to focus on the 3 sites identified in primates only. One of those sites, 193, was also found to be positively selected in bats, although the authors did not discuss or integrate that finding into the study. In Figure 1A, they also showed a dN/dS < 1 from PAML, which is confusing and would suggest negative selection instead of positive selection. Importantly, since the authors focused on the rapidly evolving site 193 in primates, they should test the IFIT1 orthologs against viruses that are known to infect primates to directly investigate the impact of the evolutionary arms race at this site on IFIT1 function. 

      We thank the reviewer for their assessment and for acknowledging the breadth of our dataset regarding diverse IFIT1s, number of viruses tested, and the functional data that may correlate biochemical properties of IFIT1 orthologous proteins with antiviral function. We have expanded the introduction and results sections to better explain and distinguish between PAML, FUBAR, and MEME analyses. Furthermore, we have expanded the discussion to incorporate the observation that site 193 is rapidly evolving in bats, as well as the observation that sites near the TPR4 loop were identified as rapidly evolving in all clades of mammals tested. We do observe an overall gene dN/dS of <1; however, this is simply the average across all codons of the entire gene and does not rule out positive selection at specific sites. This is observed for other restriction factors, as many domains are undergoing purifying selection to retain core functions (e.g., enzymatic function, structural integrity) while other domains (e.g., interfaces with viral antagonists or viral proteins) show strong positive selection. Specific examples include the restriction factors BST-2/Tetherin (PMID: 19461879) and MxA (PMID: 23084925). Furthermore, we agree that testing more IFIT1-sensitive viruses that naturally infect primates with our IFIT1 193 mutagenesis library would shed light on the influence of host-virus arms races at this site. However, VEEV does naturally infect humans as well as at least one other species of primate (PMID: 39983680).
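
      The point that a gene-wide dN/dS below 1 can coexist with positively selected sites can be illustrated with a minimal sketch. The per-codon values below are hypothetical and chosen only for illustration; they are not estimates from the IFIT1 alignments, and treating the gene-wide value as a simple arithmetic mean of per-site values is a simplifying assumption that follows the wording of the response rather than the exact calculation performed by PAML.

```python
from statistics import mean

# Hypothetical per-codon dN/dS (omega) values, for illustration only.
# Most codons are under purifying selection (omega << 1); two are not.
site_omega = [0.1, 0.2, 0.05, 0.3, 2.8, 0.15, 0.1, 3.5, 0.2, 0.1]

# Simplifying assumption: gene-wide omega taken as the mean over codons.
gene_wide_omega = mean(site_omega)

# 1-based codon positions whose omega exceeds 1 (candidate positive selection).
positively_selected = [i + 1 for i, w in enumerate(site_omega) if w > 1.0]

print(f"gene-wide mean omega: {gene_wide_omega:.2f}")
print(f"codons with omega > 1: {positively_selected}")
```

      In this toy case the gene-wide average stays below 1 even though two codons individually exceed 1, which is the situation that site-level tests are designed to detect.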

      Below we individually address the reviewer's claims of inaccurate data interpretation.

      Some of the data interpretation is not accurate. For example: 

      (1) Lines 232-234: "...western blot analysis revealed that the expression of IFIT1 orthologs was relatively uniform, except for the higher expression of orca IFIT1 and notably lower expression of pangolin IFIT1 (Figure 4B)." In fact, most of the orthologs are not expressed in a "relatively uniform" manner e.g. big brown bat vs. shrew are quite different. 

      We have now included quantification of the western blots to allow the reader to compare IFIT1 expression with the infection data (Updated Figure 4B and 4G). We have also removed the phrase “relatively uniform” from the text and have instead included text describing the quantified expression differences.

      (2) Line 245: "...mammalian IFIT1 species-specific differences in viral suppression are largely independent of expression differences." While it is true that there is no correlation between protein expression and antiviral activity in each species, the authors cannot definitively conclude that the species-specific differences are independent of expression differences. Since the orthologs are clearly not expressed in the same amounts, it is impossible to fully assess their true antiviral activity. At the very least, the authors should acknowledge that the protein expression can affect antiviral activity. They should also consider quantifying the IFIT1 protein bands and normalizing each to GAPDH for readers to better compare protein expression and antiviral activity. The same issue is in Line 267. 

      We have now included quantification and normalization of the western blots to allow the reader to compare IFIT1 expression with the infection data (Updated Figure 4B and 4G). Furthermore, we acknowledge in the text that expression differences may affect antiviral potency in infection experiments.

      (3) Line 263: "SINV... was modestly suppressed by pangolin, sheep, and chinchilla IFIT1 (Figure 4E)..." The term "modestly suppressed" does not seem fitting if there is 60-70% infection in cells expressing pangolin and chinchilla IFIT1. 

      We have modified the text to say “significantly suppressed” rather than “modestly suppressed.”

      (4) The study can be significantly improved if the authors can find a thread to connect each piece of data together, so the readers can form a cohesive story about mammalian IFIT1. 

      We appreciate the reviewer’s suggestion and have tried to make the story more cohesive through commentary on positive selection and by using the computational analysis to inform potential evolutionary consequences of IFIT1 functionality, first through an intraspecies (human) approach and then through an interspecies approach with diverse mammals that have great sequence diversity. Furthermore, we point out that almost all IFIT1s tested in the ortholog screen were also included in our computational analysis, allowing for the potential to connect functional observations with those seen in the evolutionary analyses.
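
      The GAPDH normalization requested in point (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' quantification pipeline: the band intensities are hypothetical densitometry values (not data from Figure 4B), the helper name normalize_expression is invented for this example, and scaling to the human lane is an arbitrary choice of reference.

```python
# Hypothetical band intensities in arbitrary densitometry units.
# These are illustrative numbers, not measurements from the manuscript.
bands = {
    "human": {"ifit1": 1200.0, "gapdh": 1000.0},
    "orca": {"ifit1": 2700.0, "gapdh": 900.0},
    "pangolin": {"ifit1": 330.0, "gapdh": 1100.0},
}


def normalize_expression(bands, reference="human"):
    """Divide each IFIT1 band by its GAPDH loading control,
    then express every lane relative to the reference lane."""
    ratios = {name: b["ifit1"] / b["gapdh"] for name, b in bands.items()}
    ref = ratios[reference]
    return {name: ratio / ref for name, ratio in ratios.items()}


relative = normalize_expression(bands)
for name, value in relative.items():
    print(f"{name}: {value:.2f}")
```

      Values expressed this way can be placed alongside percent infection for each ortholog, so that differences in antiviral activity can be read against differences in expression.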

      Reviewer #2 (Public review): 

      McDougal et al. describe the surprising finding that IFIT1 proteins from different mammalian species inhibit the replication of different viruses, indicating that the evolution of IFIT1 across mammals has resulted in host species-specific antiviral specificity. Before this work, research into the antiviral activity and specificity of IFIT1 had mostly focused on the human ortholog, which was described to inhibit viruses including vesicular stomatitis virus (VSV) and Venezuelan equine encephalitis virus (VEEV) but not other viruses including Sindbis virus (SINV) and parainfluenza virus type 3 (PIV3). In the current work, the authors first perform evolutionary analyses on IFIT1 genes across a wide range of mammalian species and reveal that IFIT1 genes have evolved under positive selection in primates, bats, carnivores, and ungulates. Based on these data, they hypothesize that IFIT1 proteins from these diverse mammalian groups may show distinct antiviral specificities against a panel of viruses. By generating human cells that express IFIT1 proteins from different mammalian species, the authors show a wide range of antiviral activities of mammalian IFIT1s. Most strikingly, they find several IFIT1 proteins that have completely different antiviral specificities relative to human IFIT1, including IFIT1s that fail to inhibit VSV or VEEV, but strongly inhibit PIV3 or SINV. These results indicate that there is potential for IFIT1 to inhibit a much wider range of viruses than human IFIT1 inhibits. Electrophoretic mobility shift assays (EMSAs) suggest that some of these changes in antiviral specificity can be ascribed to changes in the direct binding of viral RNAs. Interestingly, they also find that chimpanzee IFIT1, which is >98% identical to human IFIT1, fails to inhibit any tested virus. Replacing three residues from chimpanzee IFIT1 with those from human IFIT1, one of which has evolved under positive selection in primates, restores activity to chimpanzee IFIT1. Together, these data reveal a vast diversity of IFIT1 antiviral specificity encoded by mammals, consistent with an IFIT1-virus evolutionary "arms race".

      Overall, this is a very interesting and well-written manuscript that combines evolutionary and functional approaches to provide new insight into IFIT1 antiviral activity and species-specific antiviral immunity. The conclusion that IFIT1 genes in several mammalian lineages are evolving under positive selection is supported by the data, although there are some important analyses that need to be done to remove any confounding effects from gene recombination that has previously been described between IFIT1 and its paralog IFIT1B. The virology results, which convincingly show that IFIT1s from different species have distinct antiviral specificity, are the most surprising and exciting part of the paper. As such, this paper will be interesting for researchers studying mechanisms of innate antiviral immunity, as well as those interested in species-specific antiviral immunity. Moreover, it may prompt others to test a wide range of orthologs of antiviral factors beyond those from humans or mice, which could further the concept of host-specific innate antiviral specificity. Additional areas for improvement, which are mostly to clarify the presentation of data and conclusions, are described below. 

      Strengths: 

      (1) This paper is a very strong demonstration of the concept that orthologous innate immune proteins can evolve distinct antiviral specificities. Specifically, the authors show that IFIT1 proteins from different mammalian species are able to inhibit the replication of distinct groups of viruses, which is most clearly illustrated in Figure 4G. This is an unexpected finding, as the mechanism by which IFIT1 inhibits viral replication was assumed to be similar across orthologs. While the molecular basis for these differences remains unresolved, this is a clear indication that IFIT1 evolution functionally impacts host-specific antiviral immunity and that IFIT1 has the potential to inhibit a much wider range of viruses than previously described. 

      (2) By revealing these differences in antiviral specificity across IFIT1 orthologs, the authors highlight the importance of sampling antiviral proteins from different mammalian species to understand what functions are conserved and what functions are lineage- or species-specific. These results might therefore prompt similar investigations with other antiviral proteins, which could reveal a previously undiscovered diversity of specificities for other antiviral immunity proteins. 

      (3) The authors also surprisingly reveal that chimpanzee IFIT1 shows no antiviral activity against any tested virus despite only differing from human IFIT1 by eight amino acids. By mapping this loss of function to three residues on one helix of the protein, the authors shed new light on a region of the protein with no previously known function. 

      (4) Combined with evolutionary analyses that indicate that IFIT1 genes are evolving under positive selection in several mammalian groups, these functional data indicate that IFIT1 is engaged in an evolutionary "arms race" with viruses, which results in distinct antiviral specificities of IFIT1 proteins from different species. 

      Weaknesses: 

      (1) The evolutionary analyses the authors perform appear to indicate that IFIT1 genes in several mammalian groups have evolved under positive selection. However, IFIT1 has previously been shown to have undergone recurrent instances of recombination with the paralogous IFIT1B, which can confound positive selection analyses such as the ones the authors perform. The authors should analyze their alignments for evidence of recombination using a tool such as GARD (in the same HyPhy package along with MEME and FUBAR). Detection of recombination in these alignments would invalidate their positive selection inferences, in which case the authors need to either analyze individual non-recombining domains or limit the number of species to those that are not undergoing recombination. While it is likely that these analyses will still reveal a signature of positive selection, this step is necessary to ensure that the signatures of selection and sites of positive selection are accurate. 

      (2) The choice of IFIT1 homologs chosen for study needs to be described in more detail. Many mammalian species encode IFIT1 and IFIT1B proteins, which have been shown to have different antiviral specificity, and the evolutionary relationship between IFIT1 and IFIT1B paralogs is complicated by recombination. As such, the assertion that the proteins studied in this manuscript are IFIT1 orthologs requires additional support than the percent identity plot shown in Figure 3B. 

      (3) Some of the results and discussion text could be more focused on the model of evolution-driven changes in IFIT1 specificity. In particular, the chimpanzee data are interesting, but it would appear that this protein has lost all antiviral function, rather than changing its antiviral specificity like some other examples in this paper. As such, the connection between the functional mapping of individual residues with the positive selection analysis is somewhat confusing. It would be more clear to discuss this as a natural loss of function of this IFIT1, which has occurred elsewhere repeatedly across the mammalian tree. 

      (4) In other places in the manuscript, the strength of the differences in antiviral specificity could be highlighted to a greater degree. Specifically, the text describes a number of interesting examples of differences in inhibition of VSV versus VEEV from Figure 3C and 3D, but it is difficult for a reader to assess this as most of the dots are unlabeled and the primary data are not uploaded. A few potential suggestions would be to have a table of each ortholog with % infection by VSV and % infection by VEEV. Another possibility would be to plot these data as an XY scatter plot. This would highlight any species that deviate from the expected linear relationship between the inhibition of these two viruses, which would provide a larger panel of interesting IFIT1 antiviral specificities than the smaller number of species shown in Figure 4. 

      We thank the reviewer for their fair assessment of our manuscript. As the reviewer requested, we performed GARD analysis on our alignments used for PAML, FUBAR, and MEME (New Supp Fig 1). By GARD, we found 1 or 2 predicted breakpoints in each clade. However, much of the sequence was after or between the predicted breakpoints. Therefore, we were able to reanalyze for sites undergoing positive selection in the large region of the sequence that does not span the breakpoints. We were able to validate that almost all sites originally identified as undergoing positive selection still exhibit signatures of positive selection when these breakpoints are taken into account: primates (11/12), bats (14/16), ungulates (30/37), and carnivores (2/4). To further validate our positive selection analysis, we used Recombination Detection Program 4 (RDP4) to remove inferred recombinant sequences from the primate IFIT1 alignment and performed PAML, FUBAR, and MEME. Once again, the sites in our original analysis were largely validated by this method. Importantly, sites 170, 193, and 366 in primates, which are discussed in our manuscript, were found to be undergoing positive selection in 2 of the 3 analyses using alignments after the indicated breakpoint in GARD and after removal of recombinant sequences by RDP4. We have updated the text to acknowledge IFIT1/IFIT1B recombination more clearly and include the GARD analysis as well as PAML, FUBAR, and MEME reanalysis taking into account predicted breakpoints by GARD and RDP4. Furthermore, to increase evidence that the sequences used in this study for both computational and functional analysis are IFIT1 orthologs rather than IFIT1B, we have included a maximum likelihood tree after aligning coding sequences on the C-terminal end (corresponding to bases 907-1437 of IFIT1). In Daugherty et al. 2016 (PMID: 27240734) this strategy was used to distinguish between IFIT1 and IFIT1B. All sequences used in our study grouped with IFIT1 sequences (including many confirmed IFIT1 sequences used in Daugherty et al.) rather than IFIT1B sequences or IFIT3. These new data, including the GARD, RDP4, and maximum likelihood tree analyses, are included as a new Supplementary Figure 1.

      We also agree with the reviewer that it is possible that chimpanzee IFIT1 has lost antiviral function due to the residues 364 and 366 that differ from human IFIT1. We have updated the discussion sections to include the possibility that chimpanzee IFIT1 is an example of a natural loss of function that has occurred in other species over evolution as well as the potential consequences of this occurrence. Regarding highlighting the strength of differences in antiviral activity between IFIT1 orthologs, we have included several updates to strengthen the ability of the reader to assess these differences. First, we have included a supplementary table that includes the infection data for each ortholog from the VEEV and VSV screen to allow for readers to evaluate ranked antiviral activity of the species that suppress these viruses. In addition, the silhouettes next to the dot plots indicate the top ranked hits in order of viral inhibition (with the top being the most inhibitory) giving the reader a visual representation in the figure of top antiviral orthologs during our screen. We have also updated the figure legend to inform the reader of this information.

      Reviewer #3 (Public Review):  

      Summary: 

      This manuscript by McDougal et al. demonstrates species-specific activities of diverse IFIT1 orthologs and seeks to utilize evolutionary analysis to identify key amino acids under positive selection that contribute to the antiviral activity of this host factor. While the authors identify amino acid residues as important for the antiviral activity of some orthologs and propose a possible mechanism by which these residues may function, the significance or applicability of these findings to other orthologs is unclear. However, the subject matter is of interest to the field, and these findings could be significantly strengthened with additional data.

      Strengths:

      Assessment of multiple IFIT1 orthologs shows the wide variety of antiviral activity of IFIT1, and identification of residues outside of the known RNA binding pocket in the protein suggests additional novel mechanisms that may regulate IFIT1 activity.

      Weaknesses:

      Alternative hypotheses that might explain the variable and seemingly inconsistent antiviral activity of IFIT1 orthologs were not really considered. For example, studies show that IFIT1 activity may be regulated by interaction with other IFIT proteins, but this was not assessed in this study.

      Given that there appears to be very little overlap observed in orthologs that inhibited the viruses tested, it's possible that other amino acids may be key drivers of antiviral activity in these other orthologs. Thus, it's difficult to conclude whether the findings that residues 362/4/6 are important for IFIT1 activity can be broadly applied to other orthologs, or whether these are unique to human and chimpanzee IFIT1. Similarly, while the hypothesis that these residues impact IFIT1 activity in an allosteric manner is an attractive one, there is no data to support this.  

      We thank the reviewer for their fair assessment of our manuscript. To address the weaknesses that the reviewer has pointed out we have expanded the discussion to more directly address alternate hypotheses, such as the possibility of IFIT1 activity being regulated by interaction with other IFIT proteins. Furthermore, we expanded the discussion to include an alternate hypothesis for the role of residues 364 and 366 in primate IFIT1 besides allosteric regulation. In addition, we did not intend to claim or imply that residues 364/6 are the key drivers of antiviral activity for all IFITs tested. However, we speculate that within primates these residues may play a key role as these residues differ between chimpanzee IFIT1 (which lacks significant antiviral activity towards the viruses tested in this study) and human IFIT1 (which possesses significant antiviral activity). In addition, these residues seem to be generally conserved in primate species, apart from chimpanzee IFIT1. We have included changes to the text to more clearly indicate that we highlight the importance of these residues specifically for primate IFIT1, but not necessarily for all IFIT1 proteins in all clades.

      Reviewer #1 (Recommendations for the authors): 

      (1) The readers would benefit from a more detailed background on the concept and estimation of positive selection, including the M7/8 models in PAML. 

      We have included more information in the text to provide a better background for the concepts of positive selection and how PAML tests for this using M7 and M8 models.

      (2) Presentation of data 

      a) Figure 3C and 3D: is there a better way to present the infection data so the readers can tell the ranked antiviral activity of the species that suppress VEEV? 

      We have included a supplementary table that includes the infection data for each ortholog from the VEEV and VSV screen to allow for readers to evaluate ranked antiviral activity of the species that suppress these viruses. In addition, the silhouettes next to the dot plots indicate the top ranked hits in order of viral inhibition (with the top being the most inhibitory). We have updated the figure legend to inform the reader of this information as well.

      b) Figure 4C and 4D: consider putting the western blot in Supplementary Figure 1 underneath the infection data or with the heatmap so readers can compare it with the antiviral activity. 

      We have also included quantification of the western blots performed to evaluate IFIT1 expression during the experiments shown in Figure 4C and 4D in an updated Figure 4B. We have also included normalized expression values with the heatmap shown in an updated Figure 4G so the reader can evaluate potential impact of protein expression on antiviral activity for all infection experiments shown in figure 4.

      (3) Line 269-270: as a rationale for narrowing the species to human, black flying fox, and chimp IFIT1, human and black flying fox were chosen because they strongly inhibit VEEV, but pangolin wasn't included even though it had the strongest anti-VEEV activity? 

      The rationale for narrowing the species to human, black flying fox, and chimpanzee IFIT1 was related to the availability of biological tools, high-quality genome/transcriptome sequencing databases, and other factors. Specifically, human and chimp IFIT1 are closely related but have variable antiviral activities, making their comparison highly relevant. Bats are well established as reservoirs for diverse viruses, whereas the reservoir status of many other mammals is less well defined. Furthermore, purifying large amounts of high-quality IFIT1 protein after bacterial expression was another limitation to functional studies. We have added this information into the manuscript text.

      (4) Figure 5A: to strengthen the claim that "species-specific antiviral activities of IFIT1s can be partly explained by RNA binding potential", it would be good to include one more positive and one more negative control. In other words, test the cap0 RNA binding activity of an IFIT1 ortholog that strongly inhibits VEEV and an ortholog that does not. It would also be good to discuss why chimp IFIT1 still shows dose-dependent RNA binding yet it is one of the weakest at inhibiting VEEV. 

      We appreciate the reviewer's suggestion to include more controls and expand the dataset. While we understand the potential value of expanding the dataset, we believe that human IFIT1 serves as a robust positive control and human IFIT1 R187H (RNA-binding deficient) serves as an established negative control. Future experiments with purified IFIT1s from other species will indeed strengthen evidence linking IFIT1 species-specific activity and RNA-binding.

      Regarding chimpanzee IFIT1, we acknowledge there appears to be some dose-dependent Cap0 RNA-binding. However, the binding affinity is much weaker than that of human or black flying fox IFIT1. We speculate that during viral infection reduced binding affinity could impair the ability of chimpanzee IFIT1 to efficiently sequester viral RNA and inhibit viral translation. This reduction in binding affinity may, therefore, allow the cell to be overwhelmed by the exponential increase in viral RNA during replication, resulting in an ineffective antiviral IFIT1. In the literature, a similar phenomenon is observed by Hyde et al. (PMID: 24482115). In this study, the authors test mouse Ifit1 Cap0 RNA binding by EMSA of the 5’ UTR sequence of VEEV RNA containing an A or G at nucleotide position 3. EMSA shows binding of both the A3 and G3 Cap0 VEEV RNA sequences; however, stronger Ifit1 binding is observed for the A3 Cap0 RNA sequence. The consequences of the reduced Ifit1 binding of the G3 Cap0 VEEV RNA are observed in vitro by a substantial increase in viral titers produced from cells as well as an increase in protein produced in a luciferase-based translation assay. The authors also show in vivo relevance of this reduction of Ifit1 binding, as WT B6 mice infected with VEEV containing the A3 UTR exhibited 100% survival, while WT B6 mice infected with VEEV containing the G3 UTR survived at a rate of only ~25%. Therefore, the literature supports that a decrease in Cap0 RNA binding by an IFIT protein (while still exhibiting Cap0 RNA binding) observed by EMSA can result in considerable alterations of viral infection both in vitro and in vivo.

      Minor: 

      (1) Line 82: "including 5' triphosphate (5'-ppp-RNA), or viral RNAs..." having a comma here will make the sentence clearer. 

      We have improved the clarity of this sentence. It now reads, “IFIT1 binds uncapped 5′-triphosphate RNA (5′-ppp-RNA) and capped but unmethylated RNA (Cap0, an m7G cap lacking 2′-O methylation).”

      (2) Line 100: "...similar mechanisms have been at least partially evolutionarily conserved in IFIT proteins to restrict viral infection by IFIT proteins". 

      We have updated the text to improve clarity by revising the sentence to “VEEV TC-83 is sensitive to human IFIT1 and mouse Ifit1B, indicating at least partial conservation of antiviral function by IFIT proteins."

      (3) Line 109: "signatures of rapid evolution or positive selection" would put positive selection second because that is the more technical term that can benefit from the more layperson term (rapid evolution). 

      We have updated this sentence incorporating this suggestion. “Positive selection, or rapid evolution, is denoted by a high ratio of nonsynonymous to synonymous substitutions (dN/dS >1).”

      (4) Lines 116-117: "However, this was only assessed in a few species" would benefit from a citation. 

      We have inserted the citation.

      (5) Line 127 heading: "IFIT1 is rapidly evolving in mammals" would be more accurate to say "in major clades of mammals". 

      We have updated the text to include this suggestion.

      (6) Line 165: "IFIT1 L193 mutants". 

      We have updated the text to rephrase this for clarity.

      (7) Line 170: two strains of VEEV were mentioned in the Intro, so it would be good to specify which strain of VEEV was used?

      We have updated the text to clarify the VEEV strain. In this study, all experiments were performed using the VEEV TC-83 strain.

      (8) Line 174: "Indeed, all mutants at position 193, whether hydrophobic or positively charged, inhibited VEEV similarly to the WT..." It should read "all hydrophobic and positively charged mutants inhibited VEEV similarly to the WT...". 

      We corrected as suggested. 

      (9) Line 204: what are "control cells"? Cells that are mock-infected, or cells without IFIT1? 

      We have updated the text to improve clarity. What we refer to as control cells were cells expressing an empty vector control rather than an IFIT1.

      (10) Need to clarify n=2 and n=3 replicates throughout the manuscript. Does that refer to three independent experiments? Or an experiment with triplicate wells/samples? 

      We have updated the text to say “independent experiments” instead of “biological replicates” to prevent any confusion. All n=2 or n=3 replicates denote independent experiments.

      (11) Line 254: "dominant antiviral effector against the related human parainfluenza virus type 5..." 

      We have updated the text to improve clarity.

      (12) Line 271: "The black flying fox (Pteropus alecto), is a model megabat species..." scientific name was italicized here but not elsewhere. Remove comma.

      We have updated the text accordingly.

      (13) Line 293: "...chimpanzee IFIT1 lacked these properties" but chimp IFIT1 can bind cap0 RNA, just at a lower level. 

      We have updated the text to acknowledge that chimpanzee IFIT1 can bind cap0 RNA, albeit at a lower level than human IFIT1.

      (14) Figure 6B: please fix the x-axis labels. They're very cramped. 

      We have updated the x-axis labels for figure 6B and figure 6D to improve clarity.

      (15) Line 609: "...trimmed and aligned"? 

      Our phrasing is to indicate that coding sequences were aligned, and gaps were removed to reduce the chance of false positive signal by underrepresented codons such as gaps or short insertions. We have removed “trimmed” from the text and changed the text to say “aligned sequences” to increase clarity.

      Reviewer #2 (Recommendations for the authors): 

      (1) Numbers less than 10 should be spelled out throughout the manuscript (e.g. line 138). 

      We have updated the text to reflect the request.

      (2) Line 165: "expression of IFIT1 193 mutants" should be rephrased. 

      We have updated the text to rephrase this sentence for clarity.

      (3) A supplemental table or file should be included that contains the accession number and species names of sequences used for evolutionary analyses and for functional testing. In addition, the alignments that were used for positive selection can be included.  

      We have included a supplemental file containing accession numbers, species names for evolutionary analysis and functional studies. In addition, this table includes the infection data for each IFIT1 homolog for the screen performed in figure 3.

      (4) The discussion of potential functions of the C-terminus of IFIT1 should include possible interactions with other proteins. In particular, the C-terminus of IFIT1 has been shown to interact with IFIT3 in a way that modulates its activity (PMID: 29525521). Although residues 362-366 were not shown in that paper to interact with a fragment of IFIT3, it is possible that these residues may be important for interaction with full-length IFIT3 or some other IFIT1 binding partner. 

      We thank the reviewer for their suggestion. We have expanded the discussion to explore the possibility that residues 364 and 366 of IFIT1 may be involved in IFIT1-IFIT3 interactions and consequently Cap0 RNA-binding and antiviral activity.

      (5) The quantification of the EMSAs should be described in more detail. In particular, from looking at the images shown in Figure 5A, it would appear that human and chimpanzee IFIT1 show similar degrees of probe shift, while the human R187H panel shows no shifting at all. However, the quantification shows chimpanzee IFIT1 as being statistically indistinguishable from human R187H. Additional information on how bands were quantified and whether they were normalized to unshifted RNA would be helpful in attempting to resolve this visual discordance. 

      EMSAs were quantified by determining Adj. Vol. Intensity in ImageLab (BioRad), which subtracts background signal, after imaging at the same exposure and SYBR Gold staining time. To determine Adj. Vol. Intensity, we drew a box (the same size for each gel and for each lane in each replicate) above the free probe in each lane. These values were not normalized to unshifted RNA; however, equal amounts of RNA were loaded. While the ANOVA shows no significant difference between the band shift intensities of human R187H and chimpanzee IFIT1, this is potentially due to the between-group variance in the ANOVA. The AUC value for chimpanzee IFIT1 is 36.4% higher than that for R187H.

      The AUC of Adj. Vol. Intensity of the human IFIT1 band shift is roughly 2-fold higher than that of chimpanzee IFIT1. We believe this matches the visual representation as well, as human IFIT1 has a darker “upper” band in the shift, as well as a clear dark “lower” band that is not well defined in the chimpanzee shift. Furthermore, the upper band of the chimpanzee IFIT1 shift in the 400 nM lane appears to be as intense as the upper band in the 240 nM human IFIT1 lane, without taking into account the lower band seen for human IFIT1. We included this quantification because the Kd could not be calculated, as there was no clear disappearance of the free probe. We do not intend this quantification to substitute for binding affinity calculations, but rather to aid the reader in data interpretation.

      Reviewer #3 (Recommendations for the authors): 

      (1) IFIT1 has been demonstrated to function in conjunction with other IFIT proteins, do you think the absence of antiviral activity is due to isolated expression of IFIT1 without these cofactors, and therefore might explain why there was little overlap observed in orthologs that inhibited the viruses tested (Figure 3, lines 209-210). 

      We do not believe that isolated expression of IFIT1 without cofactors (such as orthologous IFIT proteins) would fully explain the disparities in antiviral activity, as many of the IFIT1s that were expressed inhibited either VSV or VEEV in our screen. However, we acknowledge that expressing IFIT1 alone is a limitation of our study, as IFIT1 antiviral activity and RNA binding can be modulated by interactions with other IFIT proteins. Therefore, we believe it is possible that co-expression of IFIT1 with other IFITs from a given species might enhance antiviral activity. Future studies may shed light on this.

      (2) Figure 5 - Calculating the Kd for each protein would be more informative. How does the binding affinity of these IFIT1 proteins compare to that which has previously been reported? 

      We are unable to accurately determine the Kd because the free-probe signal does not diminish substantially. Therefore, we are only able to compare IFIT1 protein binding between species without an accurate mathematical calculation of binding affinity. Our result does appear similar to that of mouse Ifit1 binding to VEEV RNA (PMID: 24482115), in which the authors also do not calculate a Kd for their RNA EMSA.

      (3) Mutants 364 and 366 may not have direct contact with RNA, but RNA EMSA data presented suggest that the binding affinity may be different (though this is hard to conclude without Kd data). Additional biochemical data with these mutants might provide more insight here. 

      We agree that further studies using 364 and 366 double mutant human and chimpanzee protein in EMSAs would provide additional biochemical data and provide insight into the role of these residues in direct RNA binding. We acknowledge this is a limitation of our study as we provide only genetic data demonstrating the importance of these residues.

      (4) Given that there appears to be very little overlap observed in orthologs that inhibited the viruses tested, it's possible that other amino acids may be key drivers of antiviral activity in these other orthologs. Thus, it's difficult to conclude whether the findings that residues 362/4/6 are important for IFIT1 activity can be broadly applied to other orthologs. A more systematic assessment of the role of these mutations across multiple diverse orthologs would provide more insight here. Do other antiviral proteins show this trend (ie exhibit little overlap in orthologs that inhibit these viruses). What do you think might be driving this? 

      We agree that other residues outside of 364 and 366 may be key drivers of antiviral activity across the IFIT1 orthologs tested. We do not hypothesize that this will broadly apply across IFIT1 from diverse clades of mammals as overall amino acid identity can differ by over 30%. However, based on the chimpanzee and human IFIT1 data, as well as sequence alignment within primates specifically, we believe these residues may be key for primate (but not necessarily other clades of mammals) IFIT1 antiviral activity.

      Regarding whether other antiviral proteins show little overlap in orthologs that inhibit a given virus, to our knowledge such a functional study with this large and divergent dataset of orthologs has not been performed. However, there are many examples of restriction factors exhibiting species-specific antiviral activity when ortholog screens have been performed. For example, HIV was reported to be suppressed by MX2 orthologs from human, rhesus macaque, and African green monkey, but not sheep or dog MX2 (PMID: 24760893). In addition, foamy virus was inhibited by the human and rhesus macaque orthologs of PHF11, but not the mouse and feline orthologs (PMID: 32678836). Furthermore, studies from our lab have shown variability in the antiviral activity of RTP4 orthologs towards viruses such as hepatitis C virus (HCV), West Nile virus (WNV), and Zika virus (ZIKV) (PMID: 33113352).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      Summary: 

      Weiss and co-authors presented a versatile probabilistic tool. aTrack helps in classifying tracking behaviors and understanding important parameters for different types of single particle motion types: Brownian, Confined, or Directed motion. The tool can be used further to analyze populations of tracks and the number of motion states. This is a stand-alone software package, making it user-friendly for a broad group of researchers. 

      Strengths: 

      This manuscript presents a novel method for trajectory analysis. 

      Weaknesses: 

      (1) In the results section, is there any reason to choose the specific range of track length for determining the type of motion? The starting value is fine, and would be short enough, but do the authors have anything to report about how much is too long for the model? 

      We chose to test the range of track lengths (five to hundreds of steps) to cover the broad range of scenarios arising from single proteins or fluorophores to brighter objects with more labels. While there is no upper limit per se, the computation time of our method scales linearly with track length: 100 time points take ~2 minutes to run on a standard consumer-level desktop CPU. We have added the following sentence to note the time cost with trajectory length:

      “The recurrent formula enables our model computation time to scale linearly with the number of time points.”

      (2) Robustness to model mismatches is a very important section that the authors have uplifted diligently. Understanding where and how the model is limited is important. For example, the authors mentioned the limitation of trajectory length, do the authors have any information on the trajectory length range at which this method works accurately? This would be of interest to readers who would like to apply this method to their own data. 

      We agree that limitations are important to estimate, and trajectory length is an important consideration when choosing how to analyze a dataset. We report the categorization certainty, i.e. the likelihood differences, for a range of track lengths (Fig. 2 a,c, Fig. 3c-d, and Fig. 4 c,g.).

      For example, here are the key plots from Fig. 2 quantifying the relative likelihoods, where being within the light region is necessary. The light areas represent a useful likelihood ratio.

      We only performed analysis up to track lengths of 600 time steps, but parameter estimates and significance can only improve as the track length increases, as long as the model assumptions hold. The broader limitations and future opportunities for new methods are now expanded upon in the discussion, for example switching between states and ambiguities between states and models (bound vs very slow diffusion vs very slow motion).

      (3) aTrack extracts certain parameters from the trajectories to determine the motion types. However, it is not very clear how certain parameters are calculated. For example, is the diffusion coefficient D calculated from fitting, and how is the confinement factor defined and estimated, with equations? This information will help the readers to understand the principles of this algorithm.

      We apologize for the confusion. All the model parameters are fit using the maximum likelihood approach. To make this point clearer in the manuscript, we have made three changes:

      (1) We modified the following sentence to replace “determined” with "fit”:

      “Finally, Maximum Likelihood Estimation (MLE) is used to fit the underlying parameter value”

      (2) We added the following sentence in the main text :

      “In our model, the velocity is the characteristic parameter of directed motion and the confinement factor represents the force within a potential well. More precisely, the confinement factor $l$ is defined such that at each time step the particle position is updated by $l$ times the distance particle/potential well center (see the Methods section for more details).”.

      (3) We have added a new section in the methods, called Fitting Method, where we have added the explanation below:

      “For the pure Brownian model, the parameters are the diffusion coefficient and the localization error. For the confinement model, the parameters are the diffusion coefficient, the localization error, confinement factor, and the diffusion coefficient of the potential well. For the directed model, the parameters are the diffusion coefficient, the localization error, the initial velocity and the acceleration variance.

      These parameters are estimated using the maximum likelihood approach which consists in finding the parameters that maximize the likelihood. We realize this fitting step using gradient descent via a TensorFlow model. All the estimates presented in this article are obtained from a single set of initial parameters to demonstrate that the convergence capacity of aTrack is robust to the initial parameter values.”
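      The confinement-factor definition quoted above can be illustrated with a short simulation. This is a minimal sketch and not the aTrack implementation: the function name, the use of a per-step diffusion length `d`, and the fixed well center are assumptions made for illustration (the model described above also lets the well itself diffuse).

```python
import numpy as np

def simulate_confined_track(n_steps, d, l, loc_error, center=(0.0, 0.0), seed=0):
    """Simulate a 2D confined track (hypothetical helper, not part of aTrack).

    d         : per-step diffusion length (std of the random step per axis)
    l         : confinement factor; each step moves the particle by
                l times its displacement from the well center
    loc_error : std of the Gaussian localization error per axis
    """
    rng = np.random.default_rng(seed)
    center = np.asarray(center, dtype=float)
    pos = center.copy()
    hidden = np.empty((n_steps, 2))
    for t in range(n_steps):
        hidden[t] = pos
        # Pull toward the well center, then add a Brownian step.
        pos = pos + l * (center - pos) + rng.normal(0.0, d, size=2)
    # Observed positions are the hidden positions plus localization error.
    observed = hidden + rng.normal(0.0, loc_error, size=hidden.shape)
    return hidden, observed

hidden, observed = simulate_confined_track(n_steps=2000, d=0.1, l=0.2, loc_error=0.02)
```

      With `l = 0` the same function produces free Brownian motion, which makes it easy to compare confined and unconfined tracks generated with otherwise identical parameters.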

      (4) The authors mentioned the scenario where a particle may experience several types of motion simultaneously. How are these motions simulated and what do they mean in terms of motion types? Are they mixed motion (a particle switches motion types in the same trajectory) or do they simply present features of several motion types? It is not intuitive to the readers that a particle can be diffusive (Brownian) and directed at the same time. 

      In the text, we present an example where one can observe this type of motion to help the reader understand when this type of motion can be met: “Sometimes, particles undergo diffusion and directed motion simultaneously, for example, particles diffusing in a flowing medium (Qian 1991).”

      This is simulated by the addition of two terms affecting the hidden position variable before adding a localization term to create the observed variable. In the analysis, this manifests as non-zero values for the diffusion coefficient and the linear velocity. For example, Figure 4g and the associated text, where a single particle moves with a directed component and a Brownian diffusion component at each step.
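      The two-term construction described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' simulation code: the function name is hypothetical, the velocity is held constant, and `d` is taken to be the per-step diffusion length per axis.

```python
import numpy as np

def simulate_directed_diffusive_track(n_steps, d, velocity, loc_error, seed=0):
    """Simulate a 2D track with simultaneous diffusion and directed motion.

    d         : per-step diffusion length (std of the Brownian step per axis)
    velocity  : constant displacement per step, e.g. (vx, vy)
    loc_error : std of the Gaussian localization error per axis
    """
    rng = np.random.default_rng(seed)
    velocity = np.asarray(velocity, dtype=float)
    # Each hidden step is the sum of a Brownian term and a directed term.
    steps = rng.normal(0.0, d, size=(n_steps - 1, 2)) + velocity
    hidden = np.vstack([np.zeros(2), np.cumsum(steps, axis=0)])
    # The observed variable adds a localization term to the hidden position.
    observed = hidden + rng.normal(0.0, loc_error, size=hidden.shape)
    return hidden, observed

hidden, observed = simulate_directed_diffusive_track(
    n_steps=100, d=0.1, velocity=(0.02, 0.0), loc_error=0.02
)
```

      Setting `d = 0` gives purely directed motion and setting `velocity = (0, 0)` gives pure Brownian motion, so both limiting cases fall out of the same generator.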

      We did not simulate transitions between types of motion. Switching is not treated by this current model; however, this limitation is described in the discussion and our team and others are currently working on addressing this challenge.

      Reviewer #2 (Public Review): 

      Summary: 

      The authors present a software package "aTrack" for identification of motion types and parameter estimation in single-particle tracking data. The software is based on maximum likelihood estimation of the time-series data given an assumed motion model and likelihood ratio tests for model selection. They characterized the performance of the software mostly on simulated data and showed that it is applicable to experimental data. 

      Strengths: 

      A potential advantage of the presented method is its wide applicability to different motion types. 

      Weaknesses: 

      (1) There has been a lot of similar work in this field. Even though the authors included many relevant citations in the introduction, it is still not clear what this work uniquely offers. Is it the first time that direct MLE of the time-series data was developed? Suggestions to improve would include (a) better wording in the introduction section, (b) comparing to other popular methods (based on MSD, step-size statistics (Spot-On, eLife 2018;7:e33125), for example) using the simulated dataset generated by the authors, (c) comparing to other methods using data set in challenges/competitions (Nat. Comm (2021) 12:6253).  

      We thank the reviewer for this suggestion and agree that the explanation of the innovative aspects of our method in the introduction was not clear enough. We have now modified the introduction to better explain what is improved here compared to previous approaches.

      “The main innovations of this model are: 1) it uses analytical recurrence formulas to perform the integration step for complex motion, improving speed and accuracy; 2) it handles both confined and directed motion; 3) anomalous parameters, such as the center of the potential well and the velocity vector are allowed to change through time to better represent tracks with changing directed motion or confinement area; and lastly 4) for a given track or set of tracks, aTrack can determine whether tracks can be statistically categorized as confined or directed, and the parameters that best describe their behavior, for example, diffusion coefficient, radius of confinement, and speed of directed motion.”

      Regarding alternatives, we compare our method in the text to the best-performing algorithm of the 2021 Anomalous Diffusion (AnDi) Challenge mentioned by the reviewer in Figure 6 (RANDI, Argun et al, arXiv, 2021, Muñoz-Gil et al, Nat Com. 2021). Notably, both methods performed similarly on fBm, but ours was more robust in cases where there were small differences between the process underlying the data and the model assumptions, a likely scenario in real datasets. Regarding Spot-On, this was not mentioned as it only deals with multiple populations of Brownian diffusers, preventing a quantitative comparison.

      (2) The Hypothesis testing method presented here has a number of issues: first, there is no definition of testing statistics. Usually, the testing statistics are defined given a specific (Type I and/or Type II) error rate. There is also no discussion of the specificity and sensitivity of the testing results (i.e. what's the probability of misidentification of a Brownian trajectory as directed? etc).

      We now explain our statistical approach and how to perform hypothesis testing with our metric in a new supplementary section, Statistical test. 

      We use the likelihood ratio as a more conservative alternative to the p-value. In Fig S2, we show that our metric is an upper bound of the p-value and can be used to perform hypothesis testing with a chosen type I error rate. 

      Related, it is not clear what Figure 2e (and other similar plots) means, as the likelihood ratio is small throughout the parameter space. Also, for likelihood ratio tests, the authors need to discuss how model complexity affects the testing outcome (as more complex models tend to be more "likely" for the data) and also how the likelihood function is normalized (normalization is not an issue for MLE but critical for ratio tests). 

      We present the likelihood ratio as an upper bound of the p-value. Therefore, we can reject the null hypothesis if it is smaller than a given threshold, e.g. 0.05, but this number should be decreased if multiple tests are performed. The colorscale we show in the figure is meant to highlight the working range (light), and ambiguous range (dark) of the method.

      As the reviewer mentions, we expect the alternative hypothesis to result in higher likelihoods than the simpler null hypothesis for null hypothesis tracks, but, as seen in the Fig S2, the likelihood ratio of a dataset corresponding to the null hypothesis is strongly skewed toward its upper limit 1. This means that for most of the tracks, the likelihood is not (or little) affected by the model complexity. The likelihoods of all the models are normalized so their integrals over the data equals 1/A with A the area of the field of view which is independent of the model complexity.
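      The decision rule described in this reply can be written in a few lines. The sketch below is an assumption-laden illustration rather than the aTrack API: the function name is hypothetical, and the Bonferroni division by the number of tests is one common way to "decrease" the threshold, which the reply does not specify.

```python
import math

def reject_null(log_l_null, log_l_alt, alpha=0.05, n_tests=1):
    """Return True if the null (Brownian) model is rejected.

    The likelihood ratio L_null / L_alt is treated as an upper bound on the
    p-value, so the null is rejected when the ratio falls below the threshold.
    Working with log-likelihoods avoids numerical underflow for long tracks.
    """
    log_ratio = log_l_null - log_l_alt
    # Assumption: Bonferroni correction when several tests are performed.
    threshold = alpha / n_tests
    return log_ratio < math.log(threshold)

# A track whose alternative model is e^10 times more likely than the null.
decision = reject_null(log_l_null=-110.0, log_l_alt=-100.0)
```

      Because the ratio only bounds the p-value from above, this rule is conservative: it can fail to reject in cases where an exact p-value would fall below the threshold.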

      (3) Relating to the mathematical foundation (Figure 1b). The measured positions are drawn as direct arrows from the real position states: this infers instantaneous localization. In reality, there is motion blur which introduces a correlation of the measured locations. Motion blur is known to introduce bias in SPT analysis, how does it affect the method here? 

      The reviewer raises an important point as our model does not explicitly consider motion blur. We have now added a paragraph that presents how our model performs in case of motion blur in the section called Robustness to model mismatches. This section and the corresponding new Supplemental Fig. S7 demonstrate that the estimated diffusion length is accurate so long as the static localization error is higher than the dynamic localization error. If the dynamic localization error is higher, our model systematically underestimates the diffusion length by a factor of 0.81, i.e. $(2/3)^{0.5}$, which can be corrected for with an added post-processing step.

      (4) The authors did not go through the interpretation of the figure. This may be a matter of style, but I find the figures ambiguous to interpret at times.  

      We thank the reviewer for their feedback on improving the readability. To avoid overly repetitive and lengthy sections of text, we have opted for a concise approach. This allows us to present closely related panels at the same point in the text, while not ignoring important variations and tests. Considering this feedback and that of the other reviewers, we have added more information and interpretation throughout our manuscript to improve interpretability.

      (5) It is not clear to me how the classification of the 5 motion types was accomplished. 

      We have modified the specific text related to this figure to describe an illustrative example to show how one could use aTrack on a dataset where not that much is known: First, we present the method to determine the number of states; second, we verify the parameter estimates correspond to the different states.  

      Classifying individual tracks is possible. While not done in the section corresponding to Fig. 5, this is done in Fig. 7 and a new supplementary plot, Fig. S9b (shown below). In brief, this is accomplished with our method by computing the likelihood of each track given each state. The probability that a given track is in state k equals the likelihood of the track given the state divided by the sum of the likelihoods given the different states. 
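      The normalization described in this paragraph can be sketched in Python. This is a minimal illustration, not the aTrack code: the function name is hypothetical, log-likelihoods are assumed as input, and all states are weighted equally, as in the formula stated above.

```python
import numpy as np

def state_probabilities(log_likelihoods):
    """Convert per-state log-likelihoods into per-state probabilities.

    log_likelihoods : array of shape (n_tracks, n_states), where entry (i, k)
                      is the log-likelihood of track i given state k.
    Returns an array of the same shape whose rows sum to 1.
    """
    log_l = np.asarray(log_likelihoods, dtype=float)
    # Subtract the row-wise maximum before exponentiating to avoid underflow.
    shifted = log_l - log_l.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    # Likelihood given state k divided by the sum over all states.
    return weights / weights.sum(axis=-1, keepdims=True)

probs = state_probabilities([[-1000.0, -1000.0],
                             [-1000.0 + np.log(3.0), -1000.0]])
```

      Each track can then be assigned to the state with the highest probability, for example with `probs.argmax(axis=-1)`.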

      (6) Figure 3. Caption: what is ((d_{est}-0.1)/0.1)? Also panel labeled as "d" should be "e". 

      Thank you for bringing these errors to our attention, the panel and caption have been corrected.

      Reviewer #3 (Public Review): 

      Summary: 

      In this work, Simon et al present a new computational tool to assess non-Brownian single-particle dynamics (aTrack). The authors provide a solid groundwork to determine the motion type of single trajectories via an analytical integration of multiple hidden variables, specifically accounting for localization uncertainty, directed/confined motion parameters, and, very novel, allowing for the evolution of the directed/confined motion parameters over time. This last step is, to the best of my knowledge, conceptually new and could prove very useful for the field in the future. The authors then use this groundwork to determine the motion type and its corresponding parameter values via a series of likelihood tests. This accounts for obtaining the motion type which is statistically most likely to be occurring (with Brownian motion as null hypothesis). Throughout the manuscript, aTrack is rigorously tested, and the limits of the methods are fully explored and clearly visualised. The authors conclude with allowing the characterization of multiple states in a single experiment with good accuracy and explore this in various experimental settings. Overall, the method is fundamentally strong, well-characterised, and tested, and will be of general interest to the single-particle-tracking field. 

      Strengths: 

      (1) The use of likelihood ratios gives a strong statistical relevance to the methodology. There is a sharp decrease in likelihood ratio between e.g. confinement of 0.00 and 0.05 and velocity of 0.0 and 0.002 (figure 2c), which clearly shows the strength of the method - being able to determine 2nm/timepoint directed movement with 20 nm loc. error and 100 nm/timepoint diffusion is very impressive. 

      We apologize for the confusion. The directed tracks in Fig. 2 have no Brownian-motion component, i.e. D=0. We have made this clearer in the main text. Specifically, this section of the text refers to a track in linear motion with 2 nm displacements per step. With 70 time points (69 steps), a single particle that moved 138 nm with a localization error of 20 nm (95% uncertainty range of 80 nm) can be statistically distinguished from slow diffusive motion.
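      The numbers in this reply can be checked with a short calculation. It assumes a Gaussian localization error with $\sigma = 20$ nm and reads the 95% range as roughly $\pm 2\sigma$; the exact convention is not stated in the reply, so this is an assumption.

$$69 \times 2\,\text{nm} = 138\,\text{nm}, \qquad 4\sigma = 4 \times 20\,\text{nm} = 80\,\text{nm}.$$

      Under that reading, the net displacement over the track (138 nm) exceeds the uncertainty range of a single localization (80 nm), even though each individual 2 nm step is far below the localization error.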

      In Fig. 4g, we explore the capabilities of our method to detect if a diffusive particle also has a directed motion component. 

      (2) Allowing the hidden variables of confinement and directed motion to change during a trajectory (i.e. the q factor) is very interesting and allows for new interpretations of data. The quantifications of these variables are, to me, surprisingly accurate, but well-determined. 

      (3) The software is well-documented, easy to install, and easy to use. 

      Weaknesses: 

      (1) The aTrack principle is limited to the motions incorporated by the authors, with, as far as I can see, no way to add new analytical non-Brownian motion. For instance, being able to add a dynamical state-switching model (i.e. quick on/off switching between mobile and non-mobile, for instance, repeatable DNA binding of a protein), could be of interest. I don't believe this necessarily has to be incorporated by the authors, but it might be of interest to provide instructions on how to expand aTrack.  

      We agree that handling dynamic state switching is very useful and highlight this potential future direction in the discussion. The revised text reads:

      “An important limitation of our approach is that it presumes that a given track follows a unique underlying model with fixed parameters. In biological systems, particles often transition from one motion type to another; for example, a diffusive particle can bind to a static substrate or molecular motor (46). In such cases, or in cases of significant mislinkings, our model is not suitable. However, this limitation can be alleviated by implicitly allowing state transitions with a hidden Markov Model (15) or alternatives such as change-point approaches (30, 47, 48), and spatial approaches (49).”

      (2) The experimental data does not very convincingly show the usefulness of aTrack. The authors mention that SPBs are directed in mitosis and not in interphase. This can be quantified and studied by microscopy analysis of individual cells and confirming the aTrack direction model based on this, but this is not performed. Similarly, the size of a confinement spot in optical tweezers can be changed by changing the power of the optical tweezer, and this would far more strongly show the quantitative power of aTrack. 

      We agree with the reviewer and have revised the biological experiment section significantly to better illustrate the potential of aTrack in various use cases.

      Now, we show an experiment to quantify the effect of LatA, an actin inhibitor, on the fraction of directed tracks obtained with aTrack. We find that LatA significantly decreases directed motion while a LatA-resistant mutant is not affected (Fig7a-c).

      As suggested by the reviewer, we have expanded the optical tweezer experiment by varying the laser power. As expected, increasing the laser power decreases the confinement radius.

      (3) The software has a very strict limit on the number of data points per trajectory, which is a user input. Shorter trajectories are discarded, while longer trajectories are cut off to the set length. It is not explained why this is necessary, and I feel it deletes a lot of useful data without clear benefit (in experimental conditions).

      We thank the reviewer for this recommendation; we have now modified the architecture of our model to enable users to consider tracks of multiple lengths. Note that the computation time is proportional to the longest track length times the number of tracks.  

      Reviewer #2 (Recommendations For The Authors): 

      Develop a better mathematical foundation for the likelihood ratio tests. 

      We added more explanation of the likelihood ratio tests and their interpretation in a new section entitled Statistical test in the supplementary information to address this recommendation.

      Place this work in clearer contexts. 

      We have now revised the introduction to better contextualize this work.

      Improve manuscript clarity. 

      Based on reviewer feedback and input from others, we have addressed this point throughout the article to improve readability.

      Make the code available. 

      The code is available on https://github.com/FrancoisSimon/aTrack, now including code for track generation.

      Reviewer #3 (Recommendations For The Authors): 

      (1) I believe the underlying model presented in Figure 1 is of substantial impact, especially when considering it as a simulation tool. I would suggest the authors make their method also available as a simulator (as far as I can tell, this is not explicitly done in their code repository, although logically the code required for the simulator should already be in the codebase somewhere). 

      Thank you for this suggestion. The simulation scripts are now on the GitHub repository together with the rest of the analysis method: https://github.com/FrancoisSimon/aTrack

      (2) The authors should explore and/or discuss the effects of wrong trajectory linking to their method. Throughout the text, fully correct trajectory linking is assumed and assessed, while in real experiments, it is often the case that trajectory linking is wrong, e.g. due to blinking emitters, imaging artefacts, high-density localizations, etc etc. This would have a major impact on the accuracy of trajectories, and it is extremely relevant to explore how this is translated to the output of aTrack. 

      As the reviewer notes, our current model does not account for track mislinking. This limits the method to data with lower fluorophore-densities, which is the typical use-case for SPT. We have added a brief description of the issue into the discussion of limitations.  

      (3) aTrack only supports 2D-tracking, but I don't believe there is a conceptual reason not to have this expanded to three dimensions. 

      The stand-alone software is currently limited to 2D tracks, however, the aTrack Python package works for any number of dimensions (i.e. 1-3). Note that since the current implementation assumes a single localization error for all axes, more modifications may be required for some types of 3D tracking. See https://github.com/FrancoisSimon/aTrack for more details about aTrack implementations.

      (4) Crucial information is missing in the experimental demonstrations. Especially in the NP-bacteria dataset, I miss scalebars, and information on the number of tracks. It is not explained why 5 different states are obtained - especially because I would naively expect three states: immobile NPs (e.g. stuck to glass), diffusing NPs, and NPs attached to bacteria, and thus directed. Figure 7e shows three diffusive states (why more than one?), no immobile states (why?), and two directed states (why?). 

      We thank the reviewer for pointing out these issues. We have now added scalebars and more experimental details to the figure and text as well as modifying the plot to more clearly emphasize the directed nanoparticles that are attached to cells from the diffusive nanoparticles.  

      Likely, our focal plane was too high to see the particles stuck on glass. The multiple diffusive states may be caused by different sizes of nanoparticle complexes, and the multiple directed states may arise because the directed motion of the cell-attached nanoparticles occasionally shows drastic changes of orientation. We have also clarified in the text how multiple states can help handle a heterogeneous population as was shown by Prindle et al. 2022, Microbiol Spectr. The characterization and phenotyping of microbial populations by nanoparticle tracking was published in Zapata et al. 2022, Nanoscale. 

      (5) I don't think I agree that 'robustness to model mismatches' is a good thing. Very crudely, the fact that aTrack finds fractional Brownian motion to be normal Brownian motion is technically a downside - and this should be especially carefully positioned if (in the future) a fractional Brownian motion model would be added to aTrack. I think that the author's point can be better tested by e.g. widely varying simulated vs fitted loc precision/diffusion coefficient (which are somewhat interchangeable).

      In this context, our intention in describing the robustness to “model mismatches” refers to classifying subdiffusion as subdiffusive irrespective of the exact subdiffusion motion physics (as well as superdiffusion), that is, to use aTrack how MSD analysis is often deployed. This is important in the context of real-world applications where simple mathematical models cannot perfectly represent real tracks with greater complexity. 

      Inevitably, some fraction of tracks with a pure Brownian motion may appear to match with a fractional Brownian motion, and thus statistical tests are needed to determine if this is significant. In general, aTrack finds fBm to be normal Brownian motion only when the anomalous coefficient is near 1, i.e. when the two models are indeed the same. When analysing fBm tracks with anomalous coefficients of 0.5 or 1.5, aTrack finds that these tracks are better explained by our confined diffusion model or directed motion model, respectively (please see Fig. 6a, copied below). 
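
      To make the notion of an anomalous coefficient concrete, the following is a minimal sketch of the conventional MSD-based estimate that the response compares aTrack to. It does not use the aTrack package and is not the authors' method; the function name `msd_exponent`, the lag range, and the simulated tracks (which include no localization error) are assumptions chosen for illustration.

```python
import numpy as np

def msd_exponent(track, max_lag=10):
    """Estimate the anomalous exponent alpha from MSD(lag) ~ lag**alpha."""
    track = np.asarray(track, dtype=float)
    lags = np.arange(1, max_lag + 1)
    # Time-averaged mean squared displacement at each lag.
    msd = np.array([
        np.mean(np.sum((track[lag:] - track[:-lag]) ** 2, axis=1))
        for lag in lags
    ])
    # The slope of log(MSD) against log(lag) is the exponent.
    alpha, _ = np.polyfit(np.log(lags), np.log(msd), 1)
    return alpha

# Hypothetical test tracks: a 2D random walk and a constant-velocity track.
rng = np.random.default_rng(0)
brownian = np.cumsum(rng.normal(size=(20000, 2)), axis=0)
directed = np.outer(np.arange(200), [0.3, 0.1])

print(f"Brownian alpha: {msd_exponent(brownian):.2f}")
print(f"Directed alpha: {msd_exponent(directed):.2f}")
```

      An exponent near 1 indicates Brownian motion, values below 1 indicate subdiffusion, and values above 1 indicate superdiffusion, with 2 as the limit for purely directed motion.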

      To better clarify our objective, the section now has a brief introduction that reads:

      “One of the most important features of a method is its robustness to deviations from its assumptions. Indeed, experimental tracking data will inevitably not match the model assumptions to some degree, and models need to be resilient to these small deviations.”  

      Smaller points: 

      (1) It is not clear what a biological example is of rotational diffusion. 

      We modified the text to better explain the use of rotational diffusion.

      (2) The text in the section on experimental data should be expanded and clarified, there currently are multiple 'floating sentences' that stop halfway, and it does not clearly describe the biological relevance and observed findings.  

      We thank the reviewer for pointing out this issue. We have reworked the experimental section to better and more clearly explain the biological relevance of the findings.

      (3) Caption of figure 3: 'd' should be 'e'. 

      (4) Caption of Figure 7: log-likelihood should be Lconfined - Lbrownian, I believe. 

      (5) Equation number missing in SI first sentence. 

      (6) Supplementary Figure 1 top part access should be Lc-Lb instead of Ld-Lb. 

      We have made these corrections, thank you for bringing them to our attention.

    1. Author response:

      The following is the authors’ response to the current reviews.

      We thank the editors and the reviewers for their helpful comments. We have provided a response to reviewers’ recommendations and made some revisions on the manuscript. 

      Reviewer #1 (Recommendations for the authors): 

      In the newly added population factor analysis, several methodological decisions remain unclear to me:

      In Figure 7, why do the authors compare the mean distance between conditions in the latent spaces of MIo and SIo? Since these latent spaces are derived separately, they exist on different scales (with MIo appearing roughly four times larger than SIo), and this discrepancy is reflected in the reported mean distances (Figure 7, inset plots). Wouldn't this undermine a direct comparison?

      Thank you for this helpful feedback. The reviewer is correct that the latent spaces are derived separately for MIo and SIo, thus they exist on different scales as we have noted in the caption of Figure 7: “Axes for SIo are 1/4 scale of MIo.” 

      To allow for a direct comparison between MIo and SIo, we corrected the analysis by comparing their normalized mean inter-trajectory distances obtained by first calculating the geometric index (GI) of the inter-trajectory distances, d, between each pair of population trajectories per region as: GI = (d<sub>1</sub> - d<sub>2</sub>)/(d<sub>1</sub> + d<sub>2</sub>). We then performed the statistics on the GIs and found a significant difference between mean inter-trajectory distances in MIo vs. SIo. We performed the same analysis comparing the distance travelled between MIo and SIo trajectories by getting the normalized difference in distances travelled and still found a significant difference in both tasks. We have updated the results and figure inset to reflect these changes.
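
      As an illustration of why this normalization removes the scale problem, here is a small sketch of the GI formula quoted above. The function name `geometric_index` and the example distances are hypothetical, and the response does not specify exactly which pair of distances enters as d<sub>1</sub> and d<sub>2</sub>, so they are treated here simply as two positive distances measured in the same latent space.

```python
import numpy as np

def geometric_index(d1, d2):
    """GI = (d1 - d2) / (d1 + d2), bounded in [-1, 1] for positive distances."""
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    return (d1 - d2) / (d1 + d2)

# Hypothetical inter-trajectory distances in one latent space.
d1 = np.array([3.0, 2.0, 5.0])
d2 = np.array([1.0, 2.0, 4.0])
gi = geometric_index(d1, d2)

# The same distances in a latent space whose axes are four times larger.
gi_rescaled = geometric_index(4 * d1, 4 * d2)

print(gi)
print(np.allclose(gi, gi_rescaled))
```

      Any common scale factor cancels between numerator and denominator, so GIs computed in separately derived latent spaces can be compared even when the raw distances cannot.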

      In Figure 12, unlike Figure 7 which shows three latent dimensions, only two factors are plotted. While the methods section describes a procedure for selecting the optimal number of latent factors, Figure 7 - figure supplement 3 shows that variance explained continues to increase up to about five latent dimensions across all areas. Why, then, are fewer dimensions shown?

      Thank you for the opportunity to clarify the figure. The m obtained from the 3-fold cross-validation varied for the full sample and was 20 factors for the subsample. We clarify that all statistical analyses were done using 20 latent factors. Using the full sample of neurons, the first 3 factors explained 81% of variance in feeding data compared to 71% in drinking data. When extended to 5 factors, feeding maintained its advantage with 91% variance explained versus 82% for drinking. Because feeding showed higher variance explained than drinking across 3 or 5 factors, only three factors were shown in Figure 7 for better visualization. We added this clarification to the Methods and Results.

      Figure 12 shows the differences in the neural trajectories between the control and nerve block conditions. The control vs. nerve block comparison complicated the visualization of the results. Thus, we plotted only the two latent factors with the highest separation between population trajectories. This was clarified in the Methods and caption of Figure 12.

      In Figure 12, factor 2 and 3 are plotted against each other? and factor 1 is left out?

      This observation is incorrect; Factor 1 was included: Top subplots (feeding) show Factor 1 vs 3 (MIo) and Factor 1 vs 2 (SIo) while the bottom subplots (drinking) show Factor 2 vs 3 (MIo) and Factor 1 vs 2 (SIo).  We have clarified this in the Methods and caption of Figure 12.

      Finally, why are factor analysis results shown only for monkey R? 

      Factor analysis results were performed on both animals, but the results were shown only for monkey R to decrease the number of figures in the manuscript. Figure 7- figure supplement 1 shows the data for both monkeys. Here are the equivalent Figure 7 plots for monkey Y. 

      Author response image 1.

      Reviewer #2 (Recommendations for the authors): 

      Overall, the manuscript has been improved. 

      New analyses provide improved rigor (as just one example, organizing the feeding data into three-category split to better match the three-direction drinking data decoding analysis and also matching the neuron counts).

      The updated nerve block change method (using an equal number of trials with a similar left-right angle of movement in the last 100 ms of the tongue trajectory) somewhat reduces my concern that kinematic differences could account for the neural changes, but on the other hand the neural analyses use 250 ms (meaning that the neural differences could be related to behavioral differences earlier in the trial). Why not subselect to trials with similar trajectories throughout the whole movement (or at least show that as an additional analysis, albeit one with lower trial counts)? 

      As the reviewer pointed out, selecting similar trajectories throughout the whole movement would result in lower trial counts that lead to poor statistical power. We think that the 100 ms prior to maximum tongue protrusion is a more important movement segment to control for similar kinematics between the control and nerve block conditions since this represents the subject’s intended movement endpoint. 

      A lot of the Results seemed like a list of measurements without sufficient hand-holding or guide-posting to explain what the take-away for the reader should be. Just one example to make concrete this broadly-applicable feedback: "Cumulative explained variance for the first three factors was higher in feeding (MIo: 82%, SIo: 81%) than in drinking (MIo: 74%, SIo: 63%) when all neurons were used for the factor analysis (Fig. 7)": why should we care about 3 factors specifically? Does this mean that in feeding, the neural dimensionality is lower (since 3 factors explain more of it)? Does that mean feeding is a "simpler" behavior (which is counter-intuitive and does not conform to the authors' comments about the higher complexity of feeding)? And from later in that paragraph: what are we to make of the differences in neural trajectory distances (aside from quantifying using a different metric the same larger changes in firing rates that could just as well be quantified as statistics across single-neuron PETHs)?

      Thank you for the feedback on the writing style. We have made some revisions to describe the takeaway for the reader. That fewer latent factors explain 80% of the variance in the feeding data means that the underlying network activity is relatively simple despite apparent complexity. When neural population trajectories are farther away from each other in state space, it means that the patterns of activity across tongue directions are more distinct and separable, thus, less likely to be confused with each other. This signifies that neural representations of 3D tongue directions are more robust. When there is better neural discrimination and more reliable information processing, it is easier for downstream brain regions to distinguish between different tongue directions.  

      The addition of more population-level analyses is nice as it provides a more efficient summary of the neural measurements. However, it's a surface-level dive into these methods; ultimately the goal of ensemble "computation through dynamics" analyses is to discover simpler structure / organizational principles at the ensemble level (i.e., show things not evident from single neurons), rather than just using them as a way to summarize data. For instance, here neural rotations are remarked upon in the Results, without referencing influential prior work describing such rotations and why neural circuits may use this computational motif to separate out conditions and shape muscle activity-generating readouts (Churchland et al. Nature 2012 and subsequent theoretical iterations including Russo et al.). That said, the Russo et al. tangling study was well-referenced and the present tangling results were effectively contextualized with respect to that paper in terms of the interpretation. I wish more of the results were interpreted with comparable depth. 

      Speaking of Russo et al: the authors note qualitative differences in tangling between brain areas, but do not actually quantify tangling in either. These observations would be stronger if quantified and accompanied with statistics.

      Contrary to the reviewer’s critique, we did frame these results in the context of structure/organizational principles at the ensemble level. We had already cited prior work of Churchland et al., 2012; Michaels et al., 2016 and Russo et al., 2018. In the Discussion, Differences across behaviors, we wrote: “In contrast, MIo trajectories in drinking exhibited a consistent rotational direction regardless of spout location (Fig. 7). This may reflect a predominant non-directional information such as condition-independent time-varying spiking activity during drinking (Kaufman et al., 2016; Kobak et al., 2016; Arce-McShane et al., 2023).” 

      Minor suggestions: 

      Some typos, e.g. 

      • no opening parenthesis in "We quantified directional differences in population activity by calculating the Euclidean distance over m latent factors)"

      • missing space in "independent neurons(Santhanam et al., 2009;..."); 

      • missing closing parentheses in "followed by the Posterior Inferior (Figure 3 - figure supplement 1."

      There is a one-page long paragraph in the Discussion. Please consider breaking up the text into more paragraphs each organized around one key idea to aid readability.

      Thank you, we have corrected these typos.

      Could it be that the Kaufman et al 2013 reference was intended to be Kaufman et al 2015 eNeuro (the condition-invariant signal paper)?

      Thank you, we have corrected this reference.

      At the end of the Clinical Implications subsection of the Discussion, the authors note the growing field of brain-computer interfaces with references for motor read-out or sensory write-in of hand motor/sensory cortices, respectively. Given that this study looks at orofacial cortices, an even more clinically relevant development is the more recent progress in speech BCIs (two recent reviews: https://www.nature.com/articles/s41583-024-00819-9, https://www.annualreviews.org/content/journals/10.1146/annurev-bioeng-110122012818), many of which record from human ventral motor cortex, and aspirations towards FES-like approaches for orofacial movements (e.g., https://link.springer.com/article/10.1186/s12984-023-01272-y).  

      Thank you, we have included these references.

      Reviewer #3 (Recommendations for the authors): 

      Major Suggestions 

      (1) For the factor analysis of feeding vs licking, it appears that the factors were calculated separately for the two behaviors. It could be informative to calculate the factors under both conditions and project the neural data for the two behaviors into that space. The overlap/separations of the subspace could be informative. 

      We clarify that we performed a factor analysis that included both feeding and licking for MIo, as stated in the Results: “To control for factors such as different neurons and kinematics that might influence the results, we performed factor analysis on stable neurons across both tasks using all trials (Fig. 7- figure supplement 2A) and using trials with similar kinematics (Fig. 7- figure supplement 2B).” We have revised the manuscript to reflect this more clearly.

      (2) For the LSTM, the Factor analyses and the decoding it is unclear if the firing rates are mean subtracted and being normalized (the methods section was a little unclear). Typically, papers in the field either z-score the data or do a softmax.

      The firing rates were z-scored for the LSTM and KNN. For the factor analysis, the spike counts were not z-scored, but the results were normalized. We clarified this in the Methods section.

      Minor: 

      Page 1: Abstract- '... how OSMCx contributes to...' 

      Since there are no direct causal manipulations of OSMCx in this manuscript, this study doesn't directly study the OSMCx's contribution to movement - I would recommend rewording this sentence.

      Similarly, Page 2: 'OSMCx plays an important role in coordination...' the citations in this paragraph are correlative, and do not demonstrate a causal role.

      There are similar usages of 'OSMCx coordinates...' in other places e.g. Page 8. 

      Thank you, we revised these sentences.

      Page 7: the LSTM here has 400 units, which is a very large network and contains >12000 parameters. Networks of this size are prone to memorization; it would be wise to test the R-squared of the validation set against a shuffled dataset to see if the network is actually working as intended. 

      Thank you for bringing up this important point of verifying that the network is learning meaningful patterns versus memorizing. Considering the size of our training samples, the ratio of samples to parameters is appropriate and thus the risk of memorization is low. Indeed, validation tests and cross-validation performed indicated expected network behavior and the R squared values obtained here were similar to those reported in our previous paper (Laurence-Chasen et al., 2023).
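
      The shuffle control that the reviewer suggests can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: a linear least-squares decoder stands in for the LSTM, the data are simulated, and the helper names `r_squared` and `fit_and_score` are hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_and_score(x_train, y_train, x_val, y_val):
    """Fit a linear decoder (placeholder for the LSTM) and score it on validation data."""
    a_train = np.column_stack([np.ones(len(x_train)), x_train])
    a_val = np.column_stack([np.ones(len(x_val)), x_val])
    coef, *_ = np.linalg.lstsq(a_train, y_train, rcond=None)
    return r_squared(y_val, a_val @ coef)

# Simulated "neural features" x and "kinematic" target y (assumed values).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))
y = x @ np.array([1.0, -2.0, 0.5, 1.5, -1.0]) + 0.1 * rng.normal(size=200)
x_train, x_val, y_train, y_val = x[:150], x[150:], y[:150], y[150:]

# Real pairing versus a control in which training targets are permuted.
r2_real = fit_and_score(x_train, y_train, x_val, y_val)
r2_shuffled = fit_and_score(x_train, rng.permutation(y_train), x_val, y_val)

print(f"validation R^2, real pairing:     {r2_real:.3f}")
print(f"validation R^2, shuffled targets: {r2_shuffled:.3f}")
```

      If the decoder is learning a genuine relationship, the validation R-squared should fall to roughly zero once the pairing between inputs and targets is destroyed; a shuffled score close to the real one would instead point to a problem in the evaluation, such as leakage between training and validation data.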


      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In their paper, Hosack and Arce-McShane investigate how the 3D movement direction of the tongue is represented in the orofacial part of the sensory-motor cortex and how this representation changes with the loss of oral sensation. They examine the firing patterns of neurons in the orofacial parts of the primary motor cortex (MIo) and somatosensory cortex (SIo) in non-human primates (NHPs) during drinking and feeding tasks. While recording neural activity, they also tracked the kinematics of tongue movement using biplanar videoradiography of markers implanted in the tongue. Their findings indicate that most units in both MIo and SIo are directionally tuned during the drinking task. However, during the feeding task, directional tuning was more frequent in MIo units and less prominent in SIo units. Additionally, in some recording sessions, they blocked sensory feedback using bilateral nerve block injections, which resulted in fewer directionally tuned units and changes in the overall distribution of the preferred direction of the units.

      Strengths:

      The most significant strength of this paper lies in its unique combination of experimental tools. The author utilized a video-radiography method to capture 3D kinematics of the tongue movement during two behavioral tasks while simultaneously recording activity from two brain areas. Moreover, they employed a nerve-blocking procedure to halt sensory feedback. This specific dataset and experimental setup hold great potential for future research on the understudied orofacial segment of the sensory-motor area.

      Weaknesses:

      Aside from the last part of the result section, the majority of the analyses in this paper are focused on single units. I understand the need to characterize the number of single units that directly code for external variables like movement direction, especially for less-studied areas like the orofacial part of the sensory-motor cortex. However, as a field, our decade-long experience in the arm region of sensory-motor cortices suggests that many of the idiosyncratic behaviors of single units can be better understood when the neural activity is studied at the level of the state space of the population. By doing so, for the arm region, we were able to explain why units have "mixed selectivity" for external variables, why the tuning of units changes in the planning and execution phase of the movement, why activity in the planning phase does not lead to undesired muscle activity, etc. See (Gallego et al. 2017; Vyas et al. 2020; Churchland and Shenoy 2024) for a review. Therefore, I believe investigating the dynamics of the population activity in orofacial regions can similarly help the reader go beyond the peculiarities of single units and in a broader view, inform us if the same principles found in the arm region can be generalized to other segments of sensory-motor cortex.

      We thank and agree with the reviewer on the value of information gained from studying population activity. We also appreciate that population analyses have led to the understanding that individual neurons have “mixed selectivity”. We have shown previously that OSMCx neurons exhibit mixed selectivity in their population activity and clear separation between latent factors associated with gape and bite force levels (Arce-McShane FI, Sessle BJ, Ram Y, Ross CF, Hatsopoulos NG (2023) Multiple regions of primate orofacial sensorimotor cortex encode bite force and gape. Front Systems Neurosci. doi: 10.3389/fnsys.2023.1213279. PMID: 37808467 PMCID: 10556252), and chew-side and food types (Li Z & Arce-McShane FI (2023). Cortical representation of mastication in the primate orofacial sensorimotor cortex. Program No. NANO06.05. 2023 Neuroscience Meeting Planner. Washington, D.C.: Society for Neuroscience, 2023. Online.). 

      The primary goal of this paper was to characterize single units in the orofacial region and to do a follow-up paper on population activity. In the revised manuscript, we have now incorporated the results of population-level analyses. The combined results of the single unit and population analyses provide a deeper understanding of the cortical representation of 3D direction of tongue movements during natural feeding and drinking behaviors. 

      Further, for the nerve-blocking experiments, the authors demonstrate that the lack of sensory feedback severely alters how the movement is executed at the level of behavior and neural activity. However, I had a hard time interpreting these results since any change in neural activity after blocking the orofacial nerves could be due to either the lack of the sensory signal or, as the authors suggest, due to the NHPs executing a different movement to compensate for the lack of sensory information or the combination of both of these factors. Hence, it would be helpful to know if the authors have any hint in the data that can tease apart these factors. For example, analyzing a subset of nerve-blocked trials that have similar kinematics to the control.

      Thank you for bringing up this important point. We agree with the reviewer that any change in the neural activity may be attributed to lack of sensory signal or to compensatory changes or a combination of these factors. To tease apart these factors, we sampled an equal number of trials with similar kinematics for both control and nerve block feeding sessions. We added a clarifying description of this approach in the Results section of the revised manuscript: “To confirm this effect was not merely due to altered kinematics, we conducted parallel analyses using carefully subsampled trials with matched kinematic profiles from both control and nerve-blocked conditions.”

      Furthermore, we ran an additional analysis for the drinking datasets by subsampling a similar distribution of drinking movements from each condition. We compared the neural data, and specifically the directional tuning, across an equal number of trials with a similar left-right angle of movement in the last 100 ms of the tongue trajectory, nearest the spout. These analyses that control for similar kinematics showed that there was still a decrease in the proportion of directionally modulated neurons with nerve block compared to the control. This confirms that the results may be attributed to the lack of tactile information. These are now integrated in the revised paper under Methods section: Directional tuning of single neurons, as well as Results section: Effects of nerve block: Decreased directional tuning of MIo and SIo neurons and Figure 10 – figure supplement 1.
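
      One way such kinematic matching could be implemented is sketched below: trials are binned by their left-right angle and an equal number is drawn from each condition within every bin. The response does not describe the exact matching procedure, so the binning scheme, the function name `match_trials_by_angle`, and the simulated angles are all assumptions.

```python
import numpy as np

def match_trials_by_angle(angles_a, angles_b, bin_edges, rng):
    """Return trial indices for two conditions with equal counts in every angle bin."""
    bins_a = np.digitize(angles_a, bin_edges)
    bins_b = np.digitize(angles_b, bin_edges)
    keep_a, keep_b = [], []
    for b in np.union1d(bins_a, bins_b):
        idx_a = np.flatnonzero(bins_a == b)
        idx_b = np.flatnonzero(bins_b == b)
        n = min(len(idx_a), len(idx_b))
        if n == 0:
            continue  # no overlap in this bin, so no trials can be matched
        # Draw n trials without replacement from each condition.
        keep_a.extend(rng.choice(idx_a, size=n, replace=False))
        keep_b.extend(rng.choice(idx_b, size=n, replace=False))
    return np.sort(np.array(keep_a, dtype=int)), np.sort(np.array(keep_b, dtype=int))

# Hypothetical left-right angles (degrees) for control and nerve-block trials.
rng = np.random.default_rng(0)
control_angles = rng.uniform(-30, 30, size=120)
block_angles = rng.uniform(-10, 40, size=80)
edges = np.arange(-30, 41, 10)

keep_control, keep_block = match_trials_by_angle(control_angles, block_angles, edges, rng)
print(len(keep_control), len(keep_block))
```

      The trade-off raised in the review is visible here: narrower bins or matching over a longer segment of the trajectory gives tighter kinematic control but leaves fewer trials in each condition.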

      Reviewer #2 (Public review):

      Summary:

      This manuscript by Hosack and Arce-McShane examines the directional tuning of neurons in macaque primary motor (MIo) and somatosensory (SIo) cortex. The neural basis of tongue control is far less studied than, for example, forelimb movements, partly because the tongue's kinematics and kinetics are difficult to measure. A major technical advantage of this study is using biplanar video-radiography, processed with modern motion tracking analysis software, to track the movement of the tongue inside the oral cavity. Compared to prior work, the behaviors are more naturalistic behaviors (feeding and licking water from one of three spouts), although the animals were still head-fixed.

      The study's main findings are that:

      • A majority of neurons in MIo and a (somewhat smaller) percentage of SIo modulated their firing rates during tongue movements, with different modulations depending on the direction of movement (i.e., exhibited directional tuning). Examining the statistics of tuning across neurons, there was anisotropy (e.g., more neurons preferring anterior movement) and a lateral bias in which tongue direction neurons preferred that was consistent with the innervation patterns of tongue control muscles (although with some inconsistency between monkeys).

      • Consistent with this encoding, tongue position could be decoded with moderate accuracy even from small ensembles of ~28 neurons.

      • There were differences observed in the proportion and extent of directional tuning between the feeding and licking behaviors, with stronger tuning overall during licking. This potentially suggests behavioral context-dependent encoding.

      • The authors then went one step further and used a bilateral nerve block to the sensory inputs (trigeminal nerve) from the tongue. This impaired the precision of tongue movements and resulted in an apparent reduction and change in neural tuning in MIo and SIo.

      Strengths:

      The data are difficult to obtain and appear to have been rigorously measured, and provide a valuable contribution to this under-explored subfield of sensorimotor neuroscience. The analyses adopt well-established methods, especially from the arm motor control literature, and represent a natural starting point for characterizing tongue 3D direction tuning.

      Weaknesses:

      There are alternative explanations for some of the interpretations, but those interpretations are described in a way that clearly distinguishes results from interpretations, and readers can make their own assessments. Some of these limitations are described in more detail below.

      One weakness of the current study is that there is substantial variability in results between monkeys, and that only one session of data per monkey/condition is analyzed (8 sessions total). This raises the concern that the results could be idiosyncratic. The Methods mention that other datasets were collected, but not analyzed because the imaging pre-processing is very labor-intensive. While I recognize that time is precious, I do think in this case the manuscript would be substantially strengthened by showing that the results are similar on other sessions.

      We acknowledge the reviewer’s concern about inter-subject variability. Animal feeding and drinking behaviors are quite stable across sessions, thus, we do not think that additional sessions will address the concern that the results could be idiosyncratic. Each of the eight datasets analyzed here has sufficient neural and kinematic data to capture neural and behavioral patterns. Nevertheless, we performed some of the analyses on a second feeding dataset from Monkey R. The results from analyses on a subset of this data were consistent across datasets; for example, (1) similar proportions of directionally tuned neurons, (2) similar distances between population trajectories (t-test p > 0.9), and (3) a consistently smaller distance between Anterior-Posterior pairs than others in MIo (t-test p < 0.05) but not SIo (p > 0.1). 

      This study focuses on describing directional tuning using the preferred direction (PD) / cosine tuning model popularized by Georgopoulos and colleagues for understanding neural control of arm reaching in the 1980s. This is a reasonable starting point and a decent first-order description of neural tuning. However, the arm motor control field has moved far past that viewpoint, and in some ways, an over-fixation on static representational encoding models and PDs held that field back for many years. The manuscript would benefit from drawing the readers' attention (perhaps in the Discussion) to the fact that PDs are a very simple starting point for characterizing how cortical activity relates to kinematics, but that there is likely much richer population-level dynamical structure and that a more mechanistic, control-focused analytical framework may be fruitful. A good review of this evolution in the arm field can be found in Vyas S, Golub MD, Sussillo D, Shenoy K. 2020. Computation Through Neural Population Dynamics. Annual Review of Neuroscience. 43(1):249-75.
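
      For readers unfamiliar with the PD / cosine tuning model named here, a minimal sketch follows. In its standard form the firing rate is modeled as a baseline plus a term proportional to the cosine of the angle between the movement direction and the neuron's preferred direction, which reduces to a linear regression on the unit direction vector. The function name `fit_preferred_direction` and the simulated neuron are assumptions; this is not necessarily the exact regression used in the manuscript.

```python
import numpy as np

def fit_preferred_direction(directions, rates):
    """Fit rate = baseline + b . d by least squares; the PD is b normalized to unit length."""
    directions = np.asarray(directions, dtype=float)
    design = np.column_stack([np.ones(len(directions)), directions])
    coef, *_ = np.linalg.lstsq(design, np.asarray(rates, dtype=float), rcond=None)
    baseline, b = coef[0], coef[1:]
    depth = np.linalg.norm(b)  # modulation depth
    return baseline, depth, b / depth

# Simulated neuron (assumed values): 10 Hz baseline, 5 Hz modulation depth.
rng = np.random.default_rng(1)
directions = rng.normal(size=(50, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)  # unit 3D directions
true_pd = np.array([0.0, 0.6, 0.8])
rates = 10.0 + 5.0 * directions @ true_pd

baseline, depth, pd_hat = fit_preferred_direction(directions, rates)
print(baseline, depth, pd_hat)
```

      Because the model summarizes each neuron with a single static vector, it cannot capture the time-varying, population-level structure that the reviewer asks the authors to acknowledge.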

      Thank you for highlighting this important point. Research on orofacial movements hasn't progressed at the same pace as limb movement studies. Our manuscript focused specifically on characterizing the 3D directional tuning properties of individual neurons in the orofacial area—an analysis that has not been conducted previously for orofacial sensorimotor control. While we initially prioritized this individual neuron analysis, we recognize the value of broader population-level insights.

      Based on your helpful feedback, we have incorporated additional population analyses to provide a more comprehensive picture of orofacial sensorimotor control and expanded our discussion section. We appreciate your expertise in pushing our work to be more thorough and aligned with current neuroscience approaches.

      Can the authors explain (or at least speculate) why there was such a large difference in behavioral effect due to nerve block between the two monkeys (Figure 7)?

      We acknowledge this as a variable inherent to this type of experimentation. Previous studies have found large kinematic variation in the effect of oral nerve block as well as in the following compensatory strategies between subjects. Each animal’s biology and response to perturbation vary naturally. Indeed, our subjects exhibited different feeding behavior even in the absence of nerve block perturbation (see Figure 2 in Laurence-Chasen et al., 2022). This is why each individual serves as its own control.

      Do the analyses showing a decrease in tuning after nerve block take into account the changes (and sometimes reduction in variability) of the kinematics between these conditions? In other words, if you subsampled trials to have similar distributions of kinematics between Control and Block conditions, does the effect hold true? The extreme scenario to illustrate my concern is that if Block conditions resulted in all identical movements (which of course they don't), the tuning analysis would find no tuned neurons. The lack of change in decoding accuracy is another yellow flag that there may be a methodological explanation for the decreased tuning result.

      Thank you for bringing up this point. We accounted for the changes in the variability of the kinematics between the control and nerve block conditions in the feeding dataset where we sampled an equal number of trials with similar kinematics for both control and nerve block. However, we did not control for similar kinematics in the drinking task. In the revised manuscript, we have clarified this and performed similar analysis for the drinking task. We sampled a similar distribution of drinking movements from each condition. We compared the neural data from an equal number of trials with a similar left-right angle of movement in the last 100 ms of the tongue trajectory, nearest the spout. There was a decrease in the percentage of neurons that were directionally modulated (between 30 and 80%) with nerve block compared to the control. These results have been included in the revised paper under Methods section: Directional tuning of single neurons, as well as Results section: Effects of nerve block: Decreased directionality of MIo and SIo neurons.

      While the results from decoding using KNN did not show significant differences between decoding accuracies in control vs. nerve block conditions, the results from the additional factor analysis and decoding using LSTM were consistent with the decrease in directional tuning at the level of individual neurons.  

      The manuscript states that "Our results suggest that the somatosensory cortex may be less involved than the motor areas during feeding, possibly because it is a more ingrained and stereotyped behavior as opposed to tongue protrusion or drinking tasks". Could an alternative explanation be more statistical/technical in nature: that during feeding, there will be more variability in exactly what somatosensory afferent signals are being received from trial to trial (because slight differences in kinematics can have large differences in exactly where the tongue is and the where/when/how of what parts of it are touching other parts of the oral cavity)? This variability could "smear out" the apparent tuning using these types of trial-averaged analyses. Given how important proprioception and somatosensation are for not biting the tongue or choking, the speculation that somatosensory cortical activity is suppressed during feedback is very counter-intuitive to this reviewer.

      Thank you for bringing up this point. We have now incorporated this in our revised Discussion (see Comparison between MIo and SIo). We agree with the reviewer that trial-by-trial variability in the afferent signals may account for the lower directional signal in SIo during feeding than in drinking. Indeed, SIo’s mean-matched Fano factor in feeding was significantly higher than that in drinking (Author response image 1). Moreover, the results of the additional population and decoding analyses also support this.

      Author response image 1.

      Comparison of mean-matched Fano factor between SIo neurons during feeding and drinking control tasks across both subjects (Wilcoxon rank sum test, p < 0.001).

      Reviewer #3 (Public review):

      Summary:

      In this study, the authors aim to uncover how 3D tongue direction is represented in the Motor (M1o) and Somatosensory (S1o) cortex. In non-human primates implanted with chronic electrode arrays, they use X-ray-based imaging to track the kinematics of the tongue and jaw as the animal is either chewing food or licking from a spout. They then correlate the tongue kinematics with the recorded neural activity. Using linear regressions, they characterize the tuning properties and distributions of the recorded population during feeding and licking. Then, they recharacterize the tuning properties after bilateral lidocaine injections in the two sensory branches of the trigeminal nerve. They report that their nerve block causes a reorganization of the tuning properties. Overall, this paper concludes that M1o and S1o both contain representations of the tongue direction, but their numbers, their tuning properties, and susceptibility to perturbed sensory input are different.

      Strengths:

      The major strengths of this paper are in the state-of-the-art experimental methods employed to collect the electrophysiological and kinematic data.

      Weaknesses:

      However, this paper has a number of weaknesses in the analysis of this data.

      It is unclear how reliable the neural responses are to the stimuli. The trial-by-trial variability of the neural firing rates is not reported. Thus, it is unclear if the methods used for establishing that a neuron is modulated and tuned to a direction are susceptible to spurious correlations. The authors do not use shuffling or bootstrapping tests to determine the robustness of their fits or determining the 'preferred direction' of the neurons. This weakness colors the rest of the paper.

      Thank you for raising these points. We have performed the following additional analyses: (1) We have added analyses to ensure that the results could not be explained by neural variability. To show the trial-by-trial variability of the neural firing rates, we have calculated the Fano factor (mean overall = 1.34747; control = 1.46471; nerve block = 1.23023). The distribution was similar across directions, suggesting that responses of MIo and SIo neurons to varying 3D directions were reliable. (2) We have used a bootstrap procedure to ensure that directional tuning cannot be explained by mere chance. (3) To test the robustness of our PDs we also performed a bootstrap test, which yielded the same results for >90% of neurons, and a multiple linear regression test for fit to a cosine-tuning function. In the revised manuscript, the Methods and Results sections have been updated to include these analyses.  

      Author response image 2.

      Comparison of Fano Factor across directions for MIo and SIo Feeding Control (Kruskal-Wallis, p > 0.7).
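
      As an illustration of the variability measure quoted above, the following is a minimal sketch of a per-direction Fano factor (variance of spike counts across trials divided by their mean). The spike counts and direction labels are hypothetical, and this is the plain Fano factor rather than the mean-matched variant; it is not the analysis code used for the manuscript.

```python
from statistics import mean, variance


def fano_factor(spike_counts):
    """Variance-to-mean ratio of spike counts across trials.

    Uses the sample variance (n - 1 denominator). Returns NaN when the
    mean count is zero, because the ratio is undefined in that case.
    """
    m = mean(spike_counts)
    if m == 0:
        return float("nan")
    return variance(spike_counts) / m


# Hypothetical spike counts per trial, grouped by movement direction.
counts_by_direction = {
    "anterior": [4, 6, 5, 7, 3],
    "posterior": [2, 2, 3, 1, 2],
}

# One Fano factor per direction; similar values across directions would
# suggest that response variability does not depend on direction.
fano_by_direction = {
    direction: fano_factor(counts)
    for direction, counts in counts_by_direction.items()
}
```

      A value near 1 is what a Poisson process would give; comparing the values across direction groups is one simple way to check whether variability is direction-dependent.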

      The authors compare the tuning properties during feeding to those during licking but only focus on the tongue-tip. However, the two behaviors are different also in their engagement of the jaw muscles. Thus many of the differences observed between the two 'tasks' might have very little to do with an alteration in the properties of the neural code - and more to do with the differences in the movements involved.

      Using the tongue tip for the kinematic analysis of tongue directional movements was a deliberate choice as the anterior region of the tongue is highly mobile and sensitive due to a higher density of mechanoreceptors. The tongue tip is the first region that touches the spout in the drinking task and moves the food into the oral cavity for chewing and subsequent swallowing. 

      We agree with the reviewer that the jaw muscles are engaged differently in feeding vs. drinking (see Fig. 2). For example, a wider variety of jaw movements along the three axes are observed in feeding compared to the smaller amplitude and mostly vertical jaw movements in drinking. Also, the tongue movements are very different between the two behaviors. In feeding, the tongue moves in varied directions to position the food between left-right tooth rows during chewing, whereas in the drinking task, the tongue moves to discrete locations to receive the juice reward. Moreover, the tongue-jaw coordination differs between tasks; maximum tongue protrusion coincides with maximum gape in drinking but with minimum gape in the feeding behavior. Thus, the different tongue and jaw movements required in each behavior may account for some of the differences observed in the directional tuning properties of individual neurons and population activity. These points have been included in the revised Discussion.

      Author response image 3.

      Tongue tip position (mm) and jaw pitch (degrees) during feeding (left) and drinking (right) behaviors. The most protruded tongue position coincides with minimum gape (jaw pitch at 0°) during feeding but with maximum gape during drinking.

      Many of the neurons are likely correlated with both jaw movements and tongue movements - this complicates the interpretations and raises the possibility that the differences in tuning properties across tasks are trivial.

      We thank the reviewer for raising this important point. In fact, we verified in a previous study whether the correlation between the tongue and jaw kinematics might explain differences in the encoding of tongue kinematics and shape in MIo (see Supplementary Fig. 4 in Laurence-Chasen et al., 2023): “Through iterative sampling of sub-regions of the test trials, we found that correlation of tongue kinematic variables with mandibular motion does not account for decoding accuracy. Even at times where tongue motion was completely un-correlated with the jaw, decoding accuracy could be quite high.”

      The results obtained from population analyses showing distinct properties of population trajectories in feeding vs. drinking behaviors provide strong support to the interpretation that directional information varies between these behaviors.

      The population analyses for decoding are rudimentary and provide very coarse estimates (left, center, or right), it is also unclear what the major takeaways from the population decoding analyses are. The reduced classification accuracy could very well be a consequence of linear models being unable to account for the complexity of feeding movements, while the licking movements are 'simpler' and thus are better accounted for.

      We thank the reviewer for raising this point. The population decoding analyses provide additional insight on the directional information in population activity, as well as a point of comparison with the results of numerous decoding studies on the arm region of the sensorimotor cortex. In the revised version, we have included the results from decoding tongue direction using a long short-term memory (LSTM) network for sequence-to-sequence decoding. These results differed from the KNN results, indicating that a linear model such as KNN was better for drinking and that a non-linear and continuous decoder was better suited for feeding. These results have been included in the revised manuscript.

      The nature of the nerve block and what sensory pathways are being affected is unclear - the trigeminal nerve contains many different sensory afferents - is there a characterization of how effectively the nerve impulses are being blocked? Have the authors confirmed or characterized the strength of their inactivation or block? I was unable to find any electrophysiological evidence characterizing the perturbation.

      The strength of the nerve block is characterized by a decrease in the baseline firing rate of SIo neurons, as shown in Supplementary Figure 6 of “Loss of oral sensation impairs feeding performance and consistency of tongue–jaw coordination” (Laurence-Chasen et al., 2022).

      Overall, while this paper provides a descriptive account of the observed neural correlations and their alteration by perturbation, a synthesis of the observed changes and some insight into neural processing of tongue kinematics would strengthen this paper.

      We thank the reviewer for this suggestion. We have revised the Discussion to provide a synthesis of the results and insights into the neural processing of tongue kinematics.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) The procedure for anesthesia explained in the method section was not clear to me. The following information was missing: what drug/dose was used? How long the animal was under anesthesia? How long after the recovery the experiments were done?

      The animals were fully sedated with ketamine (100 mg/ml, 10 mg/kg) for less than 30 minutes, and all of the data was collected within 90 minutes after the nerve block was administered.

      (2) In Figure 10, panels A and B are very close together, it was not at first clear whether the text "Monkey R, Monkey Y" belongs to panel A or B.

      We have separated the two panels further in the revised figure.

      (3) I found Figure 11 very busy and hard to interpret. Separating monkeys, fitting the line for each condition, or using a bar plot can help with the readability of the figure.

      Thank you for the suggestion. We agree with you and have reworked this figure. To simplify it we have shown the mean accuracy across iterations.

      (4) I found the laterality discussions like "This signifies that there are more neurons in the left hemisphere contributes toward one direction of tongue movement, suggesting that there is some laterality in the PDs of OSMCx neurons that varies between individuals" a bit of an over-interpretation of the data, given the low n value and the dissimilarity in how strongly the nerve block altered the monkeys' behavior.

      Thank you for sharing this viewpoint. We do think that laterality is a good point of comparison with studies on M1 neurons in the arm/hand region. In our study, we found that the peak of the PD distribution coincides with leftward tongue movements in feeding. The distribution of PDs provides insight into how tongue muscles are coordinated during movement. Intrinsic and extrinsic tongue muscles are involved in shaping the tongue (e.g., elongation, broadening) and positioning the tongue (e.g., protrusion/retraction, elevation/depression), respectively. These muscles receive bilateral motor innervation except for genioglossus. Straight tongue protrusion requires the balanced action of the right and left genioglossi while the lateral protrusion involves primarily the contralateral genioglossus. Given this unilateral innervation pattern, we hypothesized that left MIo/SIo neurons would preferentially respond to leftward tongue movements, corresponding to right genioglossus activation. 

      Reviewer #2 (Recommendations for the authors):

      Is the observation that tuning peaks occur most frequently toward the anterior and superior directions consistent with the statistics of the movements the tongue typically makes? This could be analogous to anisotropies previously reported in the arm literature, e.g., Lillicrap TP, Scott SH. 2013. Preference Distributions of Primary Motor Cortex Neurons Reflect Control Solutions Optimized for Limb Biomechanics. Neuron. 77(1):168-79

      Thank you for bringing our attention to analogous findings by Lillicrap & Scott, 2013. Indeed, we do observe the highest number of movements in the Anterior Superior directions, followed by the Posterior Inferior. This does align with the distribution of tuning peaks that we observed. Author response image 4 shows the proportions of observed movements in each group of directions across all feeding datasets. We have incorporated this data in the Results section: Neuronal modulation patterns differ between MIo and SIo, as well as added this point in the Discussion.

      Author response image 4.

      Proportion of feeding trials in each group of directions. Error bars represent ±1 standard deviation across datasets (n = 4).

      "The Euclidean distance was used to identify nearest neighbors, and the number of nearest neighbors used was K = 7. This K value was determined after testing different Ks which yielded comparable results." In general, it's a decoding best practice to tune hyperparameters (like K) on fully held-out data from the data used for evaluation. Otherwise, this tends to slightly inflate performance because one picks the hyperparameter that happened to give the best result. It sounds like that held-out validation set wasn't used here. I don't think that's going to change the results much at all (especially given the "comparable results" comment), but providing this suggestion for the future. If the authors replicate results on other datasets, I suggest they keep K = 7 to lock in the method.

      K = 7 was chosen based on the size of our smallest training dataset (n = 55). The purpose of testing different K values was not to select which value gave the best result, but to demonstrate that similar K values did not affect the results significantly. We tested the different K values on a subset of the feeding data, but that data was not fully held-out from the training set. We will keep your suggestion in mind for future analysis.
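
      The held-out protocol suggested by the reviewer can be sketched as follows, assuming a plain Euclidean-distance KNN classifier with majority voting. The two-dimensional features, the three direction labels, and the train/validation/test split are hypothetical toy data, not the recorded neural data or the authors' decoder.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Label x by majority vote among its k nearest training points."""
    neighbors = sorted(
        (math.dist(a, x), label) for a, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_X, train_y, eval_X, eval_y, k):
    """Fraction of evaluation points that are classified correctly."""
    hits = sum(
        knn_predict(train_X, train_y, x, k) == y
        for x, y in zip(eval_X, eval_y)
    )
    return hits / len(eval_y)


def make_split(offsets):
    """Place the same offsets around each hypothetical class center."""
    centers = {"left": (-5.0, 0.0), "center": (0.0, 0.0), "right": (5.0, 0.0)}
    X, y = [], []
    for label, (cx, cy) in centers.items():
        for dx, dy in offsets:
            X.append((cx + dx, cy + dy))
            y.append(label)
    return X, y


# Three disjoint splits: training, validation (for K), and test (for reporting).
train_X, train_y = make_split(
    [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0, 0), (0, 1), (1, 0)]
)
val_X, val_y = make_split([(0, -1), (-1, 0)])
test_X, test_y = make_split([(0.25, 0.25), (-0.25, -0.75)])

# Sensitivity to K is checked on the validation split only.
val_acc = {k: accuracy(train_X, train_y, val_X, val_y, k) for k in (3, 5, 7, 9)}

# The test split is touched once, with K fixed in advance.
test_acc = accuracy(train_X, train_y, test_X, test_y, 7)
```

      Keeping the test split out of the K comparison means that the reported accuracy cannot be inflated by the choice of K, even when that choice makes little difference.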

      The smoothing applied to Figure 2 PSTHs appears perhaps excessive (i.e., it may be obscuring interesting finer-grained details of these fast movements). Can the authors reduce the 50 ms Gaussian smoothing (I assume this is the s.d.?) ~25 ms is often used in studying arm kinematics. It also looks like the movement-related modulation may not be finished in these 200 ms / 500 ms windows. I suggest extending the shown time window. It would also be helpful to show some trial-averaged behavior (e.g. speed or % displacement from start) under or behind the PSTHs, to give a sense of what phase of the movement the neural activity corresponds to.

      Thank you for the suggestion. We have taken your suggestions into consideration and modified Figure 2 accordingly. We decreased the Gaussian kernel to 25 ms and extended the time window shown. The trial-averaged anterior/posterior displacement was also added to the drinking PSTHs.
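
      To illustrate why the kernel width matters, here is a minimal sketch of Gaussian smoothing of a binned firing-rate trace, assuming 1 ms bins and that the quoted kernel width is the standard deviation. The function names and the impulse-like test trace are hypothetical and are not taken from the analysis code.

```python
import math


def gaussian_kernel(sigma_ms, bin_ms=1.0, half_width_sd=3.0):
    """Normalized Gaussian weights truncated at +/- half_width_sd sigmas."""
    half = int(round(half_width_sd * sigma_ms / bin_ms))
    weights = [
        math.exp(-0.5 * ((i * bin_ms) / sigma_ms) ** 2)
        for i in range(-half, half + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_psth(rate, kernel):
    """Convolve a rate trace with the kernel, renormalizing at the edges."""
    half = len(kernel) // 2
    n = len(rate)
    smoothed = []
    for t in range(n):
        acc = 0.0
        norm = 0.0
        for j, w in enumerate(kernel):
            idx = t + j - half
            if 0 <= idx < n:  # ignore kernel taps that fall outside the trace
                acc += w * rate[idx]
                norm += w
        smoothed.append(acc / norm)
    return smoothed


# Hypothetical trace: a brief burst in the middle of a 201 ms window.
impulse = [0.0] * 201
impulse[100] = 40.0

smoothed_25 = smooth_psth(impulse, gaussian_kernel(25.0))
smoothed_50 = smooth_psth(impulse, gaussian_kernel(50.0))
```

      The wider kernel spreads the same burst over more bins and lowers its peak, which is how heavier smoothing can hide fast modulation.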

      Reviewer #3 (Recommendations for the authors):

      The major consideration here is that the data reported for feeding appears to be very similar to that reported in a previous study:

      "Robust cortical encoding of 3D tongue shape during feeding in macaques"

      Are the neurons reported here the same as the ones used in this previous paper? It is deeply concerning that this is not reported anywhere in the methods section.

      These are the same neurons as in our previous paper, though here we include several additional datasets of the nerve block and drinking sessions. We have now included this in the methods section.

      Second, I strongly recommend that the authors consider a thorough rewrite of this manuscript and improve the presentation of the figures. As written, it was not easy to follow the paper, the logic of the experiments, or the specific data being presented in the figures.

      Thank you for this suggestion. We have done an extensive rewrite of the manuscript and revision of the figures.

      A few recommendations:

      (1) Please structure your results sections and use descriptive topic sentences to focus the reader. In the current version, it is unclear what the major point being conveyed for each analysis is.

      Thank you for this suggestion. We have added topic sentences to begin each section of the Results.

      (2) Please show raster plots for at least a few example neurons so that the readers have a sense of what the neural responses look like across trials. Is all of Figure 2 one example neuron or are they different neurons? Error bars for PETH would be useful to show the reliability and robustness of the tuning.

      Figure 2 shows different neurons, one from MIo and one from SIo for each task. There is shading showing ±1 standard error around the line for each direction, however this was a bit difficult to see. In addition to the other changes we have made to these figures, we made the lines smaller and darkened the error bar shading to accentuate this. We also added raster plots corresponding to the same neurons represented in Figure 2 as a supplement.

      (3) Since there are only two data points, I am not sure I understand why the authors have bar graphs and error bars for graphs such as Figure 3B, Figure 5B, etc. How can one have an error bar and means with just 2 data points?

      Those bars represent the standard error of the proportion. We have changed the y-axis label on these figures to make this clearer.
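
      For readers unfamiliar with this quantity, a minimal sketch follows, assuming the usual binomial formula SE = sqrt(p(1 - p)/n) for a proportion p estimated from n neurons. The formula choice and the counts are illustrative assumptions; the response does not state the exact computation.

```python
import math


def proportion_se(k, n):
    """Return the proportion k/n and its binomial standard error."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    return p, math.sqrt(p * (1.0 - p) / n)


# Hypothetical example: 62 of 100 recorded neurons are directionally tuned.
p_tuned, se_tuned = proportion_se(62, 100)
```

      Under this assumption the error bar reflects the number of neurons sampled, not the number of animals, which is why it can be drawn even with two subjects.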

      (4) Results in Figure 6 could be due to differential placement of the electrodes across the animals. How is this being accounted for?

      Yes, this is a possibility which we have mentioned in the discussion. Even with careful placement there is no guarantee to capture a set of neurons with the exact same function in two subjects, as every individual is different. Rather we focus on analyses of data within the same animal. The purpose of Figure 6 is to show the difference between MIo and SIo, and between the two tasks, within the same subject. The more salient result from calculating the preferred direction is that there is a change in the distribution between control and nerve block within the same exact population. Discussions relating to the comparison between individuals are speculative and cannot be confirmed without the inclusion of many more subjects.

      (5) For Figure 7, I would recommend showing the results of the Sham injection in the same figure instead of a supplement.

      Thank you for the suggestion, we have added these results to the figure.

      (6) I think the effects of the sensory block on the tongue kinematics are underexplored in Figure 7 and Figure 8. The authors could explore the deficits in tongue shape, and the temporal components of the trajectory.

      Some of these effects on feeding have been explored in a previous paper, Laurence-Chasen et al., 2022. We performed some additional analyses on changes to kinematics during drinking, including the number of licks per 10 second trial and the length of individual licks. The results of these are included below. We also calculated the difference in the speed of tongue movement during drinking, which generally decreased and exhibited an increase in variance with nerve block (f-test, p < 0.001). However, we have not included these figures in the main paper as they do not inform us about directionality.

      Author response image 5.

      Left halves of hemi-violins (black) are control and right halves (red) are nerve block for an individual. Horizontal black lines represent the mean and horizontal red lines the median. Results of two-tailed t-test and f-test are indicated by asterisks and crosses, respectively: *,† p < 0.05; **,†† p < 0.01; ***,††† p < 0.001.

      (9) In Figures 9 and 10. Are the same neurons being recorded before and after the nerve block? It is unclear if the overall "population" properties are different, or if the properties of individual neurons are changing due to the nerve block.

      Yes, the same neurons are being recorded before and after nerve block. Specifically, Figure 9B shows that the properties of many individual neurons do change due to the nerve block. Differences in the overall population response may be attributed to some of the units having reduced/no activity during the nerve block session.

      Additionally, I recommend that the authors improve their introduction and provide more context to their discussion. Please elaborate on what you think are the main conceptual advances in your study, and place them in the context of the existing literature. By my count, there are 26 citations in this paper, 4 of which are self-citations - clearly, this can be improved upon.

      Thank you for this suggestion. We have done an extensive rewrite of the Introduction and Discussion. We discussed the main conceptual advances in our study and placed them in the context of the existing literature.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      We thank the reviewers for their careful assessment and enthusiastic appreciation of our work.

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      In this article, Thomas et al. use a super-resolution approach in living cells to track proteins involved in the fusion event of sexual reproduction. They study the spatial organization and dynamics of the actin fusion focus, a key structure in cell-cell fusion in Schizosaccharomyces pombe. The researchers have adapted a high-precision centroid mapping method using three-color live-cell epifluorescence imaging to map the dynamic architecture of the fusion focus during yeast mating. The approach relies on tracking the centroid of fluorescence signals for proteins of interest, spatially referenced to Myo52-mScarlet-I (as a robust marker) and temporally referenced using a weakly fluorescent cytosolic protein (mRaspberry), which redistributes strongly upon fusion. The trajectories of five key proteins, including markers of polarity, cytoskeleton, exocytosis and membrane fusion, were compared to Myo52 over a 75-minute window spanning fusion. Their observations indicate that secretory vesicles maintain a constant distance from the plasma membrane whereas the actin network compacts. Most importantly, they discovered a positive feedback mechanism in which myosin V (Myo52) transports Fus1 formin along pre-existing actin filaments, thereby enhancing aster compaction.

      This article is well written, the arguments are convincing and the assertions are balanced. The centroid tracking method has been clearly and solidly controlled. Overall, this is a solid addition to our understanding of cytoskeletal organization in cell fusion.

      Major comments: No major comment.

      Minor comments: Page 8, the authors wrote "Upon depletion of Myo52, Ypt3 did not accumulate at the fusion focus (Figure 3C). A thin, wide localization at the fusion site was occasionally observed (Figure 3C, Movies S3)": is there a quantification of this accumulation in the mutant?

      We will provide the requested quantification. The localization is very faint, so we are not sure that quantification will capture this faithfully, but we will try.

      The frame rate of the movies could be improved for reader comfort: for example, movie S6 lasts 0.5 sec.

      We agree that movies S3 and S6 frame rates could be improved. We will provide them with slower frame rate.

      Reviewer #1 (Significance (Required)):

      This study represents a conceptual and technical breakthrough in our understanding of cytoskeletal organization during cell-cell fusion. The authors introduce a high-precision, three-color live-cell centroid mapping method capable of resolving the spatio-temporal dynamics of protein complexes at the nanometer scale in living yeast cells. This methodological innovation enables systematic and quantitative mapping of the dynamic architecture of proteins at the cell fusion site, making it a powerful live-cell imaging approach. However, it is important to keep in mind that the increased precision achieved through averaging comes at the expense of overlooking atypical or outlier behaviors. The authors discovered a myosin V-dependent mechanism for the recruitment of formin that leads to actin aster compaction. The identification of Myo52 (myosin V) as a transporter of Fus1 (formin) to the fusion focus adds a new layer to our understanding of how polarized actin structures are generated and maintained during developmentally regulated processes such as mating.

      Previous studies have shown the importance of formins and myosins during fusion, but this paper provides a quantitative and dynamic mapping that demonstrates how Myo52 modulates Fus1 positioning in living cells. This provides a better understanding of actin organization, beyond what has been demonstrated by fixed-cell imaging or genetic perturbation.

      Audience: Cell biologists working on actin dynamics, cell-cell fusion and intracellular transport. Scientists involved in live-cell imaging, single particle tracking and cytoskeleton modeling.

      I have expertise in live-cell microscopy, image analysis, fungal growth machinery and actin organization.

      We thank the reviewer for their appreciation of our work.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      A three-color imaging approach to use centroid tracking is employed to determine the high resolution position over time of tagged actin fusion focus proteins during mating in fission yeast. In particular, the position of different protein components (tagged in a 3rd color) were determined in relation to the position (and axis) of the molecular motor Myo52, which is tagged with two different colors in the mating cells. Furthermore, time is normalized by the rapid diffusion of a weak fluorescent protein probe (mRaspberry) from one cell to the other upon fusion pore opening. From this approach multiple important mechanistic insights were determined for the compaction of fusion focus proteins during mating, including the general compaction of different components as fusion proceeds with different proteins having specific stereotypical behaviors that indicate underlying molecular insights. For example, secretory vesicles remain a constant distance from the plasma membrane, whereas the formin Fus1 rapidly accumulates at the fusion focus in a Myo52-dependent manner.

      I have minor suggestions/points: (1) Figure 1, for clarity it would be helpful if the cells shown in B were in the same orientation as the cartoon cells shown in A. Similarly, it would be helpful to have the orientation shown in D the same as the data that is subsequently presented in the rest of the manuscript (such as Figure 2) where time is on the X axis and distance (position) is on the Y axis.

      We have turned each image in panel B by 180° to match the cartoon in A. For panel D, we are not sure what the reviewer would like. This panel shows the coordinates of each Myo52 position, whereas Figure 2 shows oriented distance (on the Y axis) over time (on the X axis). Perhaps the reviewer suggests that we should display panel D with a rotation onto the Y axis rather than the X axis. We feel that this would not bring more clarity and prefer to keep it as is.

      (2) Figure 2, for clarity useful to introduce how the position of Myo52 changes over time with respect to the fusion site (plasma membrane) earlier, and then come back to the positions of different proteins with respect to Myo52 shown in 2E. Currently the authors discuss this point after introducing Figure 2E, but better for the reader to have this in mind beforehand.

      We have added a sentence at the start of the section describing Figure 2, pointing out that the static appearance of Myo52 is due to it being used as reference, but that in reality, it moves relative to the plasma membrane: “Because Myo52 is the reference, its trace is flat, even though in reality Myo52 also moves relative to other proteins and the plasma membrane (see Figure 2E)”. This change is already in the text.

      (3) First sentence of page 8 "..., peaked at fusion time and sharply dropped post-fusion (Figure S3)." Figure S3 should be cited so that the reader knows where this data is presented.

      Thanks, we have added the missing figure reference to the text.

      (4) Figure 3D-H, why is Exo70 used as a marker for vesicles instead of Ypt3 for these experiments? Exo70 seems to have a more confusing localization than Ypt3 (3C vs 3D), which seems to complicate interpretations.

      There are two main reasons for this choice. First, the GFP-Ypt3 fluorescence intensity is lower than that of Exo70-GFP, which makes analysis more difficult and less reliable. Second, in contrast to Exo70-GFP, where the endogenous gene is tagged at the native genomic locus, GFP-Ypt3 is expressed as an additional copy alongside endogenous untagged Ypt3. Although GFP-Ypt3 was reported to be fully functional, as it can complement the lethality of a ypt3 temperature-sensitive mutant (Cheng et al, MBoC 2002), its expression levels are non-native and we do not have a strain in which ypt3 is tagged at the 5’ end at the native genomic locus. For these reasons, we preferred to examine in detail the localization of Exo70. We do not think it complicates interpretations. Exo70 faithfully decorates vesicles and exhibits the same localization as Ypt3 in WT cells (see Figure 2D) and in myo52-AID (see Figure 3C-D). We realize that our text was a bit confusing as we opposed the localization of Exo70 and Ypt3, when all we wanted to state was that the Exo70-GFP signal is stronger. We have corrected this in the text.

      (5) Page 10, end of first paragraph, "We conclude...and promotes separation of Myo52 from the vesicles." This is an interesting hypothesis/interpretation that is consistent with the spatial-temporal organization of vesicles and the compacting fusion focus, but the underlying molecular mechanism has not been established.

      This is an interpretation that is in line with our data. Firm conclusion that the organization of the actin fusion focus imposes a steric barrier to bulk vesicle entry will require in vitro reconstitution of an actin aster driven by formin-myosin V feedback and addition of myosin V vesicle-like cargo, which can be a target for future studies. To make clear that it is an interpretation and not a definitive statement, we have added “likely” to the sentence, as in: “We conclude that the distal position of vesicles in WT cells is a likely steric consequence of the architecture of the fusion focus, which restricts space at the center of the actin aster and promotes separation of Myo52 from the vesicles”.

      (6) Figure 5F and 5G, the results are confusing and should be discussed further. Depletion of Myo52 decreases Fus1 long-range movements, indicating that Fus1 is being transported by Myo52 (5F). Similarly, the Fus1 actin assembly mutant greatly decreases Fus1 long-range movements and prevents Myo52 binding (5G), perhaps indicating that Fus1-mediated actin assembly is important. It seems the author's interpretations are oversimplified.

      We show that Myo52 is critical for Fus1 long-range movements, as stated by the reviewer. We also show that Fus1-mediated actin assembly is important. The question is in what way.

      One possibility is that FH2-mediated actin assembly powers the movement, which in this case represents the displacement of the formin due to actin monomer addition on the polymerizing filament. A second possibility is that actin filaments assembled by Fus1 somehow help Myo52 move Fus1. This could be for instance because Fus1-assembled actin filaments are preferred tracks for Myo52-mediated movements, or because they allow Myo52 to accumulate in the vicinity of Fus1, enhancing their chance encounter and thus the number of long-range movements (on any actin track). Based on the analysis of the K1112A point mutant in Fus1 FH2 domain, our data cannot discriminate between these three different options, which is why we concluded that the mutant allele does not allow us to make a firm conclusion. However, the Myo52-dependence clearly shows that a large fraction of the movements requires the myosin V. We have clarified the end of the paragraph in the following way: “Therefore, analysis of the K1112A mutant phenotype does not allow us to clearly distinguish between Fus1-powered from Myo52-powered movements. Future work will be required to test whether, in addition to myosin V-dependent transport, Fus1-mediated actin polymerization also directly contributes to Fus1 long-range movements.”

      (7) Figure 6, why not measure the fluorescence intensity of Fus1 as a proxy for the number of Fus1 molecules (rather than the width of the Fus1 signal), which seems to be the more straight-forward analysis?

      The aim of the measurement was to test whether Myo52 and Fus1 activity help focalize the formin at the fusion site, not whether these are required for localization in this region. This is why we measure the lateral spread of the signal (its width) rather than its fluorescence intensity. We know from previous work that Fus1 localizes to the shmoo tip independently of myosin V (Dudin et al, JCB 2015), and we also show this in Figure 6. However, the precise distribution of Fus1 is wider in the absence of the myosins.

      We can and will measure intensities to test whether there is also a quantitative difference in the number of molecules at the shmoo tip.

      (8) Figure 7, the authors should note (and perhaps discuss) any evidence as to whether activation of Fus1 to facilitate actin assembly depends upon Fus1 dissociating from Myo52 or whether Fus1 can be activated while still associated with Myo52, as both circumstances are included in the figure.

      This is an interesting point. We have no experimental evidence for or against Fus1 dissociating from Myo52 to assemble actin. However, it is known that formins rotate along the actin filament double helix as they assemble it, a movement that seems poorly compatible with processive transport by myosin V. In Figure 7, we do not particularly want to imply that Myo52 associates with Fus1 linked or not with an actin filament. The figure serves to illustrate the focusing mechanism of myosin V transporting a formin, which is more evident when we draw the formin attached to a filament end. We have now added a sentence in the figure legend to clarify this point: “Note that it is unknown whether Myo52 transports Fus1 associated or not with an actin filament.”

      (9) Figure 7, the color of secretory vesicles should be the same in A and B.

      This is now corrected.

      Reviewer #2 (Significance (Required)):

      This is an impactful and high quality manuscript that describes an elegant experimental strategy with important insights determined. The experimental imaging strategy (and analysis), as well as the insight into the pombe mating fusion focus and its comparison to other cytoskeletal compaction events will be of broad scientific interest.

      We thank the reviewer for their appreciation of our work.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      Summary:

      Fission yeast cell-cell fusion during mating is mediated by an actin-based structure called the 'fusion focus', which orchestrates actin polymerization by the mating-specific formin, Fus1, to direct polarized secretion towards the mating site. In the current study, Thomas and colleagues quantitatively map the spatial distribution of proteins mediating cell-cell fusion using a three-color fluorescence imaging methodology in the fission yeast Schizosaccharomyces pombe. Using Myo52 (Type V myosin) as a fluorescence reference point, the authors discover that proteins known to localize to the fusion focus have distinct spatial distributions and accumulation profiles at the mating site. Myo52 and Fus1 form a complex in vivo detected by co-immunoprecipitation and each contribute to directing secretory vesicles to the fusion focus. Previous work from this group has shown that the intrinsically disordered region (IDR) of Fus1 plays a critical role in forming the fusion focus. Here, the authors swap out the IDR of fission yeast Fus1 for the IDR of an unrelated mammalian protein, coincidentally called 'fused in sarcoma' (FUS). They express the Fus1∆IDR-FUSLC-27R chimera in mitotically dividing fission yeast cells, where Fus1 is not normally expressed, and discover that the Fus1∆IDR-FUSLC-27R chimera can travel with Myo52 on actively polymerizing actin cables. Additionally, they show that acute loss of Myo52 or Fus1 function, using Auxin-Inducible Degradation (AID) tags and point mutations, impair the normal compaction of the fusion focus, suggesting that direct interaction and coordination of Fus1 and Myo52 helps shape this structure.

      Major Comments:

      (1) In the Results section for Figure 2, the authors claim that actin filaments become shorter and more cross-linked as they move away from the fusion site during mating, and suggest that this may be due to the presence of Myo51. However, the evidence to support this claim is not made clear. Is it supported by high-resolution electron microscopy of the actin filaments, or some other results? This needs to be clarified.

      Sorry if our text was unclear. The basis for the claim that actin filaments become shorter comes from our observation that the average position of tropomyosin and Myo51, both of which decorate actin filaments, is progressively closer to both Fus1 and the plasma membrane. Thus, the actin structure protrudes less into the cytosol as fusion progresses. The basis for claiming that Myo51 promotes actin filament crosslinking comes mainly from previously published papers, which had shown that 1) Myo51 forms complexes with the Rng8 and Rng9 proteins (Wang et al, JCB 2014), and 2) the Myo51-Rng8/9 complex not only binds actin through the Myo51 head domain but also binds tropomyosin-decorated actin through the Rng8/9 moiety (Tang et al, JCB 2016; reference 27 in our manuscript). We had also previously shown that these proteins are necessary for compaction of the fusion focus (Dudin et al, PLoS Genetics 2017; reference 28 in our manuscript). Except for measuring the width of Fus1 distribution in myo51∆ mutants, which confirms previous findings, we did not re-investigate the function of Myo51 here.

      We have now re-written this paragraph to present the previous data more clearly: “The distal localization of Myo51 was mirrored by that of tropomyosin Cdc8, which decorates linear actin filaments (Figure 2B) (Hatano et al, 2022). The distal position of the bulk of Myo51-decorated actin filaments was confirmed using Airyscan super-resolution microscopy (Figure 2B, right). Thus, the average position of actin filaments and decreasing distance to Myo52 indicates they initially extend a few hundred nanometers into the cytosol and become progressively shorter as fusion proceeds. Previous work had shown that Myo51 cross-links and slides Cdc8-decorated actin filaments relative to each other (Tang et al, 2016) and that both proteins contribute to compaction of the fusion focus in the lateral dimension along the cell-cell contact area (perpendicular to the fusion axis) (Dudin et al, 2017). We confirmed this function by measuring the lateral distribution of Fus1 along the cell-cell contact area (perpendicular to the fusion axis), which was indeed wider in myo51∆ than WT cells (see below Figure 6A-B).”

      (2) In Figure 4, the authors comment that disrupting Fus1 results in more disperse Myo52 spatial distribution at the fusion focus, raising the possibility that Myo52 normally becomes focused by moving on the actin filaments assembled by Fus1. This can be tested by asking whether latrunculin treatment phenocopies the 'more dispersed' Myo52 localization seen in fus1∆ cells? If Myo52 is focused instead by its direct interaction with Fus1, the latrunculin treatment should not cause the same phenotype.

      This is in principle a good idea, though it is technically challenging because pharmacological treatment of cell pairs in fusion is difficult to do without disturbing pheromone gradients which are critical throughout the fusion process (see Dudin et al, Genes and Dev 2016). We will try the experiment but are unsure about the likelihood of technical success.

      We note, however, that a similar experiment was done previously on Fus1 overexpressed in mitotic cells (Billault-Chaumartin et al, Curr Biol 2022; Fig 1D). There, Fus1 also forms a focus, and latrunculin A treatment leads to Myo52 dispersion while the Fus1 focus is retained, which is in line with our proposal that Myo52 becomes focused by moving on Fus1-assembled actin filaments. Similarly, we showed in Figure 5B that latrunculin A treatment of mitotic cells expressing Fus1∆IDR-FUSLC-27R also results in dispersion of Myo52, but not of Fus1.

      (3) The Fus1∆IDR-FUSLC-27R chimera used in Figure 5 is an interesting construct to examine actin-based transport of formins in cells. I was curious if the authors could provide the rates of movement for Myo52 and for Fus1∆IDR-FUSLC-27R, both before and after acute depletion of Myo52. It would be interesting to see if loss of Myo52 alters the rate of movement, or instead the movement stems from formin-mediated actin polymerization.

      We will measure these rates.

      (4) Also, Myo52 is known to interact with the mitotic formin For3. Does For3 colocalize with Myo52 and Fus1∆IDR-FUSLC-27R along actin cables?

      This is an interesting question for which we do not have an answer. For technical reasons, we do not have the tools to co-image For3 with Fus1∆IDR-FUSLC-27R because both are tagged with GFP. We feel that this question goes beyond the scope of this paper.

      (5) If Fus1∆IDR-FUSLC-27R is active, does having ectopic formin activity in mitotic cells affect actin cable architecture? This could be assessed by comparing phalloidin staining for wildtype and Fus1∆IDR-FUSLC-27R cells.

      We are not sure what the purpose of this experiment is, or how informative it would be. If it is to evaluate whether Fus1∆IDR-FUSLC-27R is active, our current data already demonstrate this. Indeed, Fus1∆IDR-FUSLC-27R recruits Myo52 in an F-actin and FH2 domain-dependent manner (shown in Figure 5B and 5G), which demonstrates that the Fus1∆IDR-FUSLC-27R FH2 domain is active. Even though Fus1∆IDR-FUSLC-27R assembles actin, we predict that its effect on general actin organization will be weak. Indeed, it is expressed under the endogenous fus1 promoter, leading to very low expression levels during mitotic growth, such that only a subset of cells exhibit a Fus1 focus. Furthermore, most of these Fus1 foci are at or close to cell poles, where linear actin cables are assembled by For3, such that they may not have a strong disturbing effect. Because analysis of actin cable organization by phalloidin staining is difficult (due to the more strongly staining actin patches), cells with a clear change in organization are predicted to be rare in the population, and the gain in knowledge would not be transformative, we are not keen to do this experiment.

      Minor Comments:

      Prior studies are referenced appropriately. Text and figures are clear and accurate. My only suggestion would be Figure 1E-H could be moved to the supplemental material, due to their extremely technical nature. I believe this would help the broad audience focus on the experimental design mapped out in Figure 1A-D.

      We are relatively neutral about this. If this suggestion is supported by the Editor, we can move these panels to supplement.

      Reviewer #3 (Significance (Required)):

      Significance: This study provides an improved imaging method for detecting the spatial distributions of proteins below 100 nm, providing new insights about how a relatively small cellular structure is organized. The use of three-color cell imaging to accurately measure accumulation rates of molecular components of the fusion focus provides new insight into the development of this structure and its roles in mating. This method could be applied to other multi-protein structures found in different cell types. This work uses rigorous genetic tools such as knockout, knockdown and point mutants to dissect the roles of the formin Fus1 and Type V myosin Myo52 in creating a proper fusion focus. The study could be improved by biochemical assays to test whether Myo52 and Fus1 directly interact, since the interaction is only shown by co-immunoprecipitation from extracts, which may reflect an indirect interaction.

      Indeed, future studies should dissect the Fus1-Myo52 interaction, to determine whether it is direct and identify mutants that impair it.

      I believe this work advances the cell-mating field by providing others with a spatial and temporal map of conserved factors arriving to the mating site. Additionally, they identified a way to study a mating specific protein in mitotically dividing cells, offering future questions to address.

      This study should appeal to a range of basic scientists interested in cell biology, the cytoskeleton, and model organisms. The three-colored quantitative imaging could be applied to defining the architecture of many other cellular structures in different systems. Myosin and actin scientists will be interested in how this work expands the interplay of these two fields.

      I am a cell biologist with expertise in live cell imaging, genetics and biochemistry.

      We thank the reviewer for their appreciation of our work.

    1. Reviewer #1 (Public review):

      Summary:

      Parise presents another instantiation of the Multisensory Correlation Detector model that can now accept stimulus-level inputs. This is a valuable development as it removes researcher involvement in the characterization/labeling of features and allows analysis of complex stimuli with a high degree of nuance that was previously unconsidered (i.e. spatial/spectral distributions across time). The author demonstrates the power of the model by fitting data from dozens of previous experiments including multiple species, tasks, behavioral modality, and pharmacological interventions.

      Strengths:

      One of the model's biggest strengths, in my opinion, is its ability to extract complex spatiotemporal co-relationships from multisensory stimuli. These relationships have typically been manually computed or assigned based on stimulus condition and often distilled to a single dimension or even a single number (e.g., "-50 ms asynchrony"). Thus, many models of multisensory integration depend heavily on human preprocessing of stimuli, and these models miss out on complex dynamics of stimuli; the lead modality distribution apparent in Figures 3b and c is provocative. I can imagine the model revealing interesting characteristics of the facial distribution of correlation during continuous audiovisual speech that have up to this point been largely described as "present" and almost solely focused on the lip area.

      Another aspect that makes the MCD stand out among other models is the biological inspiration and generalizability across domains. The model was developed to describe a separate process - motion perception - and in a much simpler organism - Drosophila. It could then describe a very basic neural computation that has been conserved across phylogeny (which is further demonstrated in the ability to predict rat, primate, and human data) and brain area. This aspect makes the model likely able to account for much more than what has already been demonstrated with only a few tweaks akin to the modifications described in this and previous articles from Parise.

      What allows this potential is that, as Parise and colleagues have demonstrated in those papers since our (re)introduction of the model in 2016, the MCD model is modular - both in its ability to interface with different inputs/outputs and its ability to chain MCD units in a way that can analyze spatial, spectral, or any other arbitrary dimension of a stimulus. This fact leaves wide open the possibilities for types of data, stimuli, and tasks a simplistic, neurally inspired model can account for.

      And so it's unsurprising (but impressive!) that Parise has demonstrated the model's ability here to account for such a wide range of empirical data from numerous tasks (synchrony/temporal order judgement, localization, detection, etc.) and behavior types (manual/saccade responses, gaze, etc.) using only the stimulus and a few free parameters. This ability is another of the model's main strengths that I think deserves some emphasis: it represents a kind of validation of those experiments - especially in the context of cross-experiment predictions.

      Finally, what is perhaps most impressive to me is that the MCD (and the accompanying decision model) does all this with very few (sometimes zero) free parameters. This highlights the utility of the model and the plausibility of its underlying architecture, but also helps to prevent extreme overfitting if fit correctly.

      Weaknesses:

      The model boasts incredible versatility across tasks and stimulus configurations, and its overall scope is to understand how and what relevant sensory information is extracted from a stimulus. We still need to exercise care when interpreting its parameters, especially considering the broader context of top-down control of perception and that some multisensory mappings may not be derivable purely from stimulus statistics (e.g., the complementary nature of some phonemes/visemes).

    2. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Parise presents another instantiation of the Multisensory Correlation Detector model that can now accept stimulus-level inputs. This is a valuable development as it removes researcher involvement in the characterization/labeling of features and allows analysis of complex stimuli with a high degree of nuance that was previously unconsidered (i.e., spatial/spectral distributions across time). The author demonstrates the power of the model by fitting data from dozens of previous experiments, including multiple species, tasks, behavioral modalities, and pharmacological interventions.

      Thanks for the kind words!

      Strengths:

      One of the model's biggest strengths, in my opinion, is its ability to extract complex spatiotemporal co-relationships from multisensory stimuli. These relationships have typically been manually computed or assigned based on stimulus condition and often distilled to a single dimension or even a single number (e.g., "-50 ms asynchrony"). Thus, many models of multisensory integration depend heavily on human preprocessing of stimuli, and these models miss out on complex dynamics of stimuli; the lead modality distribution apparent in Figures 3b and c is provocative. I can imagine the model revealing interesting characteristics of the facial distribution of correlation during continuous audiovisual speech that have up to this point been largely described as "present" and almost solely focused on the lip area.

      Another aspect that makes the MCD stand out among other models is the biological inspiration and generalizability across domains. The model was developed to describe a separate process - motion perception - and in a much simpler organism - Drosophila. It could then describe a very basic neural computation that has been conserved across phylogeny (which is further demonstrated in the ability to predict rat, primate, and human data) and brain area. This aspect makes the model likely able to account for much more than what has already been demonstrated with only a few tweaks akin to the modifications described in this and previous articles from Parise.

      What allows this potential is that, as Parise and colleagues have demonstrated in those papers since our (re)introduction of the model in 2016, the MCD model is modular - both in its ability to interface with different inputs/outputs and its ability to chain MCD units in a way that can analyze spatial, spectral, or any other arbitrary dimension of a stimulus. This fact leaves wide open the possibilities for types of data, stimuli, and tasks a simplistic, neurally inspired model can account for.

      And so it's unsurprising (but impressive!) that Parise has demonstrated the model's ability here to account for such a wide range of empirical data from numerous tasks (synchrony/temporal order judgement, localization, detection, etc.) and behavior types (manual/saccade responses, gaze, etc.) using only the stimulus and a few free parameters. This ability is another of the model's main strengths that I think deserves some emphasis: it represents a kind of validation of those experiments, especially in the context of cross-experiment predictions (but see some criticism of that below).

      Finally, what is perhaps most impressive to me is that the MCD (and the accompanying decision model) does all this with very few (sometimes zero) free parameters. This highlights the utility of the model and the plausibility of its underlying architecture, but also helps to prevent extreme overfitting if fit correctly (but see a related concern below).

      We sincerely thank the reviewer for their thoughtful and generous comments. We are especially pleased that the core strengths of the model—its stimulus-computable architecture, biological grounding, modularity, and cross-domain applicability—were clearly recognized. As the reviewer rightly notes, removing researcher-defined abstractions and working directly from naturalistic stimuli opens the door to uncovering previously overlooked dynamics in complex multisensory signals, such as the spatial and temporal richness of audiovisual speech.

      We also appreciate the recognition of the model’s origins in a simple organism and its generalization across species and behaviors. This phylogenetic continuity reinforces our view that the MCD captures a fundamental computation with wide-ranging implications. Finally, we are grateful for the reviewer’s emphasis on the model’s predictive power across tasks and datasets with few or no free parameters—a property we see as key to both its parsimony and explanatory utility.

      We have highlighted these points more explicitly in the revised manuscript, and we thank the reviewer for their generous and insightful endorsement of the work.

      Weaknesses:

      There is an insufficient level of detail in the methods about model fitting. As a result, it's unclear what data the models were fitted and validated on. Were models fit individually or on average group data? Each condition separately? Is the model predictive of unseen data? Was the model cross-validated? Relatedly, the manuscript mentions a randomization test, but the shuffled data produces model responses that are still highly correlated to behavior despite shuffling. Could it be that any stimulus that varies in AV onset asynchrony can produce a psychometric curve that matches any other task with asynchrony judgements baked into the task? Does this mean all SJ or TOJ tasks produce correlated psychometric curves? Or more generally, is Pearson's correlation insensitive to subtle changes here, considering psychometric curves are typically sigmoidal? Curves can be non-overlapping and still highly correlated if one is, for example, scaled differently. Would an error term such as mean-squared or root mean-squared error be more sensitive to subtle changes in psychometric curves? Alternatively, perhaps if the models aren't cross-validated, the high correlation values are due to overfitting?

      The reviewer is right: the current version of the manuscript only provides limited information about parameter fitting. In the revised version of the manuscript, we included a parameter estimation and generalizability section that includes all information requested by the reviewer.

      To test whether using the MSE instead of Pearson correlation led to a similar set of estimated parameter values, we repeated the fitting using the MSE. The parameters estimated with this method (TauV, TauA, TauBim) closely followed those estimated using Pearson correlation (TauV, TauA, TauBim). Given the similarity of these results, we have chosen not to include further figures; however, this analysis is now included in the new section (pages 23-24).
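
      The reviewer's point about the two metrics can be illustrated with a minimal sketch. It is not the fitting code used in the manuscript: the logistic curve, its 80 ms slope constant, and the compression factors are all invented for illustration. It shows only that Pearson correlation is unchanged by rescaling a psychometric curve, whereas the MSE is not.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mse(x, y):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def sigmoid(lag_ms, slope_ms=80.0):
    """Hypothetical logistic psychometric curve (slope_ms is an assumption)."""
    return 1.0 / (1.0 + math.exp(-lag_ms / slope_ms))

lags = list(range(-400, 401, 50))               # audiovisual lags in ms
observed = [sigmoid(lag) for lag in lags]       # full-range curve
compressed = [0.2 + 0.6 * p for p in observed]  # same shape, compressed at the tails

r = pearson(observed, compressed)   # affine rescaling leaves r at (numerically) 1
err = mse(observed, compressed)     # but the curves do not overlap, so MSE > 0
```

      Because the two metrics respond differently to scaling, agreement between parameters fitted under each of them is informative about robustness.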

      Regarding the permutation test, it is expected that different stimuli produce analogous psychometric functions: after all, all studies relied on stimuli containing identical manipulation of lags. As a result, MCD population responses tend to be similar across experiments. Therefore, it is not a surprise that the permuted distribution of MCD-data correlation in Supplementary Figure 1K has a mean as high as 0.97. However, what is important is to demonstrate that the non-permuted dataset has an even higher goodness of fit. Supplementary Figure 1K demonstrates that none of the permuted stimuli could outperform the non-permuted dataset; the mean of the non-permuted distribution is 4.7 (standard deviations) above the mean of the already high permuted distribution.
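
      The logic of this comparison can be sketched as follows. The correlation values are placeholders chosen for illustration, not the values behind Supplementary Figure 1K, and the function name is hypothetical. The sketch expresses the observed goodness of fit in standard deviations of the permuted distribution and computes a conservative permutation p-value.

```python
import statistics

def permutation_summary(observed, permuted):
    """Return (z, p): distance of `observed` from the permuted mean in SD units,
    and the one-sided permutation p-value with the usual +1 correction."""
    mu = statistics.mean(permuted)
    sd = statistics.stdev(permuted)
    z = (observed - mu) / sd
    n_at_least = sum(1 for value in permuted if value >= observed)
    p = (n_at_least + 1) / (len(permuted) + 1)
    return z, p

# Placeholder correlations: permuted stimulus-data pairings still fit well,
# because all experiments manipulate lag in a similar way.
permuted_r = [0.968, 0.970, 0.972, 0.969, 0.971, 0.970, 0.973, 0.967]
observed_r = 0.985  # hypothetical fit for the correctly paired stimuli

z, p = permutation_summary(observed_r, permuted_r)
```

      A high permuted mean is therefore compatible with a meaningful test, provided the correctly paired data sit clearly in the upper tail.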

      We believe the new section, along with the present response, fully addresses the legitimate concerns of the reviewer.

      While the model boasts incredible versatility across tasks and stimulus configurations, fitting behavioral data well doesn't mean we've captured the underlying neural processes, and thus, we need to be careful when interpreting results. For example, the model produces temporal parameters fitting rat behavior that are 4x faster than when fitting human data. This difference in slope and a difference at the tails were interpreted as differences in perceptual sensitivity related to general processing speeds of the rat, presumably related to brain/body size differences. While rats no doubt have these differences in neural processing speed/integration windows, it seems reasonable that a lot of the differences in human and rat psychometric functions could be explained by the (over)training and motivation of rats to perform on every trial for a reward - increasing attention/sensitivity (slope) - and a tendency to make mistakes (compression evident at the tails). Was there an attempt to fit these data with a lapse parameter built into the decisional model as was done in Equation 21? Likewise, the fitted parameters for the pharmacological manipulations during the SJ task indicated differences in the decisional (but not the perceptual) process and the article makes the claim that "all pharmacologically-induced changes in audiovisual time perception" can be attributed to decisional processes "with no need to postulate changes in low-level temporal processing." However, those papers discuss actual sensory effects of pharmacological manipulation, with one specifically reporting changes to response timing. Moreover, and again contrary to the conclusions drawn from model fits to those data, both papers also report a change in psychometric slope/JND in the TOJ task after pharmacological manipulation, which would presumably be reflected in changes to the perceptual (but not the decisional) parameters.

      Fitting or predicting behaviour does not in itself demonstrate that a model captures the underlying neural computations—though it may offer valuable constraints and insights. In line with this, we were careful not to extrapolate the implications of our simulations to specific neural mechanisms.

      Temporal sensitivity is, by definition, a behavioural metric, and—as the reviewer correctly notes—its estimation may reflect a range of contributing factors beyond low-level sensory processing, including attention, motivation, and lapse rates (i.e., stimulus-independent errors). In Equation 21, we introduced a lapse parameter specifically to account for such effects in the context of monkey eye-tracking data. For the rat datasets, however, the inclusion of a lapse term was not required to achieve a close fit to the psychometric data (ρ = 0.981). While it is likely that adding a lapse component would yield a marginally better fit, the absence of single-trial data prevents us from applying model comparison criteria such as AIC or BIC to justify the additional parameter. In light of this, and to avoid unnecessary model complexity, we opted not to include a lapse term in the rat simulations.
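
      To make the dependence on trial counts concrete, here is a minimal sketch of a psychometric function with a lapse term and of the AIC computation. The cumulative Gaussian form is a generic textbook choice and is not claimed to match Equation 21; all counts and parameter values are invented. The binomial log-likelihood needs the number of trials per condition, which is why proportions alone do not suffice.

```python
import math

def psychometric(x, mu, sigma, lapse=0.0):
    """Cumulative Gaussian with a symmetric lapse rate (generic form, assumed)."""
    phi = 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    return lapse / 2.0 + (1.0 - lapse) * phi

def binomial_log_likelihood(xs, n_yes, n_trials, **params):
    """Log-likelihood of response counts, omitting the constant binomial coefficient."""
    total = 0.0
    for x, k, n in zip(xs, n_yes, n_trials):
        p = min(max(psychometric(x, **params), 1e-9), 1.0 - 1e-9)
        total += k * math.log(p) + (n - k) * math.log(1.0 - p)
    return total

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical data: lags in ms, "yes" counts, and trials per lag.
xs = [-200, -100, 0, 100, 200]
n_yes = [3, 8, 20, 33, 37]
n_trials = [40, 40, 40, 40, 40]

ll_plain = binomial_log_likelihood(xs, n_yes, n_trials, mu=0.0, sigma=100.0)
ll_lapse = binomial_log_likelihood(xs, n_yes, n_trials, mu=0.0, sigma=90.0, lapse=0.1)
aic_plain = aic(ll_plain, n_params=2)   # mu, sigma
aic_lapse = aic(ll_lapse, n_params=3)   # mu, sigma, lapse
```

      In practice the parameters would be optimised for each model before the two AIC values are compared; the fixed values above only show where the trial counts enter.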

      With respect to the pharmacological manipulation data, we acknowledge the reviewer’s point that observed changes in slope and bias could plausibly arise from alterations at either the sensory or decisional level—or both. In our model, low-level sensory processing is instantiated by the MCD architecture, which outputs the MCDcorr and MCDlag signals that are then scaled and integrated during decision-making. Importantly, this scaling operation influences the slope of the resulting psychometric functions, such that changes in slope can arise even in the absence of any change to the MCD’s temporal filters. In our simulations, the temporal constants of the MCD units were fixed to the values estimated from the non-pharmacological condition (see parameter estimation section above), and only the decision-related parameters were allowed to vary. From this modelling perspective, the behavioural effects observed in the pharmacological datasets can be explained entirely by changes at the decisional level. However, we do not claim that such an explanation excludes the possibility of genuine sensory-level changes. Rather, we assert that our model can account for the observed data without requiring modifications to early temporal tuning.
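
      A toy example may help show how a purely decisional gain can change the psychometric slope while the sensory stage stays fixed. It does not implement the MCD: `sensory_signal` is a stand-in with an assumed time constant, and the logistic read-out with `gain` and `bias` is a hypothetical decision stage.

```python
import math

def sensory_signal(lag_ms, tau_ms=100.0):
    """Stand-in for a fixed sensory output; tau_ms is never changed below."""
    return math.tanh(lag_ms / tau_ms)

def p_response(lag_ms, gain, bias=0.0):
    """Hypothetical decision stage: logistic read-out of the scaled sensory signal."""
    return 1.0 / (1.0 + math.exp(-(gain * sensory_signal(lag_ms) + bias)))

def slope_at_zero(gain, step_ms=1.0):
    """Central-difference estimate of the psychometric slope at zero lag."""
    return (p_response(step_ms, gain) - p_response(-step_ms, gain)) / (2.0 * step_ms)

baseline_slope = slope_at_zero(gain=2.0)
reduced_slope = slope_at_zero(gain=1.0)   # shallower curve, identical sensory stage
```

      In this sketch, halving the gain roughly halves the slope at zero lag even though `tau_ms` is untouched, so with simple stimuli a slope change alone cannot identify which stage was affected.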

      To rigorously distinguish sensory from decisional effects, future experiments will need to employ stimuli with richer temporal structure—e.g., temporally modulated sequences of clicks and flashes that vary in frequency, phase, rhythm, or regularity (see Fujisaki & Nishida, 2007; Denison et al., 2012; Parise & Ernst, 2016, 2025; Locke & Landy, 2017; Nidiffer et al., 2018). Such stimuli engage the MCD in a more stimulus-dependent manner, enabling a clearer separation between early sensory encoding and later decision-making processes. Unfortunately, the current rat datasets—based exclusively on single click-flash pairings—lack the complexity needed for such disambiguation. As a result, while our simulations suggest that the observed pharmacologically induced effects can be attributed to changes in decision-level parameters, they do not rule out concurrent sensory-level changes.

      In summary, our results indicate that changes in the temporal tuning of MCD units are not necessary to reproduce the observed pharmacological effects on audiovisual timing behaviour. However, we do not assert that such changes are absent or unnecessary in principle. Disentangling sensory and decisional contributions will ultimately require richer datasets and experimental paradigms designed specifically for this purpose. We have now modified the results section (page 6) and the discussion (page 11) to clarify these points.

      The case for the utility of a stimulus-computable model is convincing (as I mentioned above), but its framing as mission-critical for understanding multisensory perception is overstated, I think. The line for what is "stimulus computable" is arbitrary and doesn't seem to be followed in the paper. A strict definition might realistically require inputs to be, e.g., the patterns of light and sound waves available to our eyes and ears, while an even more strict definition might (unrealistically) require those stimuli to be physically present and transduced by the model. A reasonable looser definition might allow an abstract and low-dimensional representation of the stimulus, such as the stimulus envelope (which was used in the paper), to be an input. Ultimately, some preprocessing of a stimulus does not necessarily confound interpretations about (multi)sensory perception. And on the flip side, the stimulus-computable aspect doesn't necessarily give the model supreme insight into perception. For example, the MCD model was "confused" by the stimuli used in our 2018 paper (Nidiffer et al., 2018; Parise & Ernst, 2025). In each of our stimuli, the onset and offset drove strong AV temporal correlations across all stimulus conditions (including catch trials), but were irrelevant to participants performing an amplitude modulation detection task. The to-be-detected amplitude modulations, set at individual thresholds, were not a salient aspect of the physical stimulus, and thus only marginally affected stimulus correlations. The model was, of course, able to fit our data by "ignoring" the on/offsets (i.e., requiring human intervention), again highlighting that the model is tapping into a very basic and ubiquitous computational principle of (multi)sensory perception. But it does reveal a limitation of such a stimulus-computable model: that it is (so far) strictly bottom-up.

      We appreciate the reviewer’s thoughtful engagement with the concept of stimulus computability. We agree that the term requires careful definition and should not be taken as a guarantee of perceptual insight or neural plausibility. In our work, we define a model as “stimulus-computable” if all its inputs are derived directly from the stimulus, rather than from experimenter-defined summary descriptors such as temporal lag, spatial disparity, or cue reliability. In the context of multisensory integration, this implies that a model must account not only for how cues are combined, but also for how those cues are extracted from raw inputs—such as audio waveforms and visual contrast sequences.

      This distinction is central to our modelling philosophy. While ideal observer models often specify how information should be combined once identified, they typically do not address the upstream question of how this information is extracted from sensory input. In that sense, models that are not stimulus-computable leave out a key part of the perceptual pipeline. We do not present stimulus computability as a marker of theoretical superiority, but rather as a modelling constraint that is necessary if one’s aim is to explain how structured sensory input gives rise to perception. This is a view that is also explicitly acknowledged and supported by Reviewer 2.

      Framed in Marr’s (1982) terms, non–stimulus-computable models tend to operate at the computational level, defining what the system is doing (e.g., computing a maximum likelihood estimate), whereas stimulus-computable models aim to function at the algorithmic level, specifying how the relevant representations and operations might be implemented. When appropriately constrained by biological plausibility, such models may also inform hypotheses at the implementational level, pointing to potential neural substrates that could instantiate the computation.

      Regarding the reviewer’s example illustrating a limitation of the MCD model, we respectfully note that the account appears to be based on a misreading of our prior work. In Parise & Ernst (2025), where we simulated the stimuli from Nidiffer et al. (2018), the MCD model reproduced participants’ behavioural data without any human intervention or adjustment. The model was applied in a fully bottom-up, stimulus-driven manner, and its output aligned with observer responses as-is. We suspect the confusion may stem from analyses shown in Figure 6 - Supplement Figure 5 of Parise & Ernst (2025), where we investigated the lack of a frequency-doubling effect in the Nidiffer et al. data. However, those analyses were based solely on the Pearson correlation between auditory and visual stimulus envelopes and did not involve the MCD model. No manual exclusion of onset/offset events was applied, nor was the MCD used in those particular figures. We also note that Parise & Ernst (2025) is a separate, already published study and is not the manuscript currently under review. 

      In summary, while we fully agree that stimulus computability does not resolve all the complexities of multisensory perception (see comments below about speech), we maintain that it provides a valuable modelling constraint—one that enables robust, generalisable predictions when appropriately scoped. 

      The manuscript rightly chooses to focus a lot of the work on speech, fitting the MCD model to predict behavioral responses to speech. The range of findings from AV speech experiments that the MCD can account for is very convincing. Given the provided context that speech is "often claimed to be processed via dedicated mechanisms in the brain," a statement claiming a "first end-to-end account of multisensory perception," and findings that the MCD model can account for speech behaviors, it seems the reader is meant to infer that energetic correlation detection is a complete account of speech perception. I think this conclusion misses some facets of AV speech perception, such as integration of higher-order, non-redundant/correlated speech features (Campbell, 2008) and also the existence of top-down and predictive processing that aren't (yet!) explained by MCD. For example, one important benefit of AV speech is interactions on linguistic processes - how complementary sensitivity to articulatory features in the auditory and visual systems (Summerfield, 1987) allow constraint of linguistic processes (Peelle & Sommers, 2015; Tye-Murray et al., 2007).

      We thank the reviewer for their thoughtful comments, and especially for the kind words describing the range of findings from our AV speech simulations as “very convincing.”

      We would like to clarify that it is not our view that speech perception can be reduced to energetic correlation detection. While the MCD model captures low- to mid-level temporal dependencies between auditory and visual signals, we fully agree that a complete account of audiovisual speech perception must also include higher-order processes—including linguistic mechanisms and top-down predictions. These are critical components of AV speech comprehension, and lie beyond the scope of the current model.

      Our use of the term “end-to-end” is intended in a narrow operational sense: the model transforms raw audiovisual input (i.e., audio waveforms and video frames) directly into behavioural output (i.e., button press responses), without reliance on abstracted stimulus parameters such as lag, disparity or reliability. It is in this specific technical sense that the MCD offers an end-to-end model. We have revised the manuscript to clarify this usage to avoid any misunderstanding.

      In light of the reviewer’s valuable point, we have now edited the Discussion to acknowledge the importance of linguistic processes (page 13) and to clarify what we mean by end-to-end account (page 11). We agree that future work will need to explore how stimulus-computable models such as the MCD can be integrated with broader frameworks of linguistic and predictive processing (e.g., Summerfield, 1987; Campbell, 2008; Peelle & Sommers, 2015; Tye-Murray et al., 2007).

      References

      Campbell, R. (2008). The processing of audio-visual speech: empirical and neural bases. Philosophical Transactions of the Royal Society B: Biological Sciences, 363(1493), 1001-1010. https://doi.org/10.1098/rstb.2007.2155

      Nidiffer, A. R., Diederich, A., Ramachandran, R., & Wallace, M. T. (2018). Multisensory perception reflects individual differences in processing temporal correlations. Scientific Reports, 8(1), 1-15. https://doi.org/10.1038/s41598-018-32673-y

      Parise, C. V., & Ernst, M. O. (2025). Multisensory integration operates on correlated input from unimodal transient channels. eLife, 12. https://doi.org/10.7554/ELIFE.90841

      Peelle, J. E., & Sommers, M. S. (2015). Prediction and constraint in audiovisual speech perception. Cortex, 68, 169-181. https://doi.org/10.1016/j.cortex.2015.03.006

      Summerfield, Q. (1987). Some preliminaries to a comprehensive account of audio-visual speech perception. In B. Dodd & R. Campbell (Eds.), Hearing by Eye: The Psychology of Lip-Reading (pp. 3-51). Lawrence Erlbaum Associates.

      Tye-Murray, N., Sommers, M., & Spehar, B. (2007). Auditory and Visual Lexical Neighborhoods in Audiovisual Speech Perception. Trends in Amplification, 11(4), 233-241. https://doi.org/10.1177/1084713807307409

      Reviewer #2 (Public review):

      Summary:

      Building on previous models of multisensory integration (including their earlier correlation-detection framework used for non-spatial signals), the authors introduce a population-level Multisensory Correlation Detector (MCD) that processes raw auditory and visual data. Crucially, it does not rely on abstracted parameters, as is common in normative Bayesian models, but rather works directly on the stimulus itself (i.e., individual pixels and audio samples). By systematically testing the model against a range of experiments spanning human, monkey, and rat data, the authors show that their MCD population approach robustly predicts perception and behavior across species with a relatively small (0-4) number of free parameters.

      Strengths:

      (1) Unlike prior Bayesian models that used simplified or parameterized inputs, the model here is explicitly computable from full natural stimuli. This resolves a key gap in understanding how the brain might extract "time offsets" or "disparities" from continuously changing audio-visual streams.

      (2) The same population MCD architecture captures a remarkable range of multisensory phenomena, from classical illusions (McGurk, ventriloquism) and synchrony judgments, to attentional/gaze behavior driven by audio-visual salience. This generality strongly supports the idea that a single low-level computation (correlation detection) can underlie many distinct multisensory effects.

      (3) By tuning model parameters to different temporal rhythms (e.g., faster in rodents, slower in humans), the MCD explains cross-species perceptual data without reconfiguring the underlying architecture.

      We thank the reviewer for their positive evaluation of the manuscript, and particularly for highlighting the significance of the model's stimulus-computable architecture and its broad applicability across species and paradigms. Please find our responses to the individual points below.

      Weaknesses:

      (1) The authors show how a correlation-based model can account for the various multisensory integration effects observed in previous studies. However, a comparison of how the two accounts differ would shed light on the correlation model being an implementation of the Bayesian computations (different levels in Marr's hierarchy) or making testable predictions that can distinguish between the two frameworks. For example, how uncertainty in the cue combined estimate is also the harmonic mean of the unimodal uncertainties is a prediction from the Bayesian model. So, how the MCD framework predicts this reduced uncertainty could be one potential difference (or similarity) to the Bayesian model.

      We fully agree with the reviewer that a comparison between the correlation-based MCD model and Bayesian accounts is valuable—particularly for clarifying how the two frameworks differ conceptually and where they may converge.

      As noted in the revised manuscript, the key distinction lies in the level of analysis described by Marr (1982). Bayesian models operate at the computational level, describing what the system is aiming to compute (e.g., optimal cue integration). In contrast, the MCD functions at the algorithmic level, offering a biologically plausible mechanism for how such integration might emerge from stimulus-driven representations.

      In this context, the MCD provides a concrete, stimulus-grounded account of how perceptual estimates might be constructed—potentially implementing computations with Bayesian-like characteristics (e.g., reduced uncertainty, cue weighting). Thus, the two models are not mutually exclusive but can be seen as complementary: the MCD may offer an algorithmic instantiation of computations that, at the abstract level, resemble Bayesian inference.

      We have now updated the manuscript to explicitly highlight this relationship (pages 2 and 11). In the revised manuscript, we also included a new figure (Figure 5) and movie (Supplementary Movie 3), to show how the present approach extends previous Bayesian models for the case of cue integration (i.e., the ventriloquist effect).
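
      As a point of reference for the "reduced uncertainty" and "cue weighting" properties discussed above, the textbook maximum-likelihood result for two independent Gaussian cues can be written out as follows. This is a standard formulation, not an equation taken from the manuscript; the symbols (unimodal estimates \hat{s}_A, \hat{s}_V with variances \sigma_A^2, \sigma_V^2) are chosen here for illustration only.

```latex
\[
\hat{s}_{AV} = w_A \hat{s}_A + w_V \hat{s}_V, \qquad
w_A = \frac{\sigma_V^2}{\sigma_A^2 + \sigma_V^2}, \qquad
w_V = \frac{\sigma_A^2}{\sigma_A^2 + \sigma_V^2}
\]
\[
\sigma_{AV}^2 = \frac{\sigma_A^2 \, \sigma_V^2}{\sigma_A^2 + \sigma_V^2}
\le \min\left(\sigma_A^2, \sigma_V^2\right)
\]
```

      Under these assumptions the combined variance equals half the harmonic mean of the two unimodal variances, so it is never larger than that of the more reliable cue.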

      (2) The authors show a good match for cue combination involving 2 cues. While Bayesian accounts provide a direction for extension to more cues (also seen empirically, e.g., in Hecht et al. 2008), a discussion of how the MCD model extends to more cues would benefit the readers.

      We thank the reviewer for this insightful comment: extending the MCD model to include more than two sensory modalities is a natural and valuable next step. Indeed, one of the strengths of the MCD framework lies in its modularity. Let us consider the MCDcorr output (Equation 6), which is computed as the pointwise product of transient inputs across modalities. Extending this to include a third modality, such as touch, is straightforward: MCD units would simply multiply the transient channels from all three modalities, effectively acting as trimodal coincidence detectors that respond when all inputs are aligned in time and space.

      By contrast, extending MCDlag is less intuitive, due to its reliance on opponency between two subunits (via subtraction). A plausible solution is to compute MCDlag in a pairwise fashion (e.g., AV, VT, AT), capturing relative timing across modality pairs.

      Importantly, the bulk of the spatial integration in our framework is carried by MCDcorr, which generalises naturally to more than two modalities. We have now formalised this extension and included a graphical representation in a supplementary section of the revised manuscript.
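
      The extension described above can be illustrated with a minimal sketch. It assumes only what the response states, namely that MCDcorr is a pointwise product of transient channels; any filtering or normalisation in the actual Equation 6 is omitted. The function names (mcd_corr, modality_pairs) and the toy signals are hypothetical and are not taken from the authors' MATLAB code.

```python
from itertools import combinations
from math import prod


def mcd_corr(*transient_channels):
    """Pointwise product of transient inputs across any number of modalities.

    Illustrative stand-in for the MCDcorr output; the real model may apply
    additional filtering that is not reproduced here.
    """
    lengths = {len(channel) for channel in transient_channels}
    if len(lengths) != 1:
        raise ValueError("all channels must have the same number of samples")
    return [prod(samples) for samples in zip(*transient_channels)]


def modality_pairs(names):
    """List every modality pair, as a pairwise lag computation would need."""
    return list(combinations(sorted(names), 2))


# Hypothetical toy transient signals (1.0 marks a transient event).
audio = [0.0, 1.0, 0.0, 1.0]
visual = [0.0, 1.0, 0.0, 0.0]
touch = [0.0, 1.0, 1.0, 0.0]

bimodal = mcd_corr(audio, visual)
trimodal = mcd_corr(audio, visual, touch)  # non-zero only where all three coincide
pairs = modality_pairs(["A", "V", "T"])
```

      In this toy case the trimodal output is non-zero only at the sample where all three channels carry a transient, which is the coincidence-detector behaviour described in the response.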

      Likely Impact and Usefulness:

      The work offers a compelling unification of multiple multisensory tasks- temporal order judgments, illusions, Bayesian causal inference, and overt visual attention - under a single, fully stimulus-driven framework. Its success with natural stimuli should interest computational neuroscientists, systems neuroscientists, and machine learning scientists. This paper thus makes an important contribution to the field by moving beyond minimalistic lab stimuli, illustrating how raw audio and video can be integrated using elementary correlation analyses.

      Reviewer #1 (Recommendations for the authors):

      Recommendations:

      My biggest concern is a lack of specificity about model fitting, which would be assuaged by the inclusion of sufficient detail to replicate the analysis completely or the inclusion of the analysis code. The code availability statement indicates that a script for the population model will be included, but it is unclear if this code will provide the fitting details for the whole of the analysis.

      We thank the reviewer for raising this important point. A new methodological section has been added to the manuscript, detailing the model fitting procedures used throughout the study. In addition, the accompanying code repository now includes MATLAB scripts that allow full replication of the spatiotemporal MCD simulations.

      Perhaps it could be enlightening to re-evaluate the model with a measure of error rather than correlation? And I think many researchers would be interested in the model's performance on unseen data.

      The model has now been re-evaluated using mean squared error (MSE), and the results remain consistent with those obtained using Pearson correlation. Additionally, we have clarified which parts of the study involve testing the model on unseen data (i.e., data not used to fit the temporal constants of the units). These analyses are now included and discussed in the revised fitting section of the manuscript (pages 23-24).
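
      Why an error measure can add information beyond correlation is sketched below. The example is generic and uses made-up numbers rather than any data from the study: a prediction that is a scaled and shifted copy of the observations has a Pearson correlation of 1 but a large mean squared error. The helper names are illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mse(x, y):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Hypothetical observed responses and a mis-scaled model prediction.
observed = [0.1, 0.4, 0.6, 0.9]
prediction = [2 * value + 1 for value in observed]

r = pearson_r(observed, prediction)    # insensitive to gain and offset
error = mse(observed, prediction)      # penalises gain and offset
```

      Agreement between the two measures, as reported above, therefore suggests that the fits are not merely correlated with the data but also close in absolute terms.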

      Otherwise, my concerns involve the interpretation of findings, and thus could be satisfied with minor rewording or tempering conclusions.

      The manuscript has been revised to address these interpretative concerns, with several conclusions reworded or tempered accordingly. All changes are marked in blue in the revised version.

      Miscellanea:

      Should b0 in equation 10 be bcrit to match the below text?

      Thank you for catching this inconsistency. We have corrected Equation 10 (and also Equation 21) to use the more transparent notation bcrit instead of b0, in line with the accompanying text.

      Equation 23, should time be averaged separately? For example, if multiple people are speaking, the average correlation for those frames will be higher than the average correlation across all times.

      We thank the reviewer for raising this thoughtful and important point. In response, we have clarified the notation of Equation 23 in the revised manuscript (page 20). Specifically, we now denote the averaging operations explicitly as spatial means and standard deviations across all pixel locations within each frame.

      This equation computes the z-score of the MCD correlation value at the current gaze location, normalized relative to the spatial distribution of correlation values in the same frame. That is, all operations are performed at the frame level, not across time. This ensures that temporally distinct events are treated independently and that the final measure reflects relative salience within each moment, not a global average over the stimulus. In other words, the spatial distribution of MCD activity is re-centered and rescaled at each frame, exactly to avoid the type of inflation or confounding the reviewer rightly cautioned against.
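
      The frame-level normalisation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation of Equation 23: each frame is a small 2D list of correlation values, the population standard deviation is used, and the function name gaze_zscore is hypothetical.

```python
from statistics import mean, pstdev


def gaze_zscore(frame, gaze_row, gaze_col):
    """Z-score of the value at the gaze location relative to the same frame.

    The mean and standard deviation are taken across all pixel locations of
    this one frame, never across time.
    """
    values = [value for row in frame for value in row]
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD; a sample SD would also work
    if sigma == 0:
        return 0.0  # flat frame: no location is more salient than another
    return (frame[gaze_row][gaze_col] - mu) / sigma


# Two hypothetical frames with the same spatial pattern but different overall
# correlation levels (e.g. one speaker versus several speakers).
quiet_frame = [[0.0, 0.0], [0.0, 4.0]]
busy_frame = [[10.0, 10.0], [10.0, 14.0]]

z_quiet = gaze_zscore(quiet_frame, 1, 1)
z_busy = gaze_zscore(busy_frame, 1, 1)
```

      Both frames yield the same z-score at the gaze location, showing that a frame with globally higher correlation does not inflate the measure.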

      Reviewer #2 (Recommendations for the authors):

      The authors have done a great job of providing a stimulus computable model of cue combination. I had just a few suggestions to strengthen the theoretical part of the paper:

      (1) While the authors have shown a good match between MCD and cue combination, some theoretical justification or equivalence analysis would benefit readers on how the two relate to each other. Something like Zhang et al. 2019 (which is for motion cue combination) would add to the paper.

      We agree that it is important to clarify the theoretical relationship between the Multisensory Correlation Detector (MCD) and normative models of cue integration, such as Bayesian combination. In the revised manuscript, we have now modified the introduction and added a paragraph in the Discussion addressing this link more explicitly. In brief, we see the MCD as an algorithmic-level implementation (in Marr’s terms) that may approximate or instantiate aspects of Bayesian inference.

      (2) Simulating cue combination for tasks that require integration of more than two cues (visual, auditory, haptic cues) would more strongly relate the correlation model to Bayesian cue combination. If that is a lot of work, at least discussing this would benefit the paper

      This point has now been addressed, and a new paragraph discussing the extension of the MCD model to tasks involving more than two sensory modalities has been added to the Discussion section.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Recommendations for the authors):

      (1) The onus of making the revisions understandable to the reviewers lies with the authors. In its current form, how the authors have approached the review is hard to follow, in my opinion. Although the authors have taken a lot of effort in answering the questions posed by reviewers, parallel changes in the manuscript are not clearly mentioned. In many cases, the authors have acknowledged the criticism in response to the reviewer, but have not changed their narrative, particularly in the results section.

      We fully acknowledge your concern regarding the narrative linking EB-induced GluCl expression to JH biosynthesis and fecundity enhancement, particularly the need to address alternative interpretations of the data. Below, we outline the specific revisions made to address your feedback and ensure the manuscript’s narrative aligns more precisely with the experimental evidence:

      (1) Revised Wording in the Results Section

      To avoid overinterpretation of causality, we have modified the language in key sections of the Results (e.g., Figure 5 and related text):

      Original phrasing:

      “These results suggest that EB activates GluCl which induces JH biosynthesis and release, which in turn stimulates reproduction in BPH (Figure 5J).”

      Revised phrasing:

      “We also examined whether silencing Gluclα impacts the AstA/AstAR signaling pathway in female adults. Knock-down of Gluclα in female adults was found to have no impact on the expression of AT, AstA, AstB, AstCC, AstAR, and AstBR. However, the expression of AstCCC and AstCR was significantly upregulated in dsGluclα-injected insects (Figure 5-figure supplement 2A-H). Further studies are required to delineate the direct or indirect mechanisms underlying this effect of Gluclα-knockdown.” (lines 643-649). We have also removed Figure 5J from the revised manuscript.

      (2) Expanded Discussion of Alternative Mechanisms

      In the Discussion section, we have incorporated a dedicated paragraph to explore alternative pathways and compensatory mechanisms:

      Key additions:

      “This EB action on GluClα expression is likely indirect, and we do not consider EB as transcriptional regulator of GluClα. Thus, the mechanism behind EB-mediated induction of GluClα remains to be determined. It is possible that prolonged EB exposure triggers feedback mechanisms (e.g. cellular stress responses) to counteract EB-induced GluClα dysfunction, leading to transcriptional upregulation of the channel. Hence, considering that EB exposure in our experiments lasts several days, these findings might represent indirect (or secondary) effects caused by other factors downstream of GluCl signaling that affect channel expression.” (line 837-845).

      (2) In the response to reviewers, the authors have mentioned line numbers in the main text where changes were made. But very frequently, those lines do not refer to the changes or mention just a subsection of changes done. As an example please see point 1 of Specific Points below. The problem is throughout the document making it very difficult to follow the revision and contributing to the point mentioned above.

      Thank you for highlighting this critical oversight. We sincerely apologize for the inconsistency in referencing line numbers and incomplete descriptions of revisions, which undoubtedly hindered your ability to track changes effectively. We have eliminated all vague or incomplete line number references from the response letter. Instead, revisions are now explicitly tied to specific sections, figures, or paragraphs.

      (3) The authors need to infer the performed experiments rationally without over interpretation. Currently, many of the claims that the authors are making are unsubstantiated. As a result of the first review process, the authors have acknowledged the discrepancies, but they have failed to alter their interpretations accordingly.

      We fully agree that overinterpretation of data undermines scientific rigor. In response to your feedback, we have systematically revised the manuscript to align claims strictly with experimental evidence and to eliminate unsubstantiated assertions. We sincerely apologize for the earlier overinterpretations and appreciate your insistence on precision. The revised manuscript now rigorously distinguishes between observations (e.g., EB-GluCl-JH correlations) and hypotheses (e.g., GluCl’s mechanistic role). By tempering causal language and integrating competing explanations, we aimed to present a more accurate and defensible narrative.

      SPECIFIC POINTS (to each question initially raised and their rebuttals)

      (1a) "Actually, there are many studies showing that insects treated with insecticides can increase the expression of target genes". Please note that what is asked for is evidence that the ligand itself induces the expression of its receptor. Of course, insecticide treatment will result in changed expression of targets. Of all the evidence furnished in the rebuttal, only Peng et al. 2017 fits the above definition. Even in this case, the accepted mode of action of chlorantraniliprole is by inducing structural change in the ryanodine receptor. The observed induction of the ryanodine receptor by chlorantraniliprole can best be described as a secondary effect. None of the other references really address the point asked for.

      We appreciate the reviewer’s suggestions for improving the manuscript. First, we have added text citing further studies that support this notion: “There are several studies showing that insects treated with insecticides display increases in the expression of target genes. For example, the relative expression level of the ryanodine receptor gene of the rice stem borer, Chilo suppressalis was increased 10-fold after treatment with chlorantraniliprole, an insecticide which targets the ryanodine receptor (Peng et al., 2017). In Drosophila, starvation (and low insulin) elevates the transcription level of the receptors of the neuropeptides short neuropeptide F and tachykinin (Ko et al., 2015; Root et al., 2011). In BPH, reduction in mRNA and protein expression of a nicotinic acetylcholine receptor α8 subunit is associated with resistance to imidacloprid (Zhang et al., 2015). Knockdown of the α8 gene by RNA interference decreased the sensitivity of N. lugens to imidacloprid (Zhang et al., 2015). Hence, the expression of receptor genes may be regulated by diverse factors, including insecticide exposure.” We have inserted this text in lines 846-857 to elaborate on these possibilities.

      Second, we would like to reiterate our position: we have merely described this phenomenon, specifically that EB treatment increases GluClα expression. “This EB action on GluClα expression is likely indirect, and we do not consider EB as transcriptional regulator of GluClα. Thus, the mechanism behind EB-mediated induction of GluClα remains to be determined. It is possible that prolonged EB exposure triggers feedback mechanisms (e.g. cellular stress responses) to counteract EB-induced GluClα dysfunction, leading to transcriptional upregulation of the channel. Hence, considering that EB exposure in our experiments lasts several days, these findings might represent indirect (or secondary) effects caused by other factors downstream of GluCl signaling that affect channel expression.” We have inserted text in lines 837-845 to elaborate on these possibilities.

      Once again, we sincerely appreciate this discussion, which has provided us with a deeper understanding of this phenomenon.

      b. The authors in their rebuttal accept that they do not consider EB to be a transcriptional regulator of Gluclα and that the induction of Gluclα as a result of EB can best be considered a secondary effect. But that is not reflected in the manuscript, particularly in the results section. The current state of the writing implies EB upregulation of Gluclα to be an important event that contributes majorly to the hypothesis. So much so that they have retained the schematic diagram (Fig. 5J) where EB -> Gluclα is drawn. Even the heading of the subsection says "EB-enhanced fecundity in BPHs is dependent on its molecular target protein, the Gluclα channel". As mentioned in the general points, it is not enough to have a good rebuttal written to the reviewer, the parent manuscript needs to reflect on the changes asked for.

      Thank you for your comments. We have carefully addressed your suggestions and made corresponding revisions to the manuscript.

      We fully acknowledge the reviewer's valid concern. The revised manuscript now states: “However, we do not propose that EB is a direct transcriptional regulator of Gluclα, since EB and other avermectins are known to alter the channel conformation and thus their function (Wolstenholme, 2012; Wu et al., 2017). Thus, it is likely that the observed increase in Gluclα transcript is a secondary effect downstream of EB signaling.” (Lines 625-629). We agree that the original presentation in the manuscript, particularly within the Results section, did not adequately reflect this nuance and could be misinterpreted as suggesting a direct regulatory role for EB on Gluclα transcription.

      Regarding Fig. 5J, we have removed the figure and all mentions of Fig. 5J and its legend in the revised manuscript.

      c. "We have inserted text on lines 738 - 757 to explain these possibilities." Not a single line in the section mentioned above discussed the topic in hand. This is serious undermining of the review process or carelessness to the extreme level.

      In the Results section, we have now added the following statement: “Taken together, these results reveal that EB exposure is associated with an increase in JH titer and that this elevated JH signaling contributes to enhanced fecundity in BPH.” (lines 375-377).

      For the figures, we have removed Fig. 4N and all mentions of Fig. 4N and its legend in the revised manuscript.

      Lastly, regarding the issue of locating specific lines, we deeply regret any inconvenience caused. Due to the track changes mode used during revisions, line numbers may have shifted, resulting in incorrect references. We sincerely apologize for this and have now corrected the line numbers.

      (2) The section written in the rebuttal should be included in the discussion as well, explaining why the authors think a nymphal treatment with JH may work in increasing fecundity of the adults. Also, the authors accept that EB's effect on JH titer is indirect. The text of the manuscript, results section and figures should be reflective of that. It is NOT ok to accept that EB impacts JH titer indirectly in a rebuttal letter while still continuing to portray a direct effect of EB on JH titer. In terms of diagrams, authors cannot put a -> sign unless the effect is direct. This is an accepted norm in biological publications.

      We appreciate the reviewer’s valuable suggestions here. We have now carefully revised the manuscript to address all concerns, particularly regarding the mechanism linking nymphal EB exposure to adult fecundity and the indirect nature of EB’s effect on JH titers. Below are our point-by-point responses and corresponding manuscript changes. Revised text is clearly marked in the resubmitted manuscript.

      (1) Clarifying the mechanism linking nymphal EB treatment to adult fecundity:

      Reviewer concern: Explain why nymphal EB treatment increases adult fecundity despite undetectable EB residues in adults.

      Response & Actions Taken:

      We agree this requires explicit discussion. We now propose that nymphal EB exposure triggers developmental reprogramming (e.g., metabolic/epigenetic changes) that persist into adulthood, indirectly enhancing JH synthesis and fecundity. This is supported by two key findings:

      (1) No detectable EB residues in adults after nymphal treatment (new Figure 1–figure supplement 1C).

      (2) Increased adult weight and nutrient reserves (Figure 1–figure supplement 3E,F), suggesting altered resource allocation.

      Added to Discussion (Lines 793–803): Notably, after exposing fourth-instar BPH nymphs to EB, no EB residues were detected in the subsequent adult stage. This finding indicates that the EB-induced increase in adult fecundity is initiated during the nymphal stage and manifests in adulthood - a mechanism distinct from the direct enhancement of fecundity observed when EB is applied to adults. We propose that sublethal EB exposure during critical nymphal stages may reprogram metabolic or endocrine pathways, potentially via insulin/JH crosstalk. For instance, increased nutrient storage (e.g., proteins, sugars; Figure 2–figure supplement 2) could enhance insulin signaling, which in turn promotes JH biosynthesis in adults (Ling and Raikhel, 2021; Mirth et al., 2014; Sheng et al., 2011). Future studies should test whether EB alters insulin-like peptide expression or signaling during development.

      (3) Emphasizing EB’s indirect effect on JH titers:

      Reviewer concern: The manuscript overstated EB’s direct effect on JH. Arrows in figures implied causality where only correlation exists.

      Response & Actions Taken:

      We fully agree. EB’s effect on JH is indirect and multifactorial (via AstA/AstAR suppression, GluCl modulation, and metabolic changes). We have:

      Removed oversimplified schematics (original Figures 3N, 4N, 5J).

      Revised all causal language (e.g., "EB increases JH" → "EB exposure is associated with increased circulating JH III") (line 739).

      Clarified in Results/Discussion that EB-induced JH changes are likely secondary to neuroendocrine disruption.

      Key revisions:

      Results (Lines 375–377):

      "Taken together, these results reveal that EB exposure is associated with an increase in JH titer and that JH signaling contributes to enhanced fecundity in BPH."

      Discussion (Lines 837–845):

      This EB action on GluClα expression is likely indirect, and we do not consider EB to be a transcriptional regulator of GluClα. Thus, the mechanism behind EB-mediated induction of GluClα remains to be determined. It is possible that prolonged EB exposure triggers feedback mechanisms (e.g. cellular stress responses) to counteract EB-induced GluClα dysfunction, leading to transcriptional upregulation of the channel. Hence, considering that EB exposure in our experiments lasts several days, these findings might represent indirect (or secondary) effects caused by other factors downstream of GluCl signaling that affect channel expression.

      a. Lines 281-285 as mentioned, does not carry the relevant information.

      Thank you for your careful review of our manuscript. We sincerely apologize for the confusion regarding line references in our previous response. Due to extensive revisions and tracked changes during the revision process, the line numbers shifted, resulting in incorrect citations for Lines 281–285. The correct location for the added results (EB-induced increase in mature eggs in adult ovaries) is now in lines 253-258: “We furthermore observed that EB treatment of female adults also increases the number of mature eggs in the ovary (Figure 2-figure supplement 1).”

      b. Lines 351-356 as mentioned, does not carry the relevant information. Lines 281-285 as mentioned, does not carry the relevant information.

      Thank you for your careful review of our manuscript. We sincerely apologize for the confusion regarding line references in our previous response. The correct location for the added results is now in lines 366-371: “We also investigated the effects of EB treatment on the JH titer of female adults. The data indicate that the JH titer was also significantly increased in the EB-treated female adults compared with controls (Figure 3-figure supplement 3A). However, again the steroid 20-hydroxyecdysone, was not significantly different between EB-treated BPH and controls (Figure 3-figure supplement 3B).”

      c. Lines 378-379 as mentioned, does not carry the relevant information. Lines 387-390 as mentioned, does not carry the relevant information.

      We sincerely apologize for the confusion regarding line references in our previous response.

      The correct location for the added results is now in lines 393-394: We furthermore found that EB treatment in female adults increases JHAMT expression (Figure 3-figure supplement 3C).

      The other correct location for the added results is now in lines 405-408: We found that Kr-h1 was significantly upregulated in the adults of EB-treated BPH at the 5M, 5L nymph and 4 to 5 DAE stages (4.7-fold to 27.2-fold) when 4th-instar nymphs or female adults were treated with EB (Figure 3H and Figure 3-figure supplement 3D).

      (3) The writing quality is still extremely poor. It does not meet any publication standard, let alone that of eLife.

      We fully understand your concerns and frustrations, and we sincerely apologize for the deficiencies in our writing quality, which did not meet the high standards expected by you and the journal. We fully accept your criticism regarding the writing quality and have rigorously revised the manuscript according to your suggestions.

      (4) I am confused whether Figure 2B was redone or just edited. Otherwise this seems acceptable to me.

      Regarding Fig. 2B, we have edited the text on the y-axis. The previous wording included the term “retention,” which may have caused misunderstanding for both the readers and yourself, leading to the perception of contradiction. We have now revised this wording to ensure accurate comprehension.

      (5) The rebuttal is accepted. However, still some of the lines mentioned does not hold relevant information.

      This error has been corrected.

      The correct location for the added results is now in lines 255-258 and lines 279-282: “Hence, although EB does not affect the normal egg developmental stages (see description in next section), our results suggest that EB treatment promotes oogenesis and, as a result, the insects both produce more eggs in the ovary and lay a larger number of eggs.” and “However, considering that the number of eggs laid by EB-treated females was larger than in control females (Figure 1 and Figure 1-figure supplement 1), our data indicate that EB treatment of BPH can promote both oogenesis and oviposition.”

      (6) Thank you for the clarification. Although now discussed extensively in discussion section, the nuances of indirect effect and minimal change in expression should also be reflected in the result section text. This is to ensure that readers have clear idea about content of the paper.

      Corrected. To ensure readers gain a clear understanding of our data, we have briefly presented these discussions in the Results section. Please see lines 397-402: The levels of Met mRNA slightly increased in EB-treated BPH at the 5M and 5L instar nymph and 1 to 5 DAE adult stages compared to controls (1.7-fold to 2.9-fold) (Figure 3G). However, it should be mentioned that JH action does not result in an increase of Met. Thus, it is possible that other factors (indirect effects) induced by EB treatment cause the increase in the mRNA expression level of Met.

      (7) As per the authors' interpretation, it becomes critical to quantitate the amount of EB present at the adult stages after a 4th instar exposure to it. Only this experiment will unambiguously prove the authors' claim. Also, since they have done adult insect exposure to EB, such experiments should be systematically performed for as many sections as possible. Don't just focus on the few instances where reviewers have pointed out the issue.

      Thank you for raising this critical point. To address this concern, we have conducted new supplementary experiments. The new experimental results demonstrate that residual levels of emamectin benzoate (EB) in adult-stage brown planthoppers (BPH) were below the instrument detection limit following treatment of 4th instar nymphs with EB. Lines 172-184: “To determine whether EB administered during the fourth-instar larval stage persists as residues in the adult stage, we used HPLC-MS/MS to quantify the amount of EB present at the adult stage after exposing 4th-instar nymphs to this compound. However, we found no detectable EB residues in the adult stage following fourth-instar nymphal treatment (Figure 1-figure supplement 1C). This suggests that the mechanism underlying the increased fecundity of female adults induced by EB treatment of nymphs may differ from that caused by direct EB treatment of female adults. Combined with our previous observation that EB treatment significantly increased the body weight of adult females (Figure 1—figure supplement 3E and F), a possible explanation for this phenomenon is that EB may enhance food intake in BPH, potentially leading to elevated production of insulin-like peptides and thus increased growth. Increased insulin signaling could potentially also stimulate juvenile hormone (JH) biosynthesis during the adult stage (Badisco et al., 2013).”

      (8) Thank you for the revision. Lines 725-735 as mentioned, does not carry the relevant information. However, since the authors have decided to remove this systematically from the manuscript, discussion on this may not be required.

      Thank you for identifying the limited relevance of the content in Lines 725–735 of the original manuscript. As recommended, we have removed this section in the revised version to improve logical coherence and maintain focus on the core findings.

      (9) Normally, dsRNA would last for some time in the insect system and would down-regulate any further induction of target genes by EB. I suggest the authors to measure the level of the target genes by qPCR in KD insects before and after EB treatment to clear the confusion and unambiguously demonstrate the results. Please Note- such quantifications should be done for all the KD+EB experiments. Additionally, citing few papers where such a rescue effect has been demonstrated in closely related insect will help in building confidence.

      We appreciate the reviewer’s suggestion to clarify the interaction between RNAi-mediated gene knockdown (KD) and EB treatment. To address this, we performed additional experiments measuring Kr-h1 expression via qPCR in dsKr-h1-injected insects before and after EB exposure.

      The results (now Figure 3–figure supplement 4) show that:

      (1) EB did not rescue *Kr-h1* suppression at 24h post-treatment (*p* > 0.05).

      (2) Partial recovery of fecundity occurred later (Figure 3M), likely due to:

      a) Degradation of dsRNA over time, reducing KD efficacy (Liu et al., 2010).

      b) Indirect effects of EB (e.g., hormonal/metabolic reprogramming) compensating for residual Kr-h1 suppression.

      Please see lines 441-453: “Next, we investigated whether EB treatment could rescue the dsRNA-mediated gene silencing effect. To address this, we selected the Kr-h1 gene and analyzed its expression levels after EB treatment. Our results showed that Kr-h1 expression was suppressed by ~70% at 72 h post-dsRNA injection. However, EB treatment did not significantly rescue Kr-h1 expression in gene knockdown insects (*p* > 0.05) at 24 h post-EB treatment (Figure 3-figure supplement 4). While dsRNA-mediated Kr-h1 suppression was robust initially, its efficacy may decline during prolonged experiments. This aligns with reports in BPH, where effects of RNAi gradually diminish beyond 7 days post-injection (Liu et al., 2010a). The late-phase fecundity increase might reflect partial Kr-h1 recovery due to RNAi degradation, allowing residual EB to weakly stimulate reproduction. In addition, the physiological impact of EB (e.g., neurotoxicity, hormonal modulation) could manifest via compensatory feedback loops or metabolic remodeling.”

      (10) Not a very convincing argument. Besides without a scale bar, it is hard for the reviewers to judge the size of the organism. Whole body measurements of JH synthesis enzymes will remain as a quite a drawback for the paper.

      In response to your suggestion, we have also included images with scale bars (see Author response image 1). The images show that the head region is difficult to separate from the brown thoracic sclerite region. Furthermore, the anatomical position of the corpora allata in brown planthoppers has never been reported, making dissection uncertain and highly challenging. To address this, we are now attempting to use Drosophila as a model to investigate how EB regulates JH synthesis and reproduction.

      Author response image 1. This illustration provides a visual representation of the brown planthopper (BPH), a major rice pest.

      (11) "The phenomenon reported was specific to BPH and not found in other insects. This limits the implications of the study". This argument still holds. Combined with extreme species specificity, the general effect that EB causes brings into question the molecular specificity that the authors claim about the mode of action.

      We acknowledge that the specificity of the phenomenon to BPH may limit its broader implications, but we would like to emphasize that this study provides important insights into the unique biological mechanisms in BPH, a pest of significant agricultural importance. The molecular specificity we described in the manuscript is based on rigorous experimental evidence. We believe that it contributes valuable knowledge for understanding the interaction between external factors such as EB and BPH, and the resurgence of pests. We hope that this study will inspire further research into the mechanisms underlying similar phenomena in other insects, thereby broadening our understanding of insect biology. Since EB also has an effect on fecundity in Drosophila, albeit opposite to that in BPH (Fig. 1 suppl. 2), it seems likely that EB actions may be of more general interest in insect reproduction.

      (12) The authors have added a few lines in the discussion but it does not change the overall design of the experiments. In this scenario, they should infer the performed experiments rationally without over interpretation. Currently, many of the claims that the authors are making are unsubstantiated. As a result of the first review process, the authors have acknowledged the discrepancies, but they have failed to alter their interpretations accordingly.

      We appreciate your concern regarding the experimental design and the need for rational inference without overinterpretation. In response, we would like to clarify that our discussion is based on the experimental data we have collected. We acknowledge that our study focuses on BPH and the specific effects of EB, and while we agree that broader generalizations require further research, we believe the new findings we present are valid and contribute to the understanding of this specific system.

      We also acknowledge the discrepancies you mentioned and have carefully considered your suggestions. In this revised version, we believe our interpretations are reasonable and consistent with the data, and we have adjusted our discussion to better reflect the scope of our findings. We hope that these revisions address your concerns. Thank you again for your constructive feedback.

      ADDITIONAL POINTS

      (1) Only one experiment was performed with Abamectin. No titration for the dosage were done for this compound, or at least not provided in the manuscript. Inclusion of this result will confuse readers. While removing this result does not impact the manuscript at all. My suggestion would be to remove this result.

      We acknowledge that the abamectin experiment lacks dose-titration details and that its standalone presentation could lead to confusion. However, we respectfully request to retain these results for the following reasons:

      Class-Specific Mechanism Validation:

      Abamectin and emamectin benzoate (EB) are both macrocyclic lactones targeting glutamate-gated chloride channels (GluCls). The observed similarity in their effects on BPH fecundity (e.g., Figure 1—figure supplement 1B) supports the hypothesis that GluCl modulation, rather than compound-specific off-target effects, drives the reproductive enhancement. This consistency strengthens the mechanistic argument central to our study.

      (2) The section "The impact of EB treatment on BPH reproductive fitness" is poorly described. This needs elaboration. A line or two should be included to describe why the parameters chosen to decide reproductive fitness were selected in the first place. I see that the definition of brachypterism has undergone a change from the first version of the manuscript. Can you provide an explanation for that? Also, there is no rationale behind inclusion of statements on insulin at this stage. The authors have not investigated insulin. Including that here will confuse readers. This can be added in the discussion though.

      Thank you for your suggestion. We have added an explanation regarding the primary consideration of evaluating reproductive fitness. In the interaction between sublethal doses of insecticides and pests, reproductive fitness is a key factor, as it accurately reflects the potential impact of insecticides on pest control in the field. Among the reproductive fitness parameters, factors such as female Nilaparvata lugens body weight, lifespan, and brachypterous ratio (as short-winged N. lugens exhibit higher oviposition rates than long-winged individuals) are critical determinants of reproductive success. Therefore, we comprehensively assessed the effects of EB on these parameters to elucidate the primary mechanism by which EB influences reproduction. We sincerely appreciate your constructive feedback.

      (3) "EB promotes ovarian maturation in BPH" this entire section needs to be rewritten and attention should be paid to the sequence of experiments described.

      Thank you for your suggestion. Based on your recommendation, we have rewritten this section (lines 267–275) and adjusted the sequence of experimental descriptions to improve the structural clarity of this part.

      (4) Figure 3N is outright wrong and should be removed or revised.

      In accordance with your recommendation, we have removed the figure.

      (5) When you are measuring hormonal titers, it is important to mention explicitly whether you are measuring hemolymph titer or whole body.

      We believe we have explicitly stated in the Methods section (line 1013) that we measured whole-body hormone titers. However, we have now also added this information to the figure legends.

      (6)  EB induces JH biosynthesis through the peptidergic AstA/AstAR signaling pathway- this section needs attention at multiple points. Please check.

      We acknowledge that direct evidence for EB-AstA/AstAR interaction is limited and have framed these findings as a hypothesis for future validation.

      References

      Liu, S., Ding, Z., Zhang, C., Yang, B., Liu, Z., 2010. Gene knockdown by intro-thoracic injection of double-stranded RNA in the brown planthopper, Nilaparvata lugens. Insect Biochem. Mol. Biol. 40, 666-671

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public Review):

      Summary:

      Previous studies have shown that treatment with 17α-estradiol (a stereoisomer of the 17β-estradiol) extends lifespan in male mice but not in females. The current study by Li et al, aimed to identify cell-specific clusters and populations in the hypothalamus of aged male rats treated with 17α-estradiol (treated for 6 months). This study identifies genes and pathways affected by 17α-estradiol in the aged hypothalamus.

      Strengths:

      Using single-nucleus transcriptomic sequencing (snRNA-seq) on the hypothalamus from aged male rats treated with 17α-estradiol they show that 17α-estradiol significantly attenuated age-related increases in cellular metabolism, stress, and decreased synaptic activity in neurons.

      Thanks.

      Moreover, sc-analysis identified GnRH as one of the key mediators of 17α-estradiol's effects on energy homeostasis. Furthermore, they show that CRH neurons exhibited a senescent phenotype, suggesting a potential side effect of the 17α-estradiol. These conclusions are supported by supervised clustering by neuropeptides, hormones, and their receptors.

      Thanks.

      Weaknesses:

      However, the study has several limitations that reduce the strength of the key claims in the manuscript. In particular:

      (1) The study focused only on males and did not include comparisons with females. However, previous studies have shown that 17α-estradiol extends lifespan in a sex-specific manner in mice, affecting males but not females. Without the comparison with the female data, it's difficult to assess its relevance to the lifespan.

      This study was originally designed based on previous findings indicating that lifespan extension is only effective in males, leading to the exclusion of females from the analysis. The primary focus of our research was on the transcriptional changes and serum endocrine alterations induced by 17α-estradiol in aged males compared to untreated aged males. We believe that even in the absence of female subjects, the significant effects of 17α-estradiol on metabolism in the hypothalamus, synapses, and endocrine system remain evident, particularly regarding the expression levels of GnRH and testosterone. Notably, lower overall metabolism, increased synaptic activity, and elevated levels of GnRH and testosterone are strong indicators of health and well-being in males, supporting the validity of our primary conclusions. However, including female controls would enhance the depth of our findings. If female controls were incorporated, we propose redesigning the sample groups to include aged male control, aged female control, aged female treated, aged male treated, as well as young male control, young male treated, young female control, and young female treated. We regret that we cannot provide this data in the short term. Nevertheless, we believe this reviewer’s creative idea presents a valuable avenue for future research on this topic. In this study, we emphasize the role of 17α-estradiol in overall metabolism, synaptic function, GnRH, and testosterone in aged males and underscore the importance of supervised clustering of neuropeptide-secreting neurons in the hypothalamus.

      (2) It is not known whether 17α-estradiol leads to lifespan extension in male rats similar to male mice. Therefore, it is not possible to conclude that the observed effects in the hypothalamus, are linked to the lifespan extension.

      Thank you for the reminder. 17α-estradiol has been reported to extend lifespan in male rats, similar to male mice (PMID: 33289482). We have added this reference to the Introduction in the new version.

      (3) The effect of 17α-estradiol on non-neuronal cells such as microglia and astrocytes is not well-described (Figure 1). Previous studies demonstrated that 17α-estradiol reduces microgliosis and astrogliosis in the hypothalamus of aged male mice. Current data suggest that the proportion of oligo, and microglia were increased by the drug treatment, while the proportions of astrocytes were decreased. These data might suggest possible species differences, differences in the treatment regimen, or differences in drug efficiency. This has to be discussed.

      We have reviewed reports describing changes in cell numbers following 17α-estradiol treatment in the brain, using the keywords "17α-estradiol," "17alpha-estradiol," and "microglia" or "astrocyte." Only a limited amount of data was obtained. We found one article indicating that 17α-estradiol treatment in Tg (AβPP(swe)/PS1(ΔE9)) model mice resulted in a decreased microglial cell number compared to the placebo (AβPP(swe)/PS1(ΔE9) mice), but this change was not significant when compared to the non-transgenic control (PMID: 21157032). The transgenic AβPP(swe)/PS1(ΔE9) mouse model may differ from our wild-type aging rat model in this context.

      Moreover, the calculation of cell numbers was based on visual observation under a microscope across several brain tissue slices. This traditional method often yields controversial results. For example, oligodendrocytes in the corpus callosum, fornix, and spinal cord have been reported to be 20-40% more numerous in males than in females based on microscopic observations (PMID: 16452667). In contrast, another study found no significant difference in the number of oligodendrocytes between sexes when using immunohistochemistry staining (PMID: 18709647). Such discrepancies arising from traditional observational methods are inevitable.

      We believe the data presented in this article are reliable because the cell number and cell ratio data were derived from high-throughput cell counting of the entire hypothalamus using single-cell suspension and droplet wrapping (10x Genomics).

      (4) A more detailed analysis of glial cell types within the hypothalamus in response to drugs should be provided.

      We have provided additional enrichment analysis data for genes differentially expressed between Y, O, and O.T in microglia and astrocytes in Figure 2—figure supplement 3. In these supplemental data, we found that, unlike neurons, microglia (Micro) displayed lower levels of synapse-related cellular processes in O.T compared to O.

      (5) The conclusion that CRH neurons are going into senescence is not clearly supported by the data. A more detailed analysis of the hypothalamus such as histological examination to assess cellular senescence markers in CRH neurons, is needed to support this claim.

      We also noted the inappropriate claim and have changed "senescent phenotype" to "stressed phenotype" and "abnormal phenotype" in both the abstract and results sections. The stressed phenotype could be induced by heightened functional activity in the cells, potentially indicating higher cellular activity. The GnRH and CRH neurons discussed in this paper may represent such a case, as illustrated by the observed high serum GnRH, testosterone, and cortisol levels. This revision suggestion is highly valuable and constructive for our understanding of the unique physiological characteristics revealed by these data.

      Reviewer #2 (Public Review):

      Summary:

      Li et al. investigated the potential anti-ageing role of 17α-Estradiol on the hypothalamus of aged rats. To achieve this, they employed a very sophisticated method for single-cell genomic analysis that allowed them to analyze effects on various groups of neurons and non-neuronal cells. They were able to sub-categorize neurons according to their capacity to produce specific neurotransmitters, receptors, or hormones. They found that 17α-Estradiol treatment led to an improvement in several factors related to metabolism and synaptic transmission by bringing the expression levels of many of the genes of these pathways closer or to the same levels as those of young rats, reversing the ageing effect. Interestingly, among all neuronal groups, the proportion of Oxytocin-expressing neurons seems to be the one most significantly changing after treatment with 17α-Estradiol, suggesting an important role of these neurons in mediating its anti-ageing effects. This was also supported by an increase in circulating levels of oxytocin. It was also found that gene expression of corticotropin-releasing hormone neurons was significantly impacted by 17α-Estradiol even though it was not different between aged and young rats, suggesting that these neurons could be responsible for side effects related to this treatment. This article revealed some potential targets that should be further investigated in future studies regarding the role of 17α-Estradiol treatment in aged males.

      Strengths:

      (1) Single-nucleus mRNA sequencing is a very powerful method for gene expression analysis and clustering. The supervised clustering of neurons was very helpful in revealing otherwise invisible differences between neuronal groups and helped identify specific neuronal populations as targets.

      Thanks.

      (2) There is a variety of functions used that allow the differential analysis of a very complex type of data. This led to a better comparison between the different groups on many levels.

      Thanks.

      (3) There were some physiological parameters measured such as circulating hormone levels that helped the interpretation of the effects of the changes in hypothalamic gene expression.

      Thanks.

      Weaknesses

      (1) One main control group is missing from the study, the young males treated with 17α-Estradiol.

      Given that the treatment period lasts six months, which extends beyond the young male rats' age range, we aimed to investigate the perturbation of the normal aging process by 17α-Estradiol. Including data from young males could potentially obscure the treatment's effects in aged males due to age effects, though similar effects between young and aged animals may exist. Long-term hormone treatment may exert more developmental effects on young animals than on old ones. Consequently, we decided to exclude this group from our initial sample design. We apologize for this omission.

      (2) Even though the technical approach is a sophisticated one, analyzing the whole rat hypothalamus instead of specific nuclei or subregions makes the study weaker.

      The precise targets of 17α-Estradiol within the hypothalamus remain unresolved. Selecting a specific nucleus for study is challenging. The supervised clustering method described in this manuscript allows us to identify the more sensitive neuron subtypes influenced by 17α-Estradiol and aging across the entire hypothalamus, without the need to isolate specific nuclei in a disturbed hypothalamic environment.

      (3) Although the authors claim to have several findings, the data fail to support these claims. You may be referring to the claim of a senescent phenotype in Crh neurons induced by 17α-estradiol.

      Thanks. We have changed the "senescent phenotype" to "stressed phenotype" in the abstract and results to avoid such claim. The stressed phenotype may be induced by heightened functional activity in the cells, potentially indicating higher cellular activity.

      (4) The study is about improving ageing but no physiological data from the study demonstrated such a claim with the exception of the testes histology which was not properly analyzed and was not even significantly different between the groups.

      The primary objective of this study is to elucidate the effects of 17α-Estradiol on the endocrine system in the aging hypothalamus; exploring anti-aging effects is not the main focus. Previous publications have established that down-regulated GnRH and testosterone levels, along with elevated mTOR signaling, are indicators of aging (PMID: 37886966, PMID: 37048056, PMID: 22884327). The contrasting signaling networks related to metabolism and synaptic processes significantly differentiate young and aging hypothalami, and 17α-Estradiol helps rebalance these networks, suggesting its potential anti-aging effects.

      (5) Overall, the study remains descriptive with no physiological data to demonstrate that any of the effects on hypothalamic gene expression are related to metabolic, synaptic, or other functions.

      The study focuses on investigating cellular responses and endocrine changes in the aging hypothalamus induced by 17α-estradiol, utilizing single-nucleus RNA sequencing (snRNA-seq) and a novel data mining methodology to analyze various neuron subtypes. It is important to note that this study does not primarily aim to explore anti-aging effects. Consequently, we have revised the claim in the abstract from “the effects of 17α-estradiol in anti-aging in neurons” to “the effects of 17α-estradiol on aging neurons.” We observed that the lower overall metabolism and increased expression levels of cellular processes in the synapses align with findings previously reported for 17α-estradiol. To address the lack of physiological data and the challenges in measuring multiple endocrine factors due to their volatile nature, we employed several bidirectional Mendelian randomization analyses of various genome-wide association study (GWAS) data related to these serum endocrine factors to identify their mutual causal effects.

      Reviewing Editor Comment:

      Based on the Public Reviews and Recommendations for Authors, the Reviewers strongly recommend that revisions include an experimental demonstration of the physiological effects of the treatment on ageing in rats as well as the CRH-senescence link. Additional analysis of the glia would greatly strengthen the study, as would inclusion of females and young male controls. The important point was also raised that the work linking 17a-estradiol was performed in mice, and the link with lifespan in rats is not known. Discussion of this point is recommended.

      We thank the reviewers for their constructive feedback. Regarding the recommendations in the Public Reviews and Recommendations for Authors:

      a)  Physiological effects & CRH-senescence link:

      We acknowledge that 17α-estradiol has been reported to extend lifespan in male rats, consistent with findings in male mice (PMID: 33289482). This point has now been noted in the Introduction. We regret that further experimental validation of the treatment's physiological effects on aging in rats was beyond the scope of this study.

      b) Phenotype terminology:

      In response to concerns about the "senescent" characterization of CRH neurons, we have revised this terminology to "stressed phenotype" throughout the abstract and results. While we were unable to conduct additional experiments to confirm senescence markers, this revised description better reflects the heightened cellular activity observed (as evidenced by elevated serum GnRH and testosterone levels), without implying confirmed senescence.

      c) Glial cell analysis:

      To address questions about glial cell function during treatment, we have added new enrichment analysis data of differentially expressed genes in microglia and astrocytes from young (Y), old (O), and old treated (O.T) groups in Figure 2—figure supplement 3. This analysis reveals that microglia exhibit contrasting synaptic-related cellular processes compared to total neurons.

      d) Female and young controls:

      We sincerely apologize for the absence of female subjects and young male controls in the current study. The reviewers' suggestion to examine the male-specific effects of 17α-estradiol using female controls represents an excellent direction for future research, which we plan to pursue in upcoming studies.

      Reviewer #2 (Recommendations For The Authors):

      General comments:

      (1) The manuscript is very hard to read. Proofreading and editing by software or a professional seems necessary. The words "enhanced", "extensive" etc. are not always used in the right way.

      Thanks for the suggestion. We have proofread and edited the manuscript. The uses of "enhanced" and "extensive" were also revised in most sentences.

      (2) The numbers of animals and samples are not well explained. Is it 9 rats overall or per group? If there are 8 testes samples per group, should we assume that there were 4 rats per group? The pooling of the hypothalamic how was it done? Were all the hypothalamic from each group pooled together? A small table with the animals per group and the samples would help.

      We appreciate your reminder regarding the initial mistake in our manuscript preparation. In the preliminary submission, we reported 9 rats based solely on sequencing data and data mining. The revised version (v1) now includes additional experimental data, with an effective total of 12 animals (4 per group). Unfortunately, we overlooked updating this information in the v1 submission. We have since added detailed information in the Materials and Methods sections: Animals, Treatment and Tissues, and snRNA-seq Data Processing, Batch Effect Correction, and Cell Subset Annotation.

      (3) The Clustering is wrong. There are genes in there that do not fall into any of the 3 categories: Neurotransmitters, Receptors, Hormones.

      We acknowledge the error in gene clustering and have implemented the following corrections:

      (a) The description has been updated to state: 'Vast majority of these subtypes were clustered by neuropeptides, hormones, and their receptors among all neurons.'

      (b) Genes not belonging to these three categories have been substantially removed.

      (c) The neuropeptide category (now including several growth hormones) has been expanded to 104 genes, while their corresponding receptors (including several sex hormone receptors) now comprise 105 genes.

      (4) The coloring of groups in the graphs is inconsistent. It must be more homogeneous to make it easier to identify.

      We have changed the colors of groups in Fig. 1D to make the color of cell clusters consistent in Fig. 1A-D.

      (5) The groups c1-c4 are not well explained. How did the authors come up with these?

      We have added more descriptions of c1-c4 in materials and methods in the new version.

      (6) In most cases it's not clear if the authors are talking about cell numbers that express a certain mRNA, the level of expression of a certain mRNA, or both. They need to do a better job using more precise descriptions instead of using general terms such as "signatures", "expression profiles", "affected neurons" etc. It is very hard to understand if the number of neurons is compared between the groups or the gene expression.

      We have changed "signatures" to "gene signatures" to make the meaning more precise. "Affected neurons" was also changed to "sensitive neurons". We apologize that we were unable to find a better alternative to "expression profiles".

      (7) Sometimes there are claims made without justification or a reference. For example, the claim about the senescence of CRH neurons due to the upregulation of mitochondrial genes and downregulation of adherence junction genes (lines 326-328) should be supported by a reference or own findings.

      The "senescence" here is not appropriate. We have changed it to "stressed phenotype" or "aberrant changes" in abstract and results.

      (8) Young males treated with Estradiol as a control group is necessary and it is missing.

      Your suggestion is appreciated; however, the treatment duration for aged mice (O.T) was set at 6 months, while the young mice were only 4 months old. This disparity makes it challenging to align treatment timelines for the young animals. The primary aim of this study is to investigate how 17α-estradiol perturbs the aging process, and any distinct age-related effects observed in young males might complicate our understanding of its role in aged males, though similar endocrine effects may exist in the young animals. Long-term hormone treatment may also exert more developmental effects on young animals than on old ones. Therefore, we decided to exclude young treated samples from our initial study design. We apologize for any confusion this may have caused.

      Specific Comments:

      Line 28: "elevated stresses and decreased synaptic activity": Please make this clearer. Can't claim changes in synaptic activity by gene expression.

      We have changed it to "the expression level of pathways involved in synapse"

      Line 32: "increased Oxytocin": serum Oxytocin.

      We have added the “serum”.

      Line 52 - 54: Any studies from rats?

      Thanks. 17α-estradiol has also been reported to have metabolic roles in rats similar to those in mice (PMID: 33289482), and we have added this study to the references. It is very useful for this manuscript.

      Line 62 - 65: It wasn't investigated thoroughly in this paper so why was it suggested in the introduction?

      We have deleted this sentence as suggested.

      Line 70: "synaptic activity" Same as line 28.

      We have changed it to "pathways involved in synaptic activity".

      Line 79: Why were aged rats caged alone and young by two? Could that introduce hypothalamic gene expression effects?

      The young males could be housed together peacefully, but aged males fight and must be housed alone.

      Lines 78, 99, 109-110: It is not clear how many animals per group were used and how many samples per group were used separately and/or grouped. Please be more specific.

      We have added this information to Materials and methods/Animals, treatment and tissues and Materials and methods/snRNA-seq data processing, batch effect correction, and cell subset annotation.

      Line 205: "in O" please add "versus young.".

      We have changed accordingly.

      Line 207: replace "were" with "was"

      We have alternatively changed the "proportion" to "proportions".

      Line 208: replace "that" with "compared to" and after "in O.T." add "compared to?"

      We have changed accordingly.

      Line 223: "O.T." compared to what? Figure?

      We have changed it accordingly.

      Line 227: Figure?

      We have added (Figure 1E) accordingly.

      Line 229: "synaptic activity" Same as line 28.

      We have revised it.

      Line 235: "synaptic activity" and "neuropeptide secretion" Same as line 28.

      We have revised it.

      Line 256:" interfered" please revise.

      We changed to "exerted".

      Line 263: "on the contrary" please revise.

      We have changed "on the contrary" to "opposite".

      Line 270: "conversed" did you mean "conserved"?

      We have changed "conversed" to "inversed".

      Line 296-298: Please explain. Why would these be side effects?

      It is hard to explain; therefore, we deleted the words "side effects".

      Line 308: "synaptic activity" Same as line 28.

      We have changed it to "expression levels of synapse-related cellular processes".

      Line 314: "and sex hormone secretion and signaling"Isn't this expected?

      Yes, it is expected. We have added it to the sentence "and, as expected, sex hormone secretion and signaling".

      Line 325-328: Why is this senescence? Reference?

      We have added “potent” to it.

      Line 360-361: This doesn't show elevated synaptic activity.

      "elevated synaptic activity" was changed to "The elevated expression of synapse-related pathways"

      Line 363-364: "Unfortunately" is not a scientific expression and show bias.

      We have changed it to "Notably".

      Line 376: Similar as above.

      Yes, we have changed it to "in contrast".

      Lines 382-385: This is speculation. Please move to discussion.

      Sorry for that. We think the causal effects derived from the MR results constitute evidence. As such, we have not changed it.

      Line 389: Please revise "hormone expressing".

      We have changed it accordingly.

      Line 401: Isn't this effect expected due to feedback inhibition of the biochemical pathway? Please comment.

      The binding capability of 17alpha-estradiol to estrogen receptors and its role in transcriptional activation remain core questions surrounded by controversy. Earlier studies suggest that 17alpha-estradiol exhibits at least 200 times less activity than 17beta-estradiol (PMID: 2249627, PMID: 16024755). However, recent data indicate that 17alpha-estradiol shows comparable genomic binding and transcriptional activation through estrogen receptor α (Esr1) to that of 17beta-estradiol (PMID: 33289482). Additionally, there is evidence that 17alpha-estradiol has anti-estrogenic effects in rats (PMID: 16042770). These findings imply possible feedback inhibition via estrogen receptors. Furthermore, 17alpha-estradiol likely differs from 17beta-estradiol due to its unique metabolic consequences and its potential to slow aging in males, an effect not attributed to 17beta-estradiol. For instance, neurons are also targets of 17alpha-estradiol, with Esr1 not being the sole target (PMID: 38776045). Intriguingly, neurons expressing Ar and Esr1 ranked among the top 20 most perturbed receptor subtypes during aging (O vs Y), but were no longer ranked in this group following treatment (O.T vs Y and O.T vs O comparisons). This indicates that 17α-estradiol administration attenuated age-associated perturbation in these neuronal subtypes, which may be a consequence of potential feedback (Figure 3D). Nevertheless, the precise effective targets of 17alpha-estradiol are still unresolved.

      Line 409: This conclusion cannot be made because the effect is not statistically significant. Can say "trend" etc.

      Thanks for the recommendation. We have added "potential" in front of the conclusion.

      Line 426: "suggesting" please revise.

      Sorry, it’s a verb.

      Lines 426-428: This is speculation. Please move to discussion.

      The elevated GnRH levels in O.T., observed through EIA analysis, suggest a deduction regarding the direct causal effects of 17alpha-estradiol on various endocrine factors related to feeding, energy homeostasis, reproduction, osmotic regulation, stress response, and neuronal plasticity through MR analysis. Thus, we have not amended our position. We apologize for any confusion.

      Lines 431-432: improved compared to what?

      The statement has been revised to "The most striking role of 17α-estradiol treatment revealed in this study showed that HPG axis was substantially improved in the levels of serum Gnrh and testosterone".

      Line 435: " Estrogen Receptor Antagonists". Please revise.

      Thanks for the recommendation. We have changed it to "estrogen receptor antagonists".

      Line 438" "Secrete". Please revise

      Sorry, it is "secret".

      Lines 439-449: None of this has been demonstrated. Please remove these conclusions.

      We appreciate the reviewer's scrutiny regarding lines 439-449. While these statements should not be interpreted as definitive conclusions from our current data, we propose they serve as clinically relevant discussion points worthy of exploration. Our findings demonstrate 17α-estradiol's role in modulating testosterone levels in aged males. This mechanistic insight warrants consideration of its therapeutic potential for age-related hypogonadism - a hypothesis we believe merits discussion given the compound's specific endocrine effects.

      Lines 450-457: No females were included in this study. Why? Also, why is this discussed? It is relevant but doesn't belong in this manuscript since it was not studied here.

      Testosterone levels are crucial for male health, while estradiol levels are essential for the health and fertility of females. Previous studies have demonstrated that 17α-estradiol does not contribute to lifespan extension in females. Given the effects of 17α-estradiol on males—specifically, its role in promoting testosterone and reducing estradiol levels—we believe it is important to discuss the potential sex-biased effects of 17α-estradiol, as this could inform future investigations. We have refined this section to clarify that these points represent mechanistic hypotheses derived from our male data and existing literature, not conclusions about unstudied female physiology. This framing maintains the discussion's scientific value while respecting the study's scope.

      Lines 458-459: This was not demonstrated in this article. Please remove.

      We have restricted the claim to "expression level of energy metabolism in hypothalamic neurons".

      Line 464: "Promoted lifespan extension" Not demonstrated. Please remove.

      The end of the sentence was revised to "which may be a contributing factor in promoting lifespan extension".

      Line 466: "Showed" No.

      The whole sentence was deleted in the new version.

      Line 483: "the sex-based effects". Not studied here.

      Since the changes in testosterone levels are significant in this dataset and this hormone has a sex-biased nature, we find it worthwhile to suggest this as a topic for future investigation. We have added "which needs further verification in the future" at the end of this sentence.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      In this manuscript, Dillard and colleagues integrate cross-species genomic data with a systems approach to identify potential driver genes underlying human GWAS loci and establish the cell type(s) within which these genes act and potentially drive disease. Specifically, they utilize a large single-cell RNA-seq (scRNA-seq) dataset from an osteogenic cell culture model - bone marrow-derived stromal cells cultured under osteogenic conditions (BMSC-OBs) - from a genetically diverse outbred mouse population called the Diversity Outbred (DO) stock to discover network driver genes that likely underlie human bone mineral density (BMD) GWAS loci. The DO mice segregate over 40M single nucleotide variants, many of which affect gene expression levels, therefore making this an ideal population for systems genetic and co-expression analyses. The current study builds on previously published work from the same group that used co-expression analysis to identify co-expressed "modules" of genes that were enriched for BMD GWAS associations. In this study, the authors utilize a much larger scRNA-seq dataset from 80 DO BMSC-OBs, infer co-expression-based and Bayesian networks for each identified mesenchymal cell type, focused on networks with dynamic expression trajectories that are most likely driving differentiation of BMSC-OBs, and then prioritized genes ("differentiation driver genes" or DDGs) in these osteogenic differentiation networks that had known expression or splicing QTLs (eQTL/sQTLs) in any GTEx tissue that colocalized with human BMD GWAS loci. The systems analysis is impressive, the experimental methods are described in detail, and the experiments appear to be carefully done. The computational analysis of the single-cell data is comprehensive and thorough, and the evidence presented in support of the identified DDGs, including Tpx2 and Fgfrl1, is for the most part convincing. 
      Some limitations in the data resources and methods hamper enthusiasm somewhat and are discussed below. Overall, while this study will no doubt be valuable to the BMD community, the cross-species data integration and analytical framework may be more valuable and generally applicable to the study of other diseases, especially for diseases with robust human GWAS data but for which robust human genomic data in relevant cell types is lacking. 

      Specific strengths of the study include the large scRNA-seq dataset on BMSC-OBs from 80 DO mice, the clustering analysis to identify specific cell types and sub-types, the comparison of cell type frequencies across the DO mice, and the CELLECT analysis to prioritize cell clusters that are enriched for BMD heritability (Figure 1). The network analysis pipeline outlined in Figure 2 is also a strength, as is the pseudotime trajectory analysis (results in Figure 3). One weakness involves the focus on genes that were previously identified as having an eQTL or sQTL in any GTEx tissue. The authors rightly point out that the GTEx database does not contain data for bone tissue, but reason that eQTLs can be shared across many tissues. This assumption is valid for many cis-eQTLs, but it could also exclude many genes as potential DDGs with effects that are specific to bone/osteoblasts. Indeed, the authors show that important BMD driver genes have cell-type-specific eQTLs. Furthermore, the mesenchymal cell type-specific co-expression analysis by iterative WGCNA identified an average of 76 co-expression modules per cell cluster (range 26-153). Based on the limited number of genes that are detected as expressed in a given cell due to sparse per-cell read depth (400-6200 reads/cell) and dropouts, it's hard to believe that as many as 153 co-expression modules could be distinguished within any cell cluster. I would suspect some degree of model overfitting here and would expect that many/most of these identified modules have very few gene members, but the methods list a minimum module size of 20 genes. How do the numbers of modules identified in this study compare to other published scRNA-seq studies that use iterative WGCNA? 

      In the section "Identification of differentiation driver genes (DDGs)", the authors identified 408 significant DDGs and found that 49 (12%) were reported by the International Mouse Knockout [sic] Consortium (IMPC) as having a significant effect on whole-body BMD when knocked out in mice. Is this enrichment significant? E.g., what is the background percentage of IMPC gene knockouts that show an effect on whole-body BMD? Similarly, they found that 21 of the 408 DDGs were genes that have BMD GWAS associations that colocalize with GTEx eQTLs/sQTLs. Given that there are > 1,000 BMD GWAS associations, is this enrichment (21/408) significant? Recommend performing a hypergeometric test to provide statistical context to the reported overlaps here. 

      We thank the reviewer for their constructive feedback and thoughtful questions. In regards to the iterativeWGCNA, a larger number of modules is sometimes an outcome of the analysis, as reported in the iterativeWGCNA preprint (Greenfest-Allen et al., 2017). While we did not make a comparison to other works leveraging this tool for scRNA-seq, it has been used broadly across other published studies, such as PMID: 39640571, 40075303, 33677398, 33653874. While model overfitting, as you mention, may be a cause for more modules, our Bayesian network analysis we perform after iterativeWGCNA highlights smaller aspects of coexpression modules, as opposed to focusing on the entirety of any given module.

      We did not perform enrichment or statistical tests as our goal was to simply highlight attributes or unique features of these genes for additional context.
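      For readers who want to see what the recommended hypergeometric test would involve, the following is a minimal sketch in Python, not the authors' analysis. It computes the upper-tail probability of observing at least a given overlap when a gene set is drawn from a background. The 408 DDGs and the 49 overlapping genes come from the review text above; the background sizes are hypothetical placeholders, since neither the review nor the response reports them, and the function name is illustrative only.

```python
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """Return P(X >= observed) for X ~ Hypergeometric(population, successes, draws).

    population: number of genes in the background (e.g. all genes tested)
    successes:  number of background genes carrying the annotation of interest
    draws:      size of the gene set being tested (e.g. the DDG list)
    observed:   number of genes in the set that carry the annotation
    """
    if not (0 <= successes <= population and 0 <= draws <= population):
        raise ValueError("successes and draws must lie between 0 and population")
    upper = min(successes, draws)
    total = comb(population, draws)
    # Sum the probability mass from the observed overlap up to the maximum possible.
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(max(observed, 0), upper + 1)
    )
    return tail / total


# HYPOTHETICAL background values, chosen only to make the example run.
# They are NOT taken from the study or from the IMPC.
assumed_background_genes = 8000  # assumed: genes with a BMD result available
assumed_background_hits = 600    # assumed: of those, genes with a significant BMD effect

p_value = hypergeom_upper_tail(assumed_background_genes, assumed_background_hits, 408, 49)
print(f"P(overlap >= 49 of 408) under the assumed background: {p_value:.3g}")
```

      The test is only meaningful if all 408 genes belong to the chosen background, so in practice the gene set would first be restricted to genes that were actually tested.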

      Reviewer #2 (Public review): 

      Summary: 

      In this manuscript, Farber and colleagues have performed single-cell RNAseq analysis on bone marrow-derived stem cells from DO Mice. By performing network analysis, they look for driver genes that are associated with bone mineral density GWAS associations. They identify two genes as potential candidates to showcase the utility of this approach. 

      Strengths: 

      The study is very thorough and the approach is innovative and exciting. The manuscript contains some interesting data relating to how cell differentiation is occurring and the effects of genetics on this process. The section looking for genes with eQTLs that differ across the differentiation trajectory (Figure 4) was particularly exciting. 

      Weaknesses: 

      The manuscript is in parts hard to read due to the use of acronyms and there are some questions about data analysis that need to be addressed. 

      We thank the reviewer for their feedback and shared enthusiasm for our work. We tried to minimize the use of technical acronyms as much as we could without compromising readability. Additionally, we addressed questions regarding aspects of data analysis. 

      Reviewer #1 (Recommendations for the authors):

      (1) For increased transparency and to allow reproducibility, it would be necessary for the scripts used in the analysis to be shared along with the publication of the preprint. Also, where feasible, sharing the processed data in addition to the raw data would allow the community greater access to the results and be highly beneficial. 

      Thank you for this suggestion. The raw data will be available via GEO accession codes listed in the data availability statement. We will make available scripts for some analyses on our GitHub (https://github.com/Farber-Lab/DO80_project) and processed scRNA-seq data in a Seurat object (.rds) on Zenodo (https://zenodo.org/records/15299631).

      (2) Lines 55-76: I think the summary of previous work here is too long. I understand that they would like to cover what has been done previously, but this seems like overkill. 

      Good suggestion. We have streamlined some of the summary of our previous work.

      (3) Did the authors try to map QTL for cell-type proportion differences in their BMSC-OBs? While 80 samples certainly limit mapping power, the data shown in Figs 4C/D suggest that you might identify a large-effect modifier of LMP/OB1 proportions. 

      We did try to map QTL for cell type proportion differences, but no significant associations were identified. 

      (4) Methods question: Does the read alignment method used in your analysis account for SNPs/indels that segregate among the DO/CC founder strains? If not, the authors may wish to include this in their discussion of study limitations and speculate on how unmapped reads could affect expression results. 

      The read alignment method we used does not account for SNPs/indels from the DO founder strains that fall in RNA transcripts captured in the scRNA-seq data. We have included this as a limitation in our discussion (line 422-424). 

      (5) Much of the discussion reads as an overview of the methods, while a discussion of the results and their context to the existing BMD literature is relatively lacking in comparison.

      We have added additional explanation of the results and context to the discussion (line 381-382, 396-407). 

      (6) Figure 1E and lines 146-149: Adjusted p values should be reported in the figure and accompanying text instead of switching between unadjusted and adjusted p values. 

      We updated Figure 1e to portray adjusted p-values, listed the adjusted p-values in legend of Figure 1e, and listed them in the main text (line 153-154).

      (7) Why do the authors bring the IMPC KO gene list into the analysis so late? This seems like a highly relevant data resource (moreso than the GTEx eQTLs/sQTLs) that could have been used much earlier to help identify DDGs. 

      Given that our scRNA-seq data is also from mice, we did choose to integrate information from the IMPC to highlight supplemental features of genes in networks (i.e., genes that have an experimentally-tested and significant effect on BMD in mice). However, our primary goal was to inform human GWAS and leverage our previous work in which we identified colocalizations between human BMD GWAS and eQTL/sQTL in a human GTEx tissue, which is why this information was used to guide our network analysis.

      (8) Does Fgfrl1 and/or Tpx2 have a cis-eQTL in your BMSC-OB scRNA-seq dataset? 

      We did not identify cis-eQTL effects for Fgfrl1 and Tpx2.

      (9) Figure 4B-C: These eQTLs may be real, but based on the diplotype patterns in Figure 4C, I suspect they are artifacts of low mapping power that are driven by rare genotype classes with one or two samples having outlier expression results. For example, if you look at the results in Fig 4C for S100a1 expression, the genotype classes with the highest/lowest expression have lower sample numbers. In the case of Pkm eQTL showing a PWK-low effect, the PWK genome has many SNPs that differ from the reference genome in the 3' UTR of this gene, and I wonder if reads overlapping these SNPs are not aligning correctly (see point 4 above) and resulting (falsely) in lower expression values for samples with a PWK haplotype. 

      As mentioned above, our alignment method did not consider DO founder genetic variation that is specifically located in the 3’ end of RNA transcripts in the scRNA-seq data. We have included this as a limitation in our discussion (line 422-424).

      In future studies, we intend to include larger populations of mice to potentially overcome, as you mention, any artifacts that may be attributable to low statistical power, rare genotype classes, or outlier expression.

      Reviewer #2 (Recommendations for the authors):

      Major Points 

      (1) The authors hypothesize "that many genes impacting BMD do so by influencing osteogenic differentiation or possibly bone marrow adipogenic differentiation". However, cell type itself does not correlate with any bone trait. Does this indicate that the hypothesis is not entirely correct, as genes that drive these phenotypes would not be enriched in one particular cell type? The authors have previously identified "high-priority target genes". So, are there any cell types that are enriched for these target genes? If not, this would indicate that all these genes are more ubiquitously expressed and this is probably why they would have a greater effect on the overall bone traits. Furthermore, are the 73 eGenes (so genes with eQTLs in a particular cell type that change around cell type boundaries) or the DDGs (Table 1) enriched for these high-priority target genes? 

      The bone traits measured in the DO mice are complex and impacted by many factors, including the differentiation propensity and abundance of certain cell types, both within and outside of bone. Though we did not identify correlations between cell type abundance and the bone traits we measured, we tailored our investigations to focus on cellular differentiation using the scRNA-seq data. However, future studies would need to be performed to investigate any connections between cellular differentiation, cell type abundance, and bone traits.

      We did not perform enrichment analyses of either the target genes identified from our other work or eGenes identified here, but instead used the target gene list to center our network analysis and the eGenes to showcase the utility of the DO mouse population.

      (2) The readability of the paper could be improved by minimising the use of acronyms and there are several instances of confusing wording throughout the paper. In many cases, this can be solved by re-organising sentences and adding a bit more detail. For example, it was unclear how you arrived at Fgfrl1 or Tpx2.

      One of the goals of our study was to identify genes that have (to our knowledge) little to no known connection to BMD. We chose to highlight Fgfrl1 and Tpx2 because there is minimal literature characterizing these genes in the context of bone, which we speak to in the results (line 296-297). Additionally, we prioritized these genes in our previous work, and they were identified in this study through our network analyses of the scRNA-seq data, which we mention in the results (line 276-279).

      (3) Technical aspects of the assay. In Figure 1d you show that the cell populations vary considerably between different DO mice. It would be useful to give some sense of the technical variance of this assay given that the assay involves culturing the cells in an exogenous environment. This could take the form of tests between mice within the same inbred strain, or even between different legs of the same DO mice to show that results are technically very consistent. It might also be prudent to identify that this is a potential limitation of the approach as in vitro culturing has the potential to substantially change the cell populations that are present. 

      We agree that in vitro culturing and the preparation of single cells for scRNA-seq are unavoidable sources of technical variation in this study. However, the total number of cells contributed by each of the 80 DO mice after data processing does not appear to be skewed and the distribution appears normal (see added figures, now included as Supplemental Figure 3). Therefore, technical variation is at least consistent across all samples. Nevertheless, we have mentioned the potential for technical variation artifacts in our study in the discussion (line 414-416).

      (4) Need for permutation testing. "We identified 563 genes regulated by a significant eQTL in specific cell types. In total, 73 genes with eQTLs were also tradeSeq-identified genes in one or more cell type boundaries". These types of statements are fine but they need to be backed up with permutation testing to show that this level of enrichment is greater than one would expect by chance. 

      We did not perform enrichment tests, as our only goals were to (1) determine whether eQTL could be resolved in the DO mouse population using our scRNA-seq data and (2) predict in which cell type the eQTL and its associated eGene may have an effect.
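      As an illustration of the permutation testing the reviewer requests, here is a minimal Python sketch, offered under stated assumptions rather than as the authors' method. It holds one gene set fixed, repeatedly draws random gene sets of the same size as the second set from a gene universe, and reports how often the random overlap is at least as large as the observed one. The function name, the gene universe, and the toy gene identifiers are all hypothetical; a real analysis would use the genes tested for eQTL as the universe and might need to match on expression level.

```python
import random


def overlap_permutation_p(universe, set_a, set_b, n_perm=1000, seed=0):
    """Empirical p-value for the overlap between set_a and set_b.

    universe: all genes that could have been selected (hypothetical input)
    set_a:    first gene set, held fixed (e.g. genes with an eQTL)
    set_b:    second gene set (e.g. genes flagged at cell type boundaries)
    Returns (observed_overlap, p_value).
    """
    pool = sorted(universe)  # fixed order so results are reproducible for a given seed
    set_a, set_b = set(set_a), set(set_b)
    observed = len(set_a & set_b)
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(n_perm):
        # Draw a random gene set of the same size as set_b from the universe.
        random_b = rng.sample(pool, len(set_b))
        if len(set_a.intersection(random_b)) >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimate away from an impossible p-value of zero.
    return observed, (at_least_as_large + 1) / (n_perm + 1)


# Toy data with synthetic gene names, for illustration only.
toy_universe = [f"gene{i}" for i in range(200)]
toy_eqtl_genes = toy_universe[:40]
toy_boundary_genes = toy_universe[20:50]
overlap, p_value = overlap_permutation_p(toy_universe, toy_eqtl_genes, toy_boundary_genes)
print(f"observed overlap = {overlap}, empirical p = {p_value:.3g}")
```

      The smallest p-value this procedure can report is 1 / (n_perm + 1), so the number of permutations sets the resolution of the result.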

      (5) The main novelty of the paper seems to be that you have used single-cell RNA seq (given that you appear to have already detailed the candidates at the end). I don't think this makes the paper less interesting, but I think you need to reframe the paper more about the approach, and not the specific results. How you landed on these candidates is also not clear. So the paper might be improved by more robustly establishing the workflow and providing guidelines for how studies like this should be conducted in the future. 

      We sought to not only devise a rigorous approach to analyze our single cell data, but also showcase the utility of the approach in practice by highlighting targets for future research (i.e., Fgfrl1 and Tpx2).

      Our goal was to identify novel genes and we landed on these candidate genes (Fgfrl1 and Tpx2) because they had substantial data supporting their causality and they have yet to be fully characterized in the context of bone and BMD (lines 295-297).

      In regards to establishing the workflow, we have included rationale for specific aspects of our approach throughout the paper. For example, Figure 2 itemizes each step of our network analysis, and we explain why each step is utilized throughout various parts of the results (e.g., lines 168-170, 179-181, 191-193, 202-203, 257-260, 276-277).

      We have added a statement advocating for large-scale scRNA-seq from genetically diverse samples and network analyses for future studies (lines 436-438).

      Minor Points 

      (1) In the summary you use the word "trajectory". Trajectories for what? I assume the transition between cell types, but this is not clear. 

      We added text to clarify the use of trajectory in the summary (line 34).

      (2) This sentence: "By identifying networks enriched for genes implicated in GWAS we predicted putatively causal genes for hundreds of BMD associations based on their membership in enriched modules." is also not clear. Do you mean: "we predicted putatively causal genes by identifying clusters of co-expressed genes that were enriched for GWAS genes"? It is not clear how you identify the causal gene in the network. Is this just based on the hub gene? 

      The aforementioned sentence has since been removed to streamline the introduction, as suggested by Reviewer 1.

      In regards to causal gene identification, prioritization is not based on whether a gene is a hub gene. We prioritized a DDG (and its associated network) if it was a causal gene that we identified in our previous work as having an eQTL/sQTL in a GTEx tissue that colocalizes with human BMD GWAS.

      (3) Figure 3C. This is good but the labels are quite small. Would be good to make all the font sizes larger. 

      We have enlarged Figure 3C.

      (4) Line 341 in the Discussion should be "pseudotemporal". 

      We have edited “temporal” to “pseudotemporal”.

    1. Reviewer #1 (Public review):

      This is a well-designed and very interesting study examining the impact of imprecise feedback on outcomes on decision-making. I think this is an important addition to the literature and the results here, which provide a computational account of several decision-making biases, are insightful and interesting.

      I do not believe I have substantive concerns related to the actual results presented; my concerns are more related to the framing of some of the work. My main concern is regarding the assertion that the results prove that non-normative and non-Bayesian learning is taking place. I agree with the authors that their results demonstrate that people will make decisions in ways that demonstrate deviations from what would be optimal for maximizing reward in their task under a strict application of Bayes rule. I also agree that they have built reinforcement learning models which do a good job of accounting for the observed behavior. However, the Bayesian models included are rather simple- per the author descriptions, applications of Bayes' rule with either fixed or learned credibility for the feedback agents. In contrast, several versions of the RL models are used, each modified to account for different possible biases. However more complex Bayes-based models exist, notably active inference but even the hierarchical gaussian filter. These formalisms are able to accommodate more complex behavior, such as affect and habits, which might make them more competitive with RL models. I think it is entirely fair to say that these results demonstrate deviations from an idealized and strict Bayesian context; however, the equivalence here of Bayesian and normative is I think misleading or at least requires better justification/explanation. This is because a great deal of work has been done to show that Bayes optimal models can generate behavior or other outcomes that are clearly not optimal to an observer within a given context (consider hallucinations for example) but which make sense in the context of how the model is constructed as well as the priors and desired states the model is given.

      As such, I would recommend that the language be adjusted to carefully define what is meant by normative and Bayesian and to recognize that work that is clearly Bayesian could potentially still be competitive with RL models if implemented to model this task. An even better approach would be to directly use one of these more complex modelling approaches, such as active inference, as the comparator to the RL models, though I would understand if the authors would want this to be a subject for future work.

      Abstract:

      The abstract is lacking in some detail about the experiments done, but this may be a limitation of the required word count? If word count is not an issue, I would recommend adding details of the experiments done and the results. One comment is that there is an appeal to normative learning patterns, but this suggests that learning patterns have a fixed optimal nature, which may not be true in cases where the purpose of the learning (e.g. to confirm the feeling of safety of being in an in-group) may not be about learning accurately to maximize reward. This can be accommodated in a Bayesian framework by modelling priors and desired outcomes. As such the central premise that biased learning is inherently non-normative or non-Bayesian I think would require more justification. This is true in the introduction as well.

      Introduction:

      As noted above, the conceptualization of Bayesian learning being equivalent to normative learning I think requires further justification. Bayesian belief updating can be biased and non-optimal from an observer perspective, while being optimal within the agent doing the updating if the priors/desired outcomes are set up to advantage these "non-optimal" modes of decision making.

      Results:

      I wonder why the agent was presented before the choice - since the agent is only relevant to the feedback after the choice is made. I wonder if that might have induced any false association between the agent identity and the choice itself. This is by no means a critical point but would be interesting to get the authors' thoughts.

      The finding that positive feedback increases learning is one that has been shown before and depends on valence, as the authors note. They expanded their reinforcement learning model to include valence; but they did not modify the Bayesian model in a similar manner. This lack of a valence or recency effect might also explain the failure of the Bayesian models in the preceding section where the contrast effect is discussed. It is not unreasonable to imagine that if humans do employ Bayesian reasoning that this reasoning system has had parameters tuned based on the real world, where recency of information does matter; affect has also been shown to be incorporable into Bayesian information processing (see the work by Hesp on affective charge and the large body of work by Ryan Smith). It may be that the Bayesian models chosen here require further complexity to capture the situation, just like some of the biases required updates to the RL models. This complexity, rather than being arbitrary, may be well justified by decision making in the real world.

      The methods mention several symptom scales- it would be interesting to have the results of these and any interesting correlations noted. It is possible that some of individual variability here could be related to these symptoms, which could introduce precision parameter changes in a Bayesian context and things like reward sensitivity changes in an RL context.

      Discussion:

      (For discussion, not a specific comment on this paper): One wonders also about participant beliefs about the experiment or the intent of the experimenters. I have often had participants tell me they were trying to "figure out" a task or find patterns even when this was not part of the experiment. This is not specific to this paper, but it may be relevant in the future to try and model participant beliefs about the experiment especially in the context of disinformation, when they might be primed to try and "figure things out".

      As a general comment, in the active inference literature, there has been discussion of state-dependent actions, or "habits", which are learned in order to help agents more rapidly make decisions, based on previous learning. It is also possible that what is being observed is that these habits are at play, and that they represent the cognitive biases. This is likely especially true given, as the authors note, the high cognitive load of the task. It is true that this would mean that full-force Bayesian inference is not being used in each trial, or in each experience an agent might have in the world, but this is likely adaptive on the longer timescale of things, considering resource requirements. I think in this case you could argue that we have a departure from "normative" learning, but that is not necessarily a departure from any possible Bayesian framework, since these biases could potentially be modified by the agent or eschewed in favor of more expensive full-on Bayesian learning when warranted. Indeed, in their discussion on the strategy of amplifying credible news sources to drown out low-credibility sources, the authors hint at the possibility of longer-term strategies that may produce optimal outcomes in some contexts, but which were not necessarily appropriate to this task. As such, the performance on this task, and the consideration of true departure from Bayesian processing, should be considered in this wider context. Another thing to consider is that Bayesian inference is occurring, but that priors present going in produce the biases, or these biases arise from another source, for example factoring in epistemic value over rewards when the actual reward is not large. This again would be covered under an active inference approach, depending on how the priors are tuned. Indeed, given the benefit of social cohesion in an evolutionary perspective, some of these "biases" may be the result of adaptation. For example, it might be better to amplify people's good qualities and minimize their bad qualities in order to make it easier to interact with them; this entails a cost (in this case, not adequately learning from feedback and potentially losing out sometimes), but may fulfill a greater imperative (improved cooperation on things that matter). Given the right priors/desired states, this could still be a Bayes-optimal inference at a social level and as such may be ingrained as a habit which requires effort to break at the individual level during a task such as this.

      The authors note that this task does not relate to "emotional engagement" or "deep, identity-related, issues". While I agree that this is likely mostly true, it is also possible that just being told one is being lied to might elicit an emotional response that could bias responses, even if this is a weak response.

      Comments on revisions:

      In their updated version the authors have made some edits to address my concerns regarding the framing of the 'normative' Bayesian model, clarifying that they utilized a simple Bayesian model which is intended to adhere in an idealized manner to the intended task structure, though further simulations would have been ideal.

      The authors, however, did not take my recommendation to explore the symptoms in the symptom scales they collected as being a potential source of variability. They note that these were for hypothesis generation and were exploratory, fair enough, but this study is not small and there should have been sufficient sample size for a very reasonable analysis looking at symptom scores.

      However, overall the toned down claims and clarifications of intent are adequate responses to my previous review.

    2. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      This is a well-designed and very interesting study examining the impact of imprecise feedback on outcomes in decision-making. I think this is an important addition to the literature, and the results here, which provide a computational account of several decision-making biases, are insightful and interesting.

      We thank the reviewer for highlighting the strengths of this work.

      I do not believe I have substantive concerns related to the actual results presented; my concerns are more related to the framing of some of the work. My main concern is regarding the assertion that the results prove that non-normative and non-Bayesian learning is taking place. I agree with the authors that their results demonstrate that people will make decisions in ways that demonstrate deviations from what would be optimal for maximizing reward in their task under a strict application of Bayes' rule. I also agree that they have built reinforcement learning models that do a good job of accounting for the observed behavior. However, the Bayesian models included are rather simple, per the author's descriptions, applications of Bayes' rule with either fixed or learned credibility for the feedback agents. In contrast, several versions of the RL models are used, each modified to account for different possible biases. However, more complex Bayes-based models exist, notably active inference, but even the hierarchical Gaussian filter. These formalisms are able to accommodate more complex behavior, such as affect and habits, which might make them more competitive with RL models. I think it is entirely fair to say that these results demonstrate deviations from an idealized and strict Bayesian context; however, the equivalence here of Bayesian and normative is, I think, misleading or at least requires better justification/explanation. This is because a great deal of work has been done to show that Bayes optimal models can generate behavior or other outcomes that are clearly not optimal to an observer within a given context (consider hallucinations for example), but which make sense in the context of how the model is constructed as well as the priors and desired states the model is given.

      As such, I would recommend that the language be adjusted to carefully define what is meant by normative and Bayesian and to recognize that work that is clearly Bayesian could potentially still be competitive with RL models if implemented to model this task. An even better approach would be to directly use one of these more complex modelling approaches, such as active inference, as the comparator to the RL models, though I would understand if the authors would want this to be a subject for future work.

      We thank the reviewer for raising this crucial and insightful point regarding the framing of our results and the definitions of 'normative' and 'Bayesian' learning. Our primary aim in this work was to characterize specific behavioral signatures that demonstrate deviations from predictions generated by a strict, idealized Bayesian framework when learning from disinformation (which we term “biases”). We deliberately employed relatively simple Bayesian models as benchmarks to highlight these specific biases. We fully agree that more sophisticated Bayes-based models (as mentioned by the reviewer, or others) could potentially offer alternative mechanistic explanations for participant behavior. However, we currently do not have a strong notion about which Bayesian models can encompass our findings, and hence, we leave this important question for future work.

      To enhance clarity within the current manuscript, we now avoid using the term “normative” to refer to our Bayesian models, using the term “ideal” instead. We also define more clearly what exactly we mean by that notion when the ideal model is described:

      “This model is based on the idealized assumption that during the feedback stage of each trial, the value of the chosen bandit is updated (based on feedback valence and credibility) according to Bayes' rule, reflecting perfect adherence to the instructed task structure (i.e., how true outcomes and feedback are generated).”
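
      To make the quoted description concrete, here is a minimal sketch of one such update. It is not the authors' implementation. It assumes that credibility is the probability that the agent reports the true outcome, that a lying agent reports the opposite outcome, and that beliefs about a bandit's reward probability are held on a discrete grid. All names and credibility values are hypothetical.

```python
def bayes_update(posterior, grid, feedback, credibility):
    """One Bayes-rule update of the belief over a bandit's reward probability.

    posterior   : probabilities over `grid` (sums to 1)
    feedback    : +1 (agent reports reward) or -1 (agent reports no reward)
    credibility : assumed probability that the agent reports the true outcome
    """
    updated = []
    for p, w in zip(grid, posterior):
        # P(report reward | p): truthful and rewarded, or lying and unrewarded
        p_report_reward = credibility * p + (1 - credibility) * (1 - p)
        likelihood = p_report_reward if feedback > 0 else 1 - p_report_reward
        updated.append(w * likelihood)
    total = sum(updated)
    return [u / total for u in updated]

def expected_value(posterior, grid):
    """Posterior mean of the reward probability."""
    return sum(p * w for p, w in zip(grid, posterior))

grid = [i / 100 for i in range(101)]
prior = [1 / len(grid)] * len(grid)   # flat prior (an assumption)

# Posterior mean after one positive report, for three hypothetical credibilities.
values = {c: expected_value(bayes_update(prior, grid, +1, c), grid)
          for c in (0.5, 0.75, 1.0)}
for c, v in values.items():
    print(c, round(v, 3))
```

      Under these assumptions, feedback from a source at chance credibility (0.5) leaves the belief unchanged, and the same report moves the belief further as credibility rises.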

      Moreover, we have added a few sentences in the discussion commenting on how more complex Bayesian models might account for our empirical findings:

      “However, as hypothesized, when facing potential disinformation, we also find that individuals exhibit several important biases i.e., deviations from strictly idealized Bayesian strategies. Future studies should explore if and under what assumptions, about the task’s generative structure and/or learner’s priors and objectives, more complex Bayesian models (e.g., active inference (58)) might account for our empirical findings.”

      Abstract:

      The abstract is lacking in some detail about the experiments done, but this may be a limitation of the required word count. If word count is not an issue, I would recommend adding details of the experiments done and the results.

      We thank the reviewer for their valuable suggestion. We have now included more details about the experiment in the abstract:

      “In two experiments, participants completed a two-armed bandit task, where they repeatedly chose between two lotteries and received outcome-feedback from sources of varying credibility, who occasionally disseminated disinformation by lying about true choice outcome (e.g., reporting non reward when a reward was truly earned or vice versa).”

      One comment is that there is an appeal to normative learning patterns, but this suggests that learning patterns have a fixed optimal nature, which may not be true in cases where the purpose of the learning (e.g. to confirm the feeling of safety of being in an in-group) may not be about learning accurately to maximize reward. This can be accommodated in a Bayesian framework by modelling priors and desired outcomes. As such, the central premise that biased learning is inherently non-normative or non-Bayesian, I think, would require more justification. This is true in the introduction as well.

      Introduction:

      As noted above, the conceptualization of Bayesian learning being equivalent to normative learning, I think requires further justification. Bayesian belief updating can be biased and non-optimal from an observer perspective, while being optimal within the agent doing the updating if the priors/desired outcomes are set up to advantage these "non-optimal" modes of decision making.

      We appreciate the reviewer's thoughtful comment regarding the conceptualization of "normative" and "Bayesian" learning. We fully agree that the definition of "normative" is nuanced and can indeed depend on whether one considers reward-maximization or the underlying principles of belief updating. As explained above we now restrict our presentation to deviations from “ideal Bayes” learning patterns and we acknowledge the reviewer’s concern in a caveat in our discussion.

      Results:

      I wonder why the agent was presented before the choice, since the agent is only relevant to the feedback after the choice is made. I wonder if that might have induced any false association between the agent identity and the choice itself. This is by no means a critical point, but it would be interesting to get the authors' thoughts.

      We thank the reviewer for raising this interesting point regarding the presentation of the agent before the choice. Our decision to present the agent at this stage was intentional, as our original experimental design aimed to explore the possible effects of "expected source credibility" on participants' choices (e.g., whether knowledge of feedback credibility will affect choice speed and accuracy). However, we found nothing that would be interesting to report.

      The finding that positive feedback increases learning is one that has been shown before and depends on valence, as the authors note. They expanded their reinforcement learning model to include valence, but they did not modify the Bayesian model in a similar manner. This lack of a valence or recency effect might also explain the failure of the Bayesian models in the preceding section, where the contrast effect is discussed. It is not unreasonable to imagine that if humans do employ Bayesian reasoning that this reasoning system has had parameters tuned based on the real world, where recency of information does matter; affect has also been shown to be incorporable into Bayesian information processing (see the work by Hesp on affective charge and the large body of work by Ryan Smith). It may be that the Bayesian models chosen here require further complexity to capture the situation, just like some of the biases required updates to the RL models. This complexity, rather than being arbitrary, may be well justified by decision-making in the real world.

      Thanks for these additional important ideas which speak more to the notion that more complex Bayesian frameworks may account for biases we report.

      The methods mention several symptom scales- it would be interesting to have the results of these and any interesting correlations noted. It is possible that some of the individual variability here could be related to these symptoms, which could introduce precision parameter changes in a Bayesian context and things like reward sensitivity changes in an RL context.

      We included these questionnaires for exploratory purposes, with the aim of generating informed hypotheses for future research into individual differences in learning. Given the preliminary nature of these analyses, we believe further research is required about this important topic.

      Discussion:

      (For discussion, not a specific comment on this paper): One wonders also about participants' beliefs about the experiment or the intent of the experimenters. I have often had participants tell me they were trying to "figure out" a task or find patterns even when this was not part of the experiment. This is not specific to this paper, but it may be relevant in the future to try and model participant beliefs about the experiment especially in the context of disinformation, when they might be primed to try and "figure things out".

      We thank the reviewer for this important recommendation. We agree and this point is included in our caveat (cited above) that future research should address what assumptions about the generative task structure can allow Bayesian models to account for our empirical patterns.

      As a general comment, in the active inference literature, there has been discussion of state-dependent actions, or "habits", which are learned in order to help agents more rapidly make decisions, based on previous learning. It is also possible that what is being observed is that these habits are at play, and that they represent the cognitive biases. This is likely especially true given, as the authors note, the high cognitive load of the task. It is true that this would mean that full-force Bayesian inference is not being used in each trial, or in each experience an agent might have in the world, but this is likely adaptive on the longer timescale of things, considering resource requirements. I think in this case you could argue that we have a departure from "normative" learning, but that is not necessarily a departure from any possible Bayesian framework, since these biases could potentially be modified by the agent or eschewed in favor of more expensive full-on Bayesian learning when warranted. Indeed, in their discussion on the strategy of amplifying credible news sources to drown out low-credibility sources, the authors hint at the possibility of longer-term strategies that may produce optimal outcomes in some contexts, but which were not necessarily appropriate to this task. As such, the performance on this task- and the consideration of true departure from Bayesian processing- should be considered in this wider context.

      Another thing to consider is that Bayesian inference is occurring, but that priors present going in produce the biases, or these biases arise from another source, for example, factoring in epistemic value over rewards when the actual reward is not large. This again would be covered under an active inference approach, depending on how the priors are tuned. Indeed, given the benefit of social cohesion in an evolutionary perspective, some of these "biases" may be the result of adaptation. For example, it might be better to amplify people's good qualities and minimize their bad qualities in order to make it easier to interact with them; this entails a cost (in this case, not adequately learning from feedback and potentially losing out sometimes), but may fulfill a greater imperative (improved cooperation on things that matter). Given the right priors/desired states, this could still be a Bayes-optimal inference at a social level and, as such, may be ingrained as a habit that requires effort to break at the individual level during a task such as this.

      We thank the reviewer for these insightful suggestions speaking further to the point about more complex Bayesian models.

      The authors note that this task does not relate to "emotional engagement" or "deep, identity-related issues". While I agree that this is likely mostly true, it is also possible that just being told one is being lied to might elicit an emotional response that could bias responses, even if this is a weak response.

      We agree with the reviewer that a task involving performance-based bonuses, and particularly one where participants are explicitly told they are being lied to, might elicit a weak emotional response. However, our primary point is that the degree of these responses is expected to be substantially weaker than those typically observed in the broader disinformation literature, which frequently deals with highly salient political, social, or identity-related topics that inherently carry strong emotional and personal ties for participants, leading to much more pronounced affective engagement and potential biases. Our task deliberately avoids such issues, thus minimizing the potential for significant emotion-driven biases. We have toned down the discussion accordingly:

      “This occurs even when the decision at hand entails minimal emotional engagement or pertinence to deep, identity-related, issues.”

      Reviewer #2 (Public review):

      This valuable paper studies the problem of learning from feedback given by sources of varying credibility. The solid combination of experiment and computational modeling helps to pin down properties of learning, although some ambiguity remains in the interpretation of results.

      Summary:

      This paper studies the problem of learning from feedback given by sources of varying credibility. Two bandit-style experiments are conducted in which feedback is provided with uncertainty, but from known sources. Bayesian benchmarks are provided to assess normative facets of learning, and alternative credit assignment models are fit for comparison. Some aspects of normativity appear, in addition to deviations such as asymmetric updating from positive and negative outcomes.

      Strengths:

      The paper tackles an important topic, with a relatively clean cognitive perspective. The construction of the experiment enables the use of computational modeling. This helps to pinpoint quantitatively the properties of learning and formally evaluate their impact and importance. The analyses are generally sensible, and parameter recovery analyses help to provide some confidence in the model estimation and comparison.

      We thank the reviewer for highlighting the strengths of this work.

      Weaknesses:

      (1) The approach in the paper overlaps somewhat with various papers, such as Diaconescu et al. (2014) and Schulz et al. (forthcoming), which also consider the Bayesian problem of learning and applying source credibility, in terms of theory and experiment. The authors should discuss how these papers are complementary, to better provide an integrative picture for readers.

      Diaconescu, A. O., Mathys, C., Weber, L. A., Daunizeau, J., Kasper, L., Lomakina, E. I., ... & Stephan, K. E. (2014). Inferring the intentions of others by hierarchical Bayesian learning. PLoS computational biology, 10(9), e1003810.

      Schulz, L., Schulz, E., Bhui, R., & Dayan, P. Mechanisms of Mistrust: A Bayesian Account of Misinformation Learning. https://doi.org/10.31234/osf.io/8egxh

      We thank the reviewers for pointing us to this relevant work. We have updated the introduction, mentioning these precedents in the literature and highlighting our specific contributions:

      “To address these questions, we adopt a novel approach within the disinformation literature by exploiting a Reinforcement Learning (RL) experimental framework (36). While RL has guided disinformation research in recent years (37–41), our approach is novel in using one of its most popular tasks: the “bandit task”.”

      We also explain in the discussion how these papers relate to the current study:

      “Unlike previous studies wherein participants had to infer source credibility from experience (30,37,72), we took an explicit-instruction approach, allowing us to precisely assess source-credibility impact on learning, without confounding it with errors in learning about the sources themselves. More broadly, our work connects with prior research on observational learning, which examined how individuals learn from the actions or advice of social partners (72–75). This body of work has demonstrated that individuals integrate learning from their private experiences with learning based on others’ actions or advice—whether by inferring the value others attribute to different options or by mimicking their behavior (57,76). However, our task differs significantly from traditional observational learning. Firstly, our feedback agents interpret outcomes rather than demonstrating or recommending actions (30,37,72).”

      (2) It isn't completely clear what the "cross-fitting" procedure accomplishes. Can this be discussed further?

      We thank the reviewer for requesting further clarification on the cross-fitting procedure. Our study utilizes two distinct model families: Bayesian models and CA models. The credit assignment parameters from the CA models can be treated as “data/behavioural features” corresponding to how choice feedback affects choice propensities. The cross-fitting approach allows us, in effect, to examine whether these propensity features are predicted by our Bayesian models. To the extent they are not, we can conclude that empirical behavior is “biased”.

      Thus, in our cross-fitting procedure we compare the CA model parameters extracted from participant data (empirical features) with those that would be expected if our Bayesian agents performed the task. Specifically, we first fit participant behavior with our Bayesian models, then simulate this model using the best-fitted parameters and fit those simulations with our CA models. This generates a set of CA parameters that would be predicted if participants' behavior were reduced to a Bayesian account. By comparing these predicted Bayesian CA parameters with the actual CA parameters obtained from human participants, the cross-fitting procedure allows us to quantitatively demonstrate that the observed participant parameters are indeed statistically significant deviations from normative Bayesian processing. This provides a robust validation that the biases we identify are not artifacts of the CA model's structure but true departures from normative learning.

      We also note that Reviewer 3 suggested an intuitive way to think about the CA parameters: as analogous to logistic regression coefficients in a “sophisticated regression” of choice on (recency-weighted) choice-feedback. We find this suggestion potentially helpful for readers. Under this interpretation, the purpose of the cross-fitting method can be seen simply as estimating the regression coefficients that would be predicted by our Bayesian agents, and comparing those to the empirical coefficients.

      In our manuscript we now explain this issue more clearly by describing how our model is analogous to a logistic regression:

      “The probability to choose a bandit (say A over B) in this family of models is a logistic function of the contrast in choice-propensities between these two bandits. One interpretation of this model is as a “sophisticated” logistic regression, where the CA parameters take the role of “regression coefficients” corresponding to the change in log odds of repeating the just-taken action in future trials based on the feedback (+/- CA for positive or negative feedback, respectively; the model also includes gradual perseveration which allows for constant log-odd changes that are not affected by choice feedback). The forgetting rate captures the extent to which the effect of each trial on future choices diminishes with time. The Q-values are thus exponentially decaying sums of logistic choice propensities based on the types of feedback a bandit received.”
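
      The logistic-regression reading of the CA parameters can be illustrated with a minimal sketch. It assumes a simplified update of the form Q ← (1 – f_Q) ∗ Q + CA ∗ F with feedback F coded as +1 or -1, and it omits the perseveration term; the function names and the numerical values are hypothetical and are not taken from the manuscript.

```python
import math

def choice_probability(propensity_a, propensity_b):
    """Probability of choosing bandit A: a logistic function of the propensity contrast."""
    return 1.0 / (1.0 + math.exp(-(propensity_a - propensity_b)))

def log_odds(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def update_propensity(q, ca, feedback, forgetting):
    """Assumed simplified CA update: decay the old value, then add CA * feedback."""
    return (1.0 - forgetting) * q + ca * feedback

# Illustrative values only (not fitted parameters).
q_a, q_b = 0.0, 0.0
ca = 0.8
q_a_new = update_propensity(q_a, ca, feedback=+1, forgetting=0.1)

# Change in the log odds of choosing A after one positive feedback on A.
shift = log_odds(choice_probability(q_a_new, q_b)) - log_odds(choice_probability(q_a, q_b))
```

      Starting from equal propensities, a single positive feedback shifts the log odds of repeating the choice by the CA value itself, which is the sense in which CA behaves like a regression coefficient in this simplified setting.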

      We also explain our cross-fitting procedure in more detail:

      “To further characterise deviations between behaviour and our Bayesian learning models, we used a “cross-fitting” method. Treating CA parameters as data-features of interest (i.e., feedback dependent changes in choice propensity), our goal was to examine if and how empirical features differ from features extracted from simulations of our Bayesian learning models. Towards that goal, we simulated synthetic data based on Bayesian agents (using participants’ best fitting parameters), but fitted these data using the CA-models, obtaining what we term “Bayesian-CA parameters” (Fig. 2d; Methods). A comparison of these Bayesian-CA parameters with empirical-CA parameters, obtained by fitting CA models to empirical data, allowed us to uncover patterns consistent with, or deviating from, ideal-Bayesian value-based inference. Under the sophisticated logistic-regression interpretation of the CA-model family, the cross-fitting method comprises a comparison between empirical regression coefficients (i.e., empirical CA parameters) and regression coefficients based on simulations of Bayesian models (Bayesian-CA parameters).”
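
      The order of operations in the cross-fitting procedure can be sketched as a small pipeline. This is only a skeleton under stated assumptions: `fit_bayes`, `simulate_bayes` and `fit_ca` are hypothetical callables standing in for the actual model-fitting and simulation routines, the CA "parameter" is reduced to a single number per participant, and the toy stand-ins at the bottom exist only to make the example run.

```python
import random
from statistics import mean

def cross_fit(datasets, fit_bayes, simulate_bayes, fit_ca, n_sims=10, seed=0):
    """Return (empirical_ca, bayesian_ca), one entry per participant dataset."""
    rng = random.Random(seed)
    empirical_ca, bayesian_ca = [], []
    for data in datasets:
        # Empirical feature: CA parameter fitted directly to the participant's data.
        empirical_ca.append(fit_ca(data))
        # Bayesian-CA feature: fit the Bayesian model, simulate it with the
        # best-fitting parameters, then refit the simulations with the CA model.
        theta = fit_bayes(data)
        sims = [simulate_bayes(theta, len(data), rng) for _ in range(n_sims)]
        bayesian_ca.append(mean(fit_ca(sim) for sim in sims))
    return empirical_ca, bayesian_ca

# Toy stand-ins (hypothetical): each dataset is a list of 0/1 choice-repetition
# flags and both "models" reduce to a repetition rate. Real fits would use
# maximum likelihood on the full trial sequence.
def toy_fit(data):
    return mean(data)

def toy_simulate(theta, n_trials, rng):
    return [1 if rng.random() < theta else 0 for _ in range(n_trials)]

datasets = [[1, 1, 0, 1, 0, 1, 1, 1], [0, 1, 0, 0, 1, 0, 1, 0]]
empirical, bayesian = cross_fit(datasets, toy_fit, toy_simulate, toy_fit)
deviations = [e - b for e, b in zip(empirical, bayesian)]
```

      In this sketch, `deviations` plays the role of the per-participant difference between empirical-CA and Bayesian-CA parameters, which is the quantity a signed-rank test would then be applied to.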

      (3) The Credibility-CA model seems to fit the same as the free-credibility Bayesian model in the first experiment and barely better in the second experiment. Why not use a more standard model comparison metric like the Bayesian Information Criterion (BIC)? Even if there are advantages to the bootstrap method (which should be described if so), the BIC would help for comparability between papers.

      We thank the reviewer for this important comment regarding our model comparison approach. We acknowledge that classical information criteria like AIC and BIC are widely used in RL studies. However, we argue our method for model-comparison is superior.

      We conducted a model recovery analysis demonstrating a significant limitation of using AIC or BIC for model-comparison in our data. Both these methods are strongly biased in favor of the Bayesian models. Our PBCM method, on the other hand, is both unbiased and more accurate. We believe this is because “off the shelf” methods like AIC and BIC rely on strong assumptions (such as asymptotic sample size and trial-independence) that are not necessarily met in our tasks (Data is finite; Trials in RL tasks depend on previous trials). PBCM avoids such assumptions to obtain comparison criteria specifically tailored to the structure and size of our empirical data. We have now mentioned this fact in the results section of the main text:

      “We considered using AIC and BIC, which apply “off-the-shelf” penalties for model-complexity. However, these methods do not adapt to features like finite sample size (relying instead on asymptotic assumptions) or temporal dependence (as is common in reinforcement learning experiments). In contrast, the parametric bootstrap cross-fitting method replaces these fixed penalties with empirical, data-driven criteria for model-selection. Indeed, model-recovery simulations confirmed that whereas AIC and BIC were heavily biased in favour of the Bayesian models, the bootstrap method provided excellent model-recovery (See Fig. S20).”
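
      For reference, the fixed complexity penalties referred to in the passage above take the following standard forms. The symbols are introduced here for illustration and are not taken from the manuscript: k is the number of free parameters, n the number of observations, and \hat{L} the maximized likelihood.

```latex
\begin{align}
\mathrm{AIC} &= 2k - 2\ln\hat{L},\\
\mathrm{BIC} &= k\ln n - 2\ln\hat{L}.
\end{align}
```

      Neither penalty depends on the temporal structure of the trials, which is one way to see why a bootstrap-derived criterion might behave differently in sequential learning tasks.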

      We have also included such model recovery in the SI document:

      (4) As suggested in the discussion, the updating based on random feedback could be due to the interleaving of trials. If one is used to learning from the source on most trials, the occasional random trial may be hard to resist updating from. The exact interleaving structure should also be clarified (I assume different sources were shown for each bandit pair). This would also relate to work on RL and working memory: Collins, A. G., & Frank, M. J. (2012). How much of reinforcement learning is working memory, not reinforcement learning? A behavioral, computational, and neurogenetic analysis. European Journal of Neuroscience, 35(7), 1024–1035.

      We thank the reviewer for this point. The specific interleaved structure of the agents is described in the main text:

      “Each agent provided feedback for 5 trials for each bandit pair (with the agent order interleaved within the bandit pair).”

      As well as in the methods section:

      “Feedback agents were randomly interleaved across trials subject to the constraint that each agent appeared on 5-trials for each bandit pair.”
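
      The constraint described in the two passages above can be sketched as follows. This is an illustrative assumption about how such a schedule could be generated, not the authors' code: the function name, the integer labels for pairs and agents, and the choice to shuffle bandit pairs together with agents are all hypothetical.

```python
import random
from collections import Counter

def make_schedule(n_pairs=3, agents=(1, 2, 3), trials_per_agent=5, seed=None):
    """Return a shuffled list of (bandit_pair, agent) trials.

    Every agent appears exactly `trials_per_agent` times for every bandit pair;
    the order of trials is otherwise random.
    """
    rng = random.Random(seed)
    trials = [
        (pair, agent)
        for pair in range(n_pairs)
        for agent in agents
        for _ in range(trials_per_agent)
    ]
    rng.shuffle(trials)
    return trials

schedule = make_schedule(seed=1)
counts = Counter(schedule)  # number of trials per (pair, agent) combination
```

      Building the full list first and shuffling afterwards guarantees the per-pair counts by construction, so no rejection sampling is needed.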

      We also thank the reviewer for mentioning the relevant work on working memory. We have now added it to our discussion point:

      “In our main study, we show that participants revised their beliefs based on entirely non-credible feedback, whereas an ideal Bayesian strategy dictates such feedback should be ignored. This finding resonates with the “continued-influence effect” whereby misleading information continues to influence an individual's beliefs even after it has been retracted (59,60). One possible explanation is that some participants failed to infer that feedback from the 1-star agent was statistically void of information content, essentially random (e.g., the group-level credibility of this agent was estimated by our free-credibility Bayesian model as higher than 50%). Participants were instructed that this feedback would be “a lie” 50% of the time but were not explicitly told that this meant it was random and should therefore be disregarded. Notably, however, there was no corresponding evidence random feedback affected behaviour in our discovery study. It is possible that an individual’s ability to filter out random information might have been limited due to a high cognitive load induced by our main study task, which required participants to track the values of three bandit pairs and juggle between three interleaved feedback agents (whereas in our discovery study each experimental block featured a single bandit pair). Future studies should explore more systematically how the ability to filter random feedback depends on cognitive load (61).”

      (5) Why does the choice-repetition regression include "only trials for which the last same-pair trial featured the 3-star agent and in which the context trial featured a different bandit pair"? This could be stated more plainly.

      We thank the reviewer for this question. When we previously submitted our manuscript, we thought that finding enhanced credit-assignment for fully credible feedback following potential disinformation from a different context would constitute a striking demonstration of our “contrast effect”. However, upon reexamining this finding we discovered a coding error (affecting how trials were filtered). We have now corrected and rerun this analysis. We have assessed the contrast effect for both "same-context" trials (where the contextual trial featured the same bandit pair as the learning trial) and "different-context" trials (where the contextual trial featured a different bandit pair). Our re-analysis reveals a selective significant contrast effect in the same-context condition, but no significant effect in the different-context condition. We have updated the main text to reflect these corrected findings and provide a clearer explanation of the analysis:

      “A comparison of empirical and Bayesian credit-assignment parameters revealed a further deviation from ideal Bayesian learning: participants showed an exaggerated credit-assignment for the 3-star agent compared with Bayesian models [Wilcoxon signed-rank test, instructed-credibility Bayesian model (median difference=0.74, z=11.14); free-credibility Bayesian model (median difference=0.62, z=10.71), all p’s<0.001] (Fig. 3a). One explanation for enhanced learning for the 3-star agent is a contrast effect, whereby credible information looms larger against a backdrop of non-credible information. To test this hypothesis, we examined whether the impact of feedback from the 3-star agent is modulated by the credibility of the agent in the trial immediately preceding it. More specifically, we reasoned that the impact of a 3-star agent would be amplified by a “low credibility context” (i.e., when it is preceded by a low credibility trial). In a binomial mixed effects model, we regressed choice-repetition on feedback valence from the last trial featuring the same bandit pair (i.e., the learning trial) and the feedback agent on the trial immediately preceding that last trial (i.e., the contextual credibility; see Methods for model-specification). This analysis included only learning trials featuring the 3-star agent, and context trials featuring the same bandit pair as the learning trial (Fig. 4a). We found that feedback valence interacted with contextual credibility (F(2,2086)=11.47, p<0.001) such that the feedback-effect (from the 3-star agent) decreased as a function of the preceding context-credibility (3-star context vs. 2-star context: b=-0.29, F(1,2086)=4.06, p=0.044; 2-star context vs. 1-star context: b=-0.41, t(2086)=-2.94, p=0.003; and 3-star context vs. 1-star context: b=-0.69, t(2086)=-4.74, p<0.001) (Fig. 4b). This contrast effect was not predicted by simulations of our main models of interest (Fig. 4c).
No effect was found when focussing on contextual trials featuring a bandit pair different than the one in the learning trial (see SI 3.5). Thus, these results support an interpretation that credible feedback exerts a greater impact on participants’ learning when it follows non-credible feedback, in the same learning context.”

      We have modified the discussion accordingly as well:

      “A striking finding in our study was that for a fully credible feedback agent, credit assignment was exaggerated (i.e., higher than predicted by our Bayesian models). Furthermore, the effect of fully credible feedback on choice was further boosted when it was preceded by a low-credibility context related to current learning. We interpret this in terms of a “contrast effect”, whereby veridical information looms larger against a backdrop of disinformation (21). One upshot is that exaggerated learning might entail a risk of jumping to premature conclusions based on limited credible evidence (e.g., a strong conclusion that a vaccine produces significant side-effect risks based on weak credible information, following non-credible information about the same vaccine). An intriguing possibility, that could be tested in future studies, is that participants strategically amplify the extent of learning from credible feedback to dilute the impact of learning from non-credible feedback. For example, a person scrolling through a social media feed, encountering copious amounts of disinformation, might amplify the weight they assign to credible feedback in order to dilute effects of ‘fake news’. Ironically, these results also suggest that public campaigns might be more effective when embedding their messages in low-credibility contexts, which may boost their impact.”

      And we have included some additional analyses in the SI document:

      “3.5 Contrast effects for contexts featuring a different bandit

      Given that we observed a contrast effect when both the learning trial and the immediately preceding "context trial” involved the same pair of bandits, we next investigated whether this effect persisted when the context trial featured a different bandit pair – a situation where the context would be irrelevant to the current learning. Again, we used a binomial mixed effects model, regressing choice-repetition on feedback valence in the learning trial and the feedback agent in the context trial. This analysis included only learning trials featuring the 3-star agent, and context trials featuring a different bandit pair than the learning trial (Fig. S22a). We found no significant evidence of an interaction between feedback valence and contextual credibility (F(2,2364)=0.21, p=0.81) (Fig. S22b). This null result was consistent with the range of outcomes predicted by our main computational models (Fig. S22c).

      We aimed to formally compare the influence of two types of contextual trials: those featuring the same bandit pair as the learning trial versus those featuring a different pair. To achieve this, we extended our mixed-effects model by incorporating a new predictor variable, "CONTEXT_TYPE", which coded whether the contextual trial involved the same bandit pair (coded as -0.5) or a different bandit pair (+0.5) compared to the learning trial. The Wilkinson notation for this expanded mixed-effects model is:

      𝑅𝐸𝑃𝐸𝐴𝑇 ~ 𝐶𝑂𝑁𝑇𝐸𝑋𝑇_𝑇𝑌𝑃𝐸 ∗ 𝐹𝐸𝐸𝐷𝐵𝐴𝐶𝐾 ∗ (𝐶𝑂𝑁𝑇𝐸𝑋𝑇<sub>2-star</sub> + 𝐶𝑂𝑁𝑇𝐸𝑋𝑇<sub>3-star</sub>) + 𝐵𝐸𝑇𝑇𝐸𝑅 + (1|𝑝𝑎𝑟𝑡𝑖𝑐𝑖𝑝𝑎𝑛𝑡)

      This expanded model revealed a significant three-way interaction between feedback valence, contextual credibility, and context type (F(2,4451) = 7.71, p<0.001). Interpreting this interaction, we found a 2-way interaction between context-source and feedback valence when the context was the same (F(2,4451) = 12.03, p<0.001), but not when context was different (F(2,4451) = 0.23, p = 0.79). Further interpreting the double feedback-valence * context-source interaction (for the same context) we obtained the same conclusions as reported in the main text.”

      (6) Why apply the "Truth-CA" model and not the Bayesian variant that it was motivated by?

      Thanks for this very useful suggestion. We are unsure if we fully understand the question. The Truth-CA model was not motivated by a new Bayesian model. Our Bayesian models were simply used to make the point that participants may partially discriminate between truthful and untruthful feedback (for a given source). This led to the idea that perhaps more credit is assigned for truth (than lie) trials, which is what we found using our Truth-CA model. Note we show that our Bayesian models cannot account for this modulation.

      We have now improved our "Truth-CA" model. Previously, our Truth-CA model considered whether feedback on each trial was true or not based on realized latent true outcomes. However, it is possible that the very same feedback would have had an opposite truth-status if the latent true outcome was different (recall true outcomes are stochastic). This injects noise into the trial classification in our previous model. To avoid this, in our new model feedback is modulated by the probability the reported feedback is true (marginalized over stochasticity of true outcome).

      We have described this new model in the methods section:

      “Additionally, we formulated a “Truth-CA” model, which worked as our Credibility-CA model, but incorporated a free truth-bonus parameter (TB). This parameter modulates the extent of credit assignment for each agent based on the posterior probability of feedback being true (given the credibility of the feedback agent, and the true reward probability of the chosen bandit). The chosen bandit was updated as follows:

      𝑄 ← (1 – 𝑓<sub>Q</sub>) ∗ 𝑄 + [𝐶𝐴(𝑎𝑔𝑒𝑛𝑡) + 𝑇𝐵 ∗ (𝑃(𝑡𝑟𝑢𝑡ℎ) − 0.5)] ∗ 𝐹

      where P(truth) is the posterior probability of the feedback being true in the current trial (for exact calculation of P(truth) see “Methods: Bayesian estimation of posterior belief that feedback is true”).”
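
      The update rule above can be sketched in code. The `truth_ca_update` function transcribes the stated equation directly. The `p_truth` function is an assumption made for illustration: it applies Bayes' rule under the premise that an agent with credibility c reports the true binary outcome with probability c and the opposite outcome otherwise; the exact calculation is the one given in the Methods section referred to above, which may differ. All names and values are hypothetical.

```python
def p_truth(feedback, credibility, reward_prob):
    """Assumed posterior probability that the reported feedback is true.

    feedback: +1 (reward reported) or -1 (no reward reported).
    credibility: probability that the agent reports the true outcome.
    reward_prob: true reward probability of the chosen bandit.
    Assumes the denominator is non-zero (i.e., the report was possible).
    """
    # Prior probability that the latent true outcome matches the report.
    p_match = reward_prob if feedback > 0 else 1.0 - reward_prob
    truthful = credibility * p_match                # agent truthful, outcome matches
    lying = (1.0 - credibility) * (1.0 - p_match)   # agent lying, outcome is opposite
    return truthful / (truthful + lying)

def truth_ca_update(q, feedback, ca_agent, tb, p_true, forgetting):
    """Q <- (1 - f_Q) * Q + [CA(agent) + TB * (P(truth) - 0.5)] * F."""
    return (1.0 - forgetting) * q + (ca_agent + tb * (p_true - 0.5)) * feedback

# Illustrative values only (not fitted parameters).
p = p_truth(feedback=+1, credibility=0.75, reward_prob=0.5)
q_new = truth_ca_update(q=0.0, feedback=+1, ca_agent=0.5, tb=0.2, p_true=p, forgetting=0.1)
```

      Under this assumed observation model, feedback from a 50%-credibility agent has P(truth) equal to the prior probability that the reported outcome occurred, so the truth bonus for such an agent depends only on the bandit's reward probability.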

      All relevant results have been updated accordingly in the main text:

      “To formally address whether feedback truthfulness modulates credit assignment, we fitted a new variant of the CA model (the “Truth-CA” model) to the data. This variant works as our Credibility-CA model but incorporates a truth-bonus parameter (TB) which increases the degree of credit assignment for feedback as a function of the experimenter-determined likelihood the feedback is true (which is read from the curves in Fig. 6a when x is taken to be the true probability the bandit is rewarding). Specifically, after receiving feedback, the Q-value of the chosen option is updated according to the following rule: 𝑄 ← (1 – 𝑓<sub>Q</sub>) ∗ 𝑄 + [𝐶𝐴(𝑎𝑔𝑒𝑛𝑡) + 𝑇𝐵 ∗ (𝑃(𝑡𝑟𝑢𝑡ℎ) − 0.5)] ∗ 𝐹 where 𝑇𝐵 is the free parameter representing the truth bonus, and 𝑃(𝑡𝑟𝑢𝑡ℎ) is the probability of the received feedback being true (from the experimenter’s perspective). We acknowledge that this model falls short of providing a mechanistically plausible description of the credit assignment process, because participants have no access to the experimenter’s truthfulness likelihoods (as the true bandit reward probabilities are unknown to them). Nonetheless, we use this ‘oracle model’ as a measurement tool to glean rough estimates for the extent to which credit assignment is boosted as a function of its truthfulness likelihood. Fitting this Truth-CA model to participants' behaviour revealed a significant positive truth-bonus (mean=0.21, t(203)=3.12, p=0.002), suggesting that participants indeed assign greater weight to feedback that is likely to be true (Fig. 6c; see SI 3.3.1 for detailed ML parameter results). Notably, simulations using our other models (Methods) consistently predicted smaller truth biases (compared to the empirical bias) (Fig. 6d). Moreover, truth bias was still detected even in a more flexible model that allowed for both a positivity bias and truth-bias (see SI 3.7).
The upshot is that participants are biased to assign higher credit based on feedback that is more likely to be true in a manner that is inconsistent with our Bayesian models and above and beyond the previously identified positivity biases.”

      Finally, the Supplementary Information for the discovery study has also been revised to feature this analysis:

      “We next assessed whether participants infer whether the feedback they received on each trial was true or false and adjust their credit assignment based on this inference. We again used the “Truth-CA” model to obtain estimates for the truth bonus (TB), the increase in credit assignment as a function of the posterior probability of feedback being true. As in our main study, the fitted truth bias parameter was significantly positive, indicating that participants assign greater weight to feedback they believe is likely to be true (Fig. S4a; see SI 3.3.1 for detailed ML parameter results). Strikingly, model-simulations (Methods) predicted a lower truth bonus than the one observed in participants (Fig. S4b).”

      (7) "Overall, the results from this study support the exact same conclusions (See SI section 1.2) but with one difference. In the discovery study, we found no evidence for learning based on 50%-credibility feedback when examining either the feedback effect on choice repetition or CA in the credibility-CA model (SI 1.2.3)" - this seems like a very salient difference, when the paper reports the feedback effect as a primary finding of interest, though I understand there remains a valence-based difference.

      We agree with the reviewer and thank them for this suggestion. We now state explicitly throughout the manuscript that this finding was obtained in only one of our two studies. In the “Discovery study” section of the results, we state explicitly that this effect was not observed in the discovery study:

      “However, we found no evidence for learning based on 50%-credibility feedback when examining either the feedback effect on choice repetition or CA in the credibility-CA model (SI 1.2.3).”

      We also note that related to another concern from R3 (that perseveration may masquerade as positivity bias) we conducted additional analyses (detailed in SI 3.6.2). These analyses revealed that the observed positivity bias for the 1-star agent in the discovery study falls within the range predicted by simple choice-perseveration. Consequently, we have removed the suggestion that participants still learn from the random agent in the discovery study. Furthermore, we have modified the discussion section to include a possible explanation for this discrepancy between the two studies:

      “Notably, however, there was no corresponding evidence random feedback affected behaviour in our discovery study. It is possible that an individual’s ability to filter out random information might have been limited due to a high cognitive load induced by our main study task, which required participants to track the values of three bandit pairs and juggle between three interleaved feedback agents (whereas in our discovery study each experimental block featured a single bandit pair). Future studies should explore more systematically how the ability to filter random feedback depends on cognitive load (61).”

      (8) "Participants were instructed that this feedback would be 'a lie' 50% of the time but were not explicitly told that this meant it was random and should therefore be disregarded." - I agree that this is a possible explanation for updating from the random source. It is a meaningful caveat.

      Thank you for this thought. While this can be seen as a caveat—since we don’t know what would have happened with explicit instructions—we also believe it is interesting from another perspective. In many real-life situations, individuals may have all the necessary information to infer that the feedback they receive is uninformative, yet still fail to do so, especially when they are not explicitly told to ignore it.

      In future work, we plan to examine how behaviour changes when participants are given more explicit instructions—for example, that the 50%-credibility agent provides purely random feedback.

      (9) "Future studies should investigate conditions that enhance an ability to discard disinformation, such as providing explicit instructions to ignore misleading feedback, manipulations that increase the time available for evaluating information, or interventions that strengthen source memory." - there is work on some of this in the misinformation literature that should be cited, such as the "continued influence effect". For example: Johnson, H. M., & Seifert, C. M. (1994). Sources of the continued influence effect: When misinformation in memory affects later inferences. Journal of experimental psychology: Learning, memory, and cognition, 20(6), 1420.

      We thank the reviewer for pointing us towards the relevant literature. We have now included citations about the “continued influence effect” of misinformation in the discussion:

      “In our main study, we show that participants revised their beliefs based on entirely non-credible feedback, whereas an ideal Bayesian strategy dictates such feedback should be ignored. This finding resonates with the “continued-influence effect” whereby misleading information continues to influence an individual's beliefs even after it has been retracted (59,60).”

      (10) Are the authors arguing that choice-confirmation bias may be at play? Work on choice-confirmation bias generally includes counterfactual feedback, which is not present here.

      We agree with the reviewer that a definitive test for choice-confirmation bias typically requires counterfactual feedback, which is not present in our current task. In our discussion, we indeed suggest that the positivity bias we observe may stem from a form of choice-confirmation, drawing on the extensive literature on this bias in reinforcement learning (Lefebvre et al., 2017; Palminteri et al., 2017; Palminteri & Lebreton, 2022). However, we fully acknowledge that this link is a hypothesis and that explicitly testing for choice-confirmation bias would necessitate a future study specifically incorporating counterfactual feedback. We have included a clarification of this point in the discussion:

      “Previous reinforcement learning studies report greater credit-assignment based on positive compared to negative feedback, albeit only in the context of veridical feedback (43,44,62). Here, supporting our a-priori hypothesis, we show that this positivity bias is amplified for information of low and intermediate credibility (in absolute terms in the discovery study, and relative to the overall extent of CA in both studies). Of note, previous literature has interpreted enhanced learning for positive outcomes in reinforcement learning as indicative of a confirmation bias (42,44). For example, positive feedback may confirm, to a greater extent than negative feedback, one’s choice as superior (e.g., “I chose the better of the two options”). Leveraging the framework of motivated cognition (35), we posited that feedback of uncertain veracity (e.g., low credibility) amplifies this bias by incentivising individuals to self-servingly accept positive feedback as true (because it confers positive, desirable outcomes), and explain away undesirable, choice-disconfirming, negative feedback as false. This could imply an amplified confirmation bias on social media, where content from sources of uncertain credibility, such as unknown or unverified users, is more easily interpreted in a self-serving manner, disproportionately reinforcing existing beliefs (63). In turn, this could contribute to an exacerbation of the negative social outcomes previously linked to confirmation bias such as polarization (64,65), the formation of ‘echo chambers’ (19), and the persistence of misbelief regarding contemporary issues of importance such as vaccination (66,67) and climate change (68–71). We note however, that further studies are required to determine whether positivity bias in our task is indeed a form of confirmation bias.”

      Reviewer #3 (Public review):

      Summary

      This paper investigates how disinformation affects reward learning processes in the context of a two-armed bandit task, where feedback is provided by agents with varying reliability (with lying probability explicitly instructed). They find that people learn more from credible sources, but also deviate systematically from optimal Bayesian learning: They learned from uninformative random feedback, learned more from positive feedback, and updated too quickly from fully credible feedback (especially following low-credibility feedback). Overall, this study highlights how misinformation could distort basic reward learning processes, without appeal to higher-order social constructs like identity.

      Strengths

      (1) The experimental design is simple and well-controlled; in particular, it isolates basic learning processes by abstracting away from social context.

      (2) Modeling and statistics meet or exceed the standards of rigor.

      (3) Limitations are acknowledged where appropriate, especially those regarding external validity.

      (4) The comparison model, Bayes with biased credibility estimates, is strong; deviations are much more compelling than e.g., a purely optimal model.

      (5) The conclusions are interesting, in particular the finding that positivity bias is stronger when learning from less reliable feedback (although I am somewhat uncertain about the validity of this conclusion)

      We deeply thank the reviewer for highlighting the strengths of this work.

      Weaknesses

      (1) Absolute or relative positivity bias?

      In my view, the biggest weakness in the paper is that the conclusion of greater positivity bias for lower credible feedback (Figure 5) hinges on the specific way in which positivity bias is defined. Specifically, we only see the effect when normalizing the difference in sensitivity to positive vs. negative feedback by the sum. I appreciate that the authors present both and add the caveat whenever they mention the conclusion (with the crucial exception of the abstract). However, what we really need here is an argument that the relative definition is the right way to define asymmetry....

      Unfortunately, my intuition is that the absolute difference is a better measure. I understand that the relative version is common in the RL literature; however, previous studies have used standard TD models, whereas the current model updates based on the raw reward. The role of the CA parameter is thus importantly different from a traditional learning rate - in particular, it's more like a logistic regression coefficient (as described below) because it scales the feedback but not the decay. Under this interpretation, a difference in positivity bias across credibility conditions corresponds to a three-way interaction between the exponentially weighted sum of previous feedback of a given type (e.g., positive from the 75% credible agent), feedback positivity, and condition (dummy coded). This interaction corresponds to the non-normalized, absolute difference.

      Importantly, I'm not terribly confident in this argument, but it does suggest that we need a compelling argument for the relative definition.

      We thank the reviewer for raising this important point about the definition of positivity bias, and for their thoughtful discussion on the absolute versus relative measures. We believe that the relative valence bias offers a distinct and valuable perspective on positivity bias. Conceptually, this measure describes positivity bias in a manner akin to a “percentage difference” relative to the overall level of learning, which allows us to control for the overall decrease in credit assignment as feedback becomes less credible. We are unsure whether one measure is better or more correct than the other, and we believe that reporting both enriches the understanding of positivity bias and allows for a more comprehensive characterization of this phenomenon (as long as these measures are interpreted carefully). We have stated the significance of the relative measure in the results section:

      “Following previous research, we quantified positivity bias in 2 ways: 1) as the absolute difference between credit-assignment based on positive or negative feedback, and 2) as the same difference but relative to the overall extent of learning. We note that the second, relative, definition is more akin to “percentage change” measurements, providing a control for the overall lower levels of credit-assignment for less credible agents.”

      We also wish to point out that in our discovery study we had some evidence for amplification of positivity bias in an absolute sense.
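
      The difference between the two definitions can be made concrete with a small numerical sketch. It assumes, following the reviewer's description, that the relative measure normalizes the difference by the sum of the two CA parameters; the exact normalization used in the manuscript may differ, and the function names and CA values below are hypothetical.

```python
def absolute_bias(ca_pos, ca_neg):
    """Absolute positivity bias: difference between CA for positive and negative feedback."""
    return ca_pos - ca_neg

def relative_bias(ca_pos, ca_neg):
    """Relative positivity bias: the same difference scaled by the overall extent of learning.

    Assumes ca_pos + ca_neg is non-zero.
    """
    return (ca_pos - ca_neg) / (ca_pos + ca_neg)

# Illustrative values only: (CA+, CA-) for a high- and a low-credibility agent.
high_cred = (1.0, 0.8)
low_cred = (0.3, 0.1)

abs_high, abs_low = absolute_bias(*high_cred), absolute_bias(*low_cred)
rel_high, rel_low = relative_bias(*high_cred), relative_bias(*low_cred)
```

      In this made-up example the absolute bias is the same for both agents, whereas the relative bias is larger for the low-credibility agent because overall credit assignment is smaller, which illustrates how the two measures can lead to different conclusions.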

      (2) Positivity bias or perseveration?

      A key challenge in interpreting many of the results is dissociating perseveration from other learning biases. In particular, a positivity bias (Figure 5) and perseveration will both predict a stronger correlation between positive feedback and future choice. Crucially, the authors do include a perseveration term, so one would hope that perseveration effects have been controlled for and that the CA parameters reflect true positivity biases. However, with finite data, we cannot be sure that the variance will be correctly allocated to each parameter (cf. collinearity in regressions). The fact that CA- is fit to be negative for many participants (a pattern shown more strongly in the discovery study) suggests that this might be happening. A priori, the idea that you would ever increase your value estimate after negative feedback is highly implausible, which suggests that the parameter might be capturing variance beyond what it is intended to capture.

      The best way to resolve this uncertainty would involve running a new study in which feedback was sometimes provided in the absence of a choice - this would isolate positivity bias. Short of that, perhaps one could fit a version of the Bayesian model that also includes perseveration. If the authors can show that this model cannot capture the pattern in Figure 5, that would be fairly convincing.

      We thank the reviewer for this very insightful and crucial point regarding the potential confound between positivity bias and perseveration. We entirely agree that distinguishing these effects can be challenging. To rigorously address this concern and ascertain that our observed positivity bias, particularly its inflation for low-credibility feedback, is not merely an artifact of perseveration, we conducted additional analyses as suggested.

      First, following the reviewer’s suggestion we simulated our Bayesian models, including a perseveration term, for both our main and discovery studies. Crucially, none of these simulations predicted the specific pattern of inflated positivity bias for low-credibility feedback that we identified in participants.

      Additionally, taking a “devil’s advocate” approach, we tested whether our Credibility-CA model (which includes perseveration but not a feedback valence bias) can predict our positivity bias findings. Thus, we simulated 100 datasets using our Credibility-CA model (based on empirical best-fitting parameters). We then fitted each of these simulated datasets using our Credibility-Valence CA model. By examining the distribution of results across the fits to these synthetic datasets and comparing them to the actual results from participants, we found that while perseveration could indeed lead (as the reviewer suspected) to an artifactual positivity bias, it could not predict the magnitude of the observed inflation of positivity bias for low-credibility feedback (whether measured in absolute or relative terms).

      Based on these comprehensive analyses, we are confident that our main results concerning the modulation of a valence bias as a function of source-credibility cannot be accounted for by simple choice-perseveration. We have briefly explained these analyses in the main results section:

      “Previous research has suggested that positivity bias may spuriously arise from pure choice-perseveration (i.e., a tendency to repeat previous choices regardless of outcome) (49,50). While our models included a perseveration-component, this control may not be perfect. Therefore, in additional control analyses, we generated synthetic datasets using models including choice-perseveration but devoid of feedback-valence bias, and fitted them with our credibility-valence model (see SI 3.6.1). These analyses confirmed that perseveration can masquerade as an apparent positivity bias. Critically, however, these analyses also confirmed that perseveration cannot account for our main finding of increased positivity bias, relative to the overall extent of CA, for low-credibility feedback.”

      Additionally, we have added a detailed description of these additional analyses and their findings to the Supplementary Information document:

      “3.6 Positivity bias results cannot be explained by a pure perseveration

      3.6.1 Main study

      Previous research has suggested it may be challenging to dissociate between a feedback-valence positivity bias and perseveration (i.e., a tendency to repeat previous choices regardless of outcome). While our Credit Assignment (CA) models already include a perseveration mechanism to account for this, this control may not be perfect. We thus conducted several tests to examine if our positivity-bias related results could be accounted for by perseveration.

      First, we examined whether our Bayesian models, augmented by a perseveration mechanism (as in our CA model), can generate predictions similar to our empirical results. We repeated our cross-fitting procedure with these extended Bayesian models. To briefly recap, this involved fitting participant behavior with them, generating synthetic datasets based on the resulting maximum likelihood (ML) parameters, and then fitting these simulated datasets with our Credibility-Valence CA model (which is designed to detect positivity bias). This test revealed that adding perseveration to our Bayesian models did not predict a positivity bias in learning. In absolute terms there was a small negativity bias (instructed-credibility Bayesian: b=−0.19, F(1,1218)=17.78, p<0.001, Fig. S23a-b; free-credibility Bayesian: b=−0.17, F(1,1218)=13.74, p<0.001, Fig. S23d-e). In relative terms we detected no valence-related bias (instructed-credibility Bayesian: b=−0.034, F(1,609)=0.45, p=0.50, Fig. S22c; free-credibility Bayesian: b=−0.04, F(1,609)=0.51, p=0.47, Fig. S23f). More critically, these simulations also did not predict a change in the level of positivity bias as a function of feedback credibility, neither at an absolute level (instructed-credibility Bayesian: F(2,1218)=0.024, p=0.98, Fig. S23b; free-credibility Bayesian: F(2,1218)=0.008, p=0.99, Fig. S23e), nor at a relative level (instructed-credibility Bayesian: F(2,609)=1.57, p=0.21, Fig. S23c; free-credibility Bayesian: F(2,609)=0.13, p=0.88, Fig. S23f). The upshot is that our positivity-bias findings cannot be accounted for by our Bayesian models even when these are augmented with perseveration.

      However, it is still possible that empirical CA parameters from our credibility-valence model (reported in main text Fig. 5) were distorted, absorbing variance from perseveration. To address this, we took a “devil's advocate” approach, testing the assumption that CA parameters are not truly affected by feedback valence and that there is only perseveration in our data. Towards that goal, we simulated data using our Credibility-CA model (which includes perseveration but does not contain a valence bias in its learning mechanism) and then fitted these synthetic datasets using our Credibility-Valence CA model to see if the observed positivity bias could be explained by perseveration alone. Specifically, we generated 101 “group-level” synthetic datasets (each including one simulation for each participant, based on their empirical ML parameters), and fitted each dataset with our Credibility-Valence CA model. We then analysed the resulting ML parameters in each dataset using the same mixed-effects models as described in the main text, examining the distribution of effects of interest across these simulated datasets. Comparing these simulation results to the data from participants revealed a nuanced picture. While the positivity bias observed in participants is within the range predicted by a pure perseveration account when measured in absolute terms (Fig. S24a), it is much higher than predicted by pure perseveration when measured relative to the overall level of learning (Fig. S24c). More importantly, the inflation in positivity bias for lower credibility feedback is substantially higher in participants than what would be predicted by a pure perseveration account, a finding that holds true for both absolute (Fig. S24b) and relative (Fig. S24d) measures.”

      “3.6.2 Discovery study

      We then replicated these analyses in our discovery study to confirm our findings. We again checked whether extended versions of the Bayesian models (including perseveration) predicted the positivity bias results observed. Our cross-fitting procedure showed that the instructed-credibility Bayesian model with perseveration did predict a positivity bias for all credibility levels in this discovery study, both when measured in absolute terms [50% credibility (b=1.74,t(824)=6.15), 70% credibility (b=2.00,F(1,824)=49.98), 85% credibility (b=1.81,F(1,824)=40.78), 100% credibility (b=2.42,F(1,824)=72.50), all p's<0.001], and in relative terms [50% credibility (b=0.25,t(412)=3.44), 70% credibility (b=0.31,F(1,412)=17.72), 85% credibility (b=0.34,F(1,412)=21.06), 100% credibility (b=0.42,F(1,412)=31.24), all p's<0.001]. However, importantly, these simulations did not predict a change in the level of positivity bias as a function of feedback credibility, neither at an absolute level (F(3,412)=1.43,p=0.24), nor at a relative level (F(3,412)=2.06,p=0.13) (Fig. S25a-c). In contrast, simulations of the free-credibility Bayesian model (with perseveration) predicted a slight negativity bias when measured in absolute terms (b=−0.35,F(1,824)=5.14,p=0.024), and no valence bias when measured relative to the overall degree of learning (b=0.05,F(1,412)=0.55,p=0.46). Crucially, this model also did not predict a change in the level of positivity bias as a function of feedback credibility, neither at an absolute level (F(3,824)=0.27,p=0.77), nor at a relative level (F(3,412)=0.76,p=0.47) (Fig. S25d-f).

      As in our main study, we next assessed whether our Credibility-CA model (which includes perseveration but no valence bias) predicted the positivity bias results observed in participants in the discovery study. This analysis revealed that the average positivity bias in participants is higher than predicted by a pure perseveration account, both when measured in absolute terms (Fig. S26a) and in relative terms (Fig. S26c). Specifically, only the aVBI for the 70% credibility agent was above what a perseveration account would predict, while the rVBI for all agents except the completely credible one exceeded that threshold. Furthermore, the inflation in positivity bias for lower credibility feedback (compared to the 100% credibility agent) is significantly higher in participants than would be predicted by a pure perseveration account, in both absolute (Fig. S26b) and relative (Fig. S26d) terms.

      Together, these results show that the general positivity bias observed in participants could be predicted by an instructed-credibility Bayesian model with perseveration, or by a CA model with perseveration. Moreover, we find that these two models can predict a positivity bias for the 50% credibility agent, raising a concern that our positivity bias findings for this source may be an artefact of not-fully controlled for perseveration. However, the credibility modulation of this positivity bias, where the bias is amplified for lower credibility feedback, is consistently not predicted by perseveration alone, regardless of whether perseveration is incorporated into a Bayesian or a CA model. This finding suggests that participants are genuinely modulating their learning based on feedback credibility, and that this modulation is not merely an artifact of choice perseveration.”

      (3) Veracity detection or positivity bias?

      The "True feedback elicits greater learning" effect (Figure 6) may be simply a re-description of the positivity bias shown in Figure 5. This figure shows that people have higher CA for trials where the feedback was in fact accurate. But assuming that people tend to choose more rewarding options, true-feedback cases will tend to also be positive-feedback cases. Accordingly, a positivity bias would yield this effect, even if people are not at all sensitive to trial-level feedback veracity. Of course, the reverse logic also applies, such that the "positivity bias" could actually reflect discounting of feedback that is less likely to be true. This idea has been proposed before as an explanation for confirmation bias (see Pilgrim et al, 2024 https://doi.org/10.1016/j.cognition.2023.105693 and much previous work cited therein). The authors should discuss the ambiguity between the "positivity bias" and "true feedback" effects within the context of this literature....

      Before addressing these excellent comments, we first note that we have now improved our "Truth-CA" model. Previously, our Truth-CA model classified feedback on each trial as true or not based on realized latent true outcomes. However, it is possible that the very same feedback would have had an opposite truth-status if the latent true outcome had been different (recall that true outcomes are stochastic). This injected noise into the trial classification in our former model. To avoid this, in our new model feedback is modulated by the probability that the reported feedback is true (marginalized over the stochasticity of the true outcome). As we describe in our responses below, we conducted extensive analyses to confirm that positivity bias does not in fact predict the truth bias we detect using our Truth-CA model.

      We have described this new model in the methods section:

      “Additionally, we formulated a “Truth-CA” model, which worked as our Credibility-CA model, but incorporated a free truth-bonus parameter (TB). This parameter modulates the extent of credit assignment for each agent based on the posterior probability of feedback being true (given the credibility of the feedback agent, and the true reward probability of the chosen bandit). The chosen bandit was updated as follows:

      $$Q \leftarrow (1 - f_Q) \cdot Q + \left[CA(\text{agent}) + TB \cdot \left(P(\text{truth}) - 0.5\right)\right] \cdot F$$

      where P(truth) is the posterior probability of the feedback being true in the current trial (for exact calculation of P(truth) see “Methods: Bayesian estimation of posterior belief that feedback is true”).”
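To make the update rule above concrete, here is a minimal sketch of a single Truth-CA update step. The function name and argument names are illustrative assumptions, and P(truth) is taken as a given input rather than computed, since its exact calculation is described in the Methods and not reproduced here. Feedback is coded as +1 or −1, following the coding stated for the CA models.

```python
def truth_ca_update(q: float, ca_agent: float, tb: float,
                    p_truth: float, feedback: int, f_q: float) -> float:
    """One update of the chosen bandit's value under the Truth-CA rule.

    q        -- current value of the chosen bandit
    ca_agent -- credit-assignment parameter of the agent giving feedback
    tb       -- truth-bonus parameter (TB)
    p_truth  -- probability that the feedback is true (supplied by the caller)
    feedback -- +1 for reward feedback, -1 for non-reward feedback
    f_q      -- forgetting rate in [0, 1]
    """
    if feedback not in (1, -1):
        raise ValueError("feedback must be +1 or -1")
    if not 0.0 <= p_truth <= 1.0:
        raise ValueError("p_truth must lie in [0, 1]")
    # The truth bonus shifts the effective credit assignment up when feedback
    # is likely true (p_truth > 0.5) and down when it is likely false.
    effective_ca = ca_agent + tb * (p_truth - 0.5)
    return (1.0 - f_q) * q + effective_ca * feedback
```

Note that when P(truth) equals 0.5 the bonus term vanishes and the rule reduces to the Credibility-CA update for that agent.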

      All relevant results have been updated accordingly in the main text:

      “To formally address whether feedback truthfulness modulates credit assignment, we fitted a new variant of the CA model (the “Truth-CA” model) to the data. This variant works as our Credibility-CA model, but incorporates a truth-bonus parameter (TB) which increases the degree of credit assignment for feedback as a function of the experimenter-determined likelihood that the feedback is true (which is read from the curves in Fig 6a when x is taken to be the true probability the bandit is rewarding). Specifically, after receiving feedback, the Q-value of the chosen option is updated according to the following rule:

      $$Q \leftarrow (1 - f_Q) \cdot Q + \left[CA(\text{agent}) + TB \cdot \left(P(\text{truth}) - 0.5\right)\right] \cdot F$$

      where 𝑇𝐵 is the free parameter representing the truth bonus, and 𝑃(𝑡𝑟𝑢𝑡ℎ) is the probability of the received feedback being true (from the experimenter’s perspective). We acknowledge that this model falls short of providing a mechanistically plausible description of the credit assignment process, because participants have no access to the experimenter’s truthfulness likelihoods (as the true bandit reward probabilities are unknown to them). Nonetheless, we use this ‘oracle model’ as a measurement tool to glean rough estimates of the extent to which credit assignment is boosted as a function of its truthfulness likelihood.

      Fitting this Truth-CA model to participants' behaviour revealed a significant positive truth-bonus (mean=0.21, t(203)=3.12, p=0.002), suggesting that participants indeed assign greater weight to feedback that is likely to be true (Fig. 6c; see SI 3.3.1 for detailed ML parameter results). Notably, simulations using our other models (Methods) consistently predicted smaller truth biases (compared to the empirical bias) (Fig. 6d). Moreover, truth bias was still detected even in a more flexible model that allowed for both a positivity bias and truth-bias (see SI 3.7). The upshot is that participants are biased to assign higher credit based on feedback that is more likely to be true in a manner that is inconsistent with our Bayesian models and above and beyond the previously identified positivity biases.”

      Finally, the Supplementary Information for the discovery study has also been revised to feature this analysis:

      “We next assessed whether participants infer whether the feedback they received on each trial was true or false and adjust their credit assignment based on this inference. We again used the “Truth-CA” model to obtain estimates for the truth bonus (TB), the increase in credit assignment as a function of the posterior probability of feedback being true. As in our main study, the fitted truth bias parameter was significantly positive, indicating that participants assign greater weight to feedback they believe is likely to be true (Fig. S4a; see SI 3.3.1 for detailed ML parameter results). Strikingly, model-simulations (Methods) predicted a lower truth bonus than the one observed in participants (Fig. S4b).”

      Additionally, we thank the reviewer for pointing us to the relevant work by Pilgrim et al. (2024). We agree that the relationship between "true feedback" and "positivity bias" effects is nuanced, and their potential overlap warrants careful consideration. However, our analyses suggest that the truth-bonus effect is not merely a by-product of positivity bias. Firstly, simulations of our Credibility-Valence CA model predict only a small "truth bonus" effect, which is notably smaller than what we observed in participants. Secondly, we formulated an extension of our "Truth-CA" model that includes a valence bias in credit assignment. If our truth bonus results were merely an artifact of positivity bias, this extended model should absorb that variance, producing a null truth bonus parameter. However, fitting this model to participant data still revealed a significant positive truth bonus, which again exceeds the range predicted by simulations of our Credibility-CA model:

      “3.7 Truth inference is still detected when controlling for valence bias

      Given that participants frequently select bandits that are, on average, mostly rewarding, it is reasonable to assume that positive feedback is more likely to be objectively true than negative feedback. This raises the question of whether the "truth inference" effect we observed in participants might simply be an alternative description of a positivity bias in learning. To directly test this idea, we extended our Truth-CA model to explicitly account for a valence bias in credit assignment. This extended model features separate CA parameters for positive and negative feedback for each agent. When we fitted this new model to participant behavior, it still revealed a significant truth bonus in both the main study (Wilcoxon’s signed-rank test: median = 0.09, z(202)=2.12, p=0.034; Fig. S27a) and the discovery study (median = 3.52, z(102)=7.86, p<0.001; Fig. S27c). Moreover, in the main study, this truth bonus remained significantly higher than what was predicted by all the alternative models, with the exception of the instructed-credibility Bayesian model (Fig. S27b). In the discovery study, the truth bonus was significantly higher than what was predicted by all the alternative models (Fig. S27d).”

      Together, these findings suggest that our truth inference results are not simply a re-description of a positivity bias.

      Conversely, we acknowledge the reviewer's point that our positivity bias results could potentially stem from a more general truth inference mechanism. We believe that this possibility should be addressed in a future study where participants rate their belief that received feedback is true (rather than a lie). We have extended our discussion to clarify this possibility and to include the suggested citation:

      “Our findings show that individuals increase their credit assignment for feedback in proportion to the perceived probability that the feedback is true, even after controlling for source credibility and feedback valence. Strikingly, this learning bias was not predicted by any of our Bayesian or credit-assignment (CA) models. Notably, our evidence for this bias is based on an “oracle model” that incorporates the probability of feedback truthfulness from the experimenter's perspective, rather than the participant’s. This raises an important open question: how do individuals form beliefs about feedback truthfulness, and how do these beliefs influence credit assignment? Future research should address this by eliciting trial-by-trial beliefs about feedback truthfulness. Doing so would also allow for testing the intriguing possibility that an exaggerated positivity bias for non-credible sources reflects, to some extent, a truth-based discounting of negative feedback—i.e., participants may judge such feedback as less likely to be true. However, it is important to note that the positivity bias observed for fully credible sources (here and in other literature) cannot be attributed to a truth bias—unless participants were, against instructions, distrustful of that source.”

      The authors get close to this in the discussion, but they characterize their results as differing from the predictions of rational models, the opposite of my intuition. They write:

      “Alternative "informational" (motivation-independent) accounts of positivity and confirmation bias predict a contrasting trend (i.e., reduced bias in low- and medium credibility conditions) because in these contexts it is more ambiguous whether feedback confirms one's choice or outcome expectations, as compared to a full-credibility condition.”

      I don't follow the reasoning here at all. It seems to me that the possibility for bias will increase with ambiguity (or perhaps will be maximal at intermediate levels). In the extreme case, when feedback is fully reliable, it is impossible to rationally discount it (illustrated in Figure 6A). The authors should clarify their argument or revise their conclusion here.

      We apologize for the lack of clarity in our previous explanation. We removed the sentence you cited (it was intended to make a different point which we now consider non-essential). Our current narration is consistent with the point you are making.

      (4) Disinformation or less information?

      Zooming out, from a computational/functional perspective, the reliability of feedback is very similar to reward stochasticity (the difference is that reward stochasticity decreases the importance/value of learning in addition to its difficulty). I imagine that many of the effects reported here would be reproduced in that setting. To my surprise, I couldn't quickly find a study asking that precise question, but if the authors know of such work, it would be very useful to draw comparisons. To put a finer point on it, this study does not isolate which (if any) of these effects are specific to disinformation, rather than simply less information. I don't think the authors need to rigorously address this in the current study, but it would be a helpful discussion point.

      We thank the reviewer for highlighting the parallel (and difference) between feedback reliability and reward stochasticity. However, we have not found any comparable results in the literature. We also note that our discussion includes a paragraph addressing the locus of our effects, making the point that more studies are necessary to determine whether our findings are due to disinformation per se or to sources being less informative. Although this paragraph was included in the previous version, the comment led us to infer that our Discussion was too long, and we have therefore shortened it considerably:

      “An important question arises as to the psychological locus of the biases we uncovered. Because we were interested in how individuals process disinformation—deliberately false or misleading information intended to deceive or manipulate—we framed the feedback agents in our study as deceptive, who would occasionally “lie” about the true choice outcome. However, statistically (though not necessarily psychologically), these agents are equivalent to agents who mix truth-telling with random “guessing” or “noise” where inaccuracies may arise from factors such as occasionally lacking access to true outcomes, simple laziness, or mistakes, rather than an intent to deceive. This raises the question of whether the biases we observed are driven by the perception of potential disinformation as deceitful per se or simply as deviating from the truth. Future studies could address this question by directly comparing learning from statistically equivalent sources framed as either lying or noisy. Unlike previous studies wherein participants had to infer source credibility from experience (30,37,72), we took an explicit-instruction approach, allowing us to precisely assess source-credibility impact on learning, without confounding it with errors in learning about the sources themselves. More broadly, our work connects with prior research on observational learning, which examined how individuals learn from the actions or advice of social partners (72–75). This body of work has demonstrated that individuals integrate learning from their private experiences with learning based on others’ actions or advice—whether by inferring the value others attribute to different options or by mimicking their behavior (57,76). However, our task differs significantly from traditional observational learning. Firstly, our feedback agents interpret outcomes rather than demonstrating or recommending actions (30,37,72). Secondly, participants in our study lack private experiences unmediated by feedback sources. Finally, unlike most observational learning paradigms, we systematically address scenarios with deliberately misleading social partners. Future studies could bridge this by incorporating deceptive social partners into observational learning, offering a chance to develop unified models of how individuals integrate social information when credibility is paramount for decision-making.”

      (5) Over-reliance on analyzing model parameters

      Most of the results rely on interpreting model parameters, specifically, the "credit assignment" (CA) parameter. Exacerbating this, many key conclusions rest on a comparison of the CA parameters fit to human data vs. those fit to simulations from a Bayesian model. I've never seen anything like this, and the authors don't justify or even motivate this analysis choice. As a general rule, analyses of model parameters are less convincing than behavioral results because they inevitably depend on arbitrary modeling assumptions that cannot be fully supported. I imagine that most or even all of the results presented here would have behavioral analogues. The paper would benefit greatly from the inclusion of such results. It would also be helpful to provide a description of the model in the main text that makes it very clear what exactly the CA parameter is capturing (see next point).

      We thank the reviewer for this important suggestion which we address together with the following point.

      (6) RL or regression?

      I was initially very confused by the "RL" model because it doesn't update based on the TD error. Consequently, the "Q values" can go beyond the range of possible reward (SI Figure 5). These values are therefore not Q values, which are defined as expectations of future reward ("action values"). Instead, they reflect choice propensities, which are sometimes notated $h$ in the RL literature. This misuse of notation is unfortunately quite common in psychology, so I won't ask the authors to change the variable. However, they should clarify when introducing the model that the Q values are not action values in the technical sense. If there is precedent for this update rule, it should be cited.

      Although the change is subtle, it suggests a very different interpretation of the model.

      Specifically, I think the "RL model" is better understood as a sophisticated logistic regression, rather than a model of value learning. Ignoring the decay term, the CA term is simply the change in log odds of repeating the just-taken action in future trials (the change is negated for negative feedback). The PERS term is the same, but ignoring feedback. The decay captures that the effect of each trial on future choices diminishes with time. Importantly, however, we can re-parameterize the model such that the choice at each trial is a logistic regression where the independent variables are an exponentially decaying sum of feedback of each type (e.g., positive-cred50, positive-cred75, ... negative-cred100). The CA parameters are simply coefficients in this logistic regression.

      Critically, this is not meant to "deflate" the model. Instead, it clarifies that the CA parameter is actually not such an assumption-laden model estimate. It is really quite similar to a regression coefficient, something that is usually considered "model agnostic". It also recasts the non-standard "cross-fitting" approach as a very standard comparison of regression coefficients for model simulations vs. human data. Finally, using different CA parameters for true vs false feedback is no longer a strange and implausible model assumption; it's just another (perfectly valid) regression. This may be a personal thing, but after adopting this view, I found all the results much easier to understand.
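The re-parameterization described above can be checked numerically. The sketch below is an illustration under simplifying assumptions (a single feedback type with one CA coefficient, no perseveration term, and an initial value of zero); the function names are hypothetical. It shows that iterating the decay-plus-increment update gives the same value as CA multiplied by an exponentially decaying sum of past feedback, which is the regressor in the logistic-regression view.

```python
def recursive_value(feedback: list[int], ca: float, f_q: float) -> float:
    """Apply Q <- (1 - f_q) * Q + ca * F trial by trial, starting from Q = 0."""
    q = 0.0
    for f in feedback:
        q = (1.0 - f_q) * q + ca * f
    return q


def decaying_sum_value(feedback: list[int], ca: float, f_q: float) -> float:
    """Closed form: ca times an exponentially decaying sum of past feedback.

    The most recent trial has weight 1, the one before it (1 - f_q), and so on.
    """
    n = len(feedback)
    regressor = sum((1.0 - f_q) ** (n - 1 - k) * f for k, f in enumerate(feedback))
    return ca * regressor


# Hypothetical feedback sequence (+1 = reward, -1 = non-reward).
history = [1, -1, 1, 1]
print(recursive_value(history, ca=0.8, f_q=0.2))
print(decaying_sum_value(history, ca=0.8, f_q=0.2))
```

With several feedback types, the same argument applies to each type separately: each contributes its own decaying-sum regressor, and the corresponding CA parameter is its coefficient.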

      We thank the reviewer for their insightful and illuminating comments, particularly concerning the interpretation of our model parameters and the nature of our Credit-Assignment model. We believe your interpretation of the model is accurate, and we now narrate it to readers in the hope that our modelling will become clearer and more intuitive. We also explain to readers how this recasts our “cross-fitting” approach in the way you suggested (we return to this point below).

      Broadly, while we agree that modelling results depend on underlying assumptions, we believe that “model-agnostic” approaches also have important limitations—especially in reinforcement learning (RL), where choices are shaped by histories of past events, which such approaches often fail to fully account for. As students of RL, we are frequently struck by how careful modelling demonstrates that seemingly meaningful “model-agnostic” patterns can emerge as artefacts of unaccounted-for variables. We also note that the term “model-agnostic” is difficult to define—after all, even regression models rely on assumptions, and some computational models make richer or more transparent assumptions than others. Ideally, we aim to support our findings using converging methods wherever possible.

      We want to clarify that many of our reported findings indeed stem from straightforward behavioral analyses (e.g., simple regressions of choice-repetition), which do not rely on complex modeling assumptions. The two key results that primarily depend on the analysis of model parameters are our findings related to positivity bias and truth inference.

      Regarding the positivity bias, identifying truly model-agnostic behavioral signatures, distinct from effects like choice-perseveration, has historically been a significant challenge in the literature. Classical research on this bias rests on the interpretation of model parameters (Lefebvre et al., 2017; Palminteri et al., 2017), or at least on the use of models to assess what an “unbiased learner” baseline should look like (Palminteri & Lebreton, 2022). Some researchers have suggested possible regressions incorporating history effects to detect positivity bias from choice-repetition behavior, but these regressions (like our model) rely on subtle assumptions about forgetting and history effects (Toyama et al., 2019). In our case, this issue is also demonstrated by the analysis we conducted in response to the reviewer's previous point (about perseveration masquerading as positivity bias). We believe that clearly dissociating positivity bias from perseveration is an important challenge for the field going forward.

      For our truth inference results, obtaining purely behavioral signatures is similarly challenging due to the intricate interdependencies (which the reviewer identified in previous points) between agent credibility, feedback valence, feedback truthfulness, and choice accuracy within our task design.

      Finally, we agree with the reviewer that regression coefficients are often interpreted as a “model-agnostic” pattern. From this perspective even our findings regarding positivity and truth bias are not a case of over-reliance on complex model assumptions but are rather a way to expose deviations between empirical “sophisticated” regression coefficients and coefficients predicted from Bayesian models.

      We have now described the main learning rule of our model in the main text to ensure that the meaning of the CA parameters is clearer for readers:

      “Next, we formulated a family of non-Bayesian computational RL models. Importantly, these models can flexibly express non-Bayesian learning patterns and, as we show in following sections, can serve to identify learning biases deviating from an idealized Bayesian strategy. Here, an assumption is that during feedback, the choice propensity for the chosen bandit (which here is represented by a point estimate, “Q value”, rather than a distribution) either increases or decreases (for positive or negative feedback, respectively) according to a magnitude quantified by the free “Credit-Assignment (CA)” model parameters (47):

      𝑄(𝑐ℎ𝑜𝑠𝑒𝑛) ← (1 – 𝑓<sub>Q</sub>) ∗ 𝑄(𝑐ℎ𝑜𝑠𝑒𝑛) + 𝐶𝐴(𝑎𝑔𝑒𝑛𝑡, 𝑣𝑎𝑙𝑒𝑛𝑐𝑒) ∗ 𝐹

      where F is the feedback received from the agents (coded as 1 for reward feedback and -1 for non-reward feedback), while fQ (∈[0,1]) is the free parameter representing the forgetting rate of the Q-value (Fig. 2a, bottom panel; Fig. S5b; Methods). The probability of choosing a bandit (say A over B) in this family of models is a logistic function of the contrast in choice propensities between these two bandits. One interpretation of this model is as a “sophisticated” logistic regression, where the CA parameters take the role of “regression coefficients” corresponding to the change in log odds of repeating the just-taken action in future trials based on the feedback (+/- CA for positive or negative feedback, respectively; the model also includes gradual perseveration, which allows for constant log-odds changes that are not affected by choice feedback; see “Methods: RL models”). The forgetting rate captures the extent to which the effect of each trial on future choices diminishes with time. The Q-values are thus exponentially decaying sums of logistic choice propensities based on the types of feedback a bandit received.”
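      The update rule and its "decaying sum" reading can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the CA values, and the forgetting rate below are hypothetical, the perseveration term mentioned in the text is omitted, and the same bandit is assumed to be chosen and updated on every trial.

```python
import math


def update_q(q, ca, feedback, forgetting):
    """One credit-assignment step: Q <- (1 - f_Q) * Q + CA * F."""
    return (1.0 - forgetting) * q + ca * feedback


def choice_prob(q_a, q_b):
    """Logistic probability of choosing bandit A over B from the Q contrast."""
    return 1.0 / (1.0 + math.exp(-(q_a - q_b)))


def decayed_sum(cas, feedbacks, forgetting):
    """Closed form of repeated updates starting from Q = 0:
    the sum over trials t of CA_t * F_t * (1 - f_Q)^(T - 1 - t)."""
    n = len(feedbacks)
    return sum(
        ca * fb * (1.0 - forgetting) ** (n - 1 - t)
        for t, (ca, fb) in enumerate(zip(cas, feedbacks))
    )


# Hypothetical per-trial CA values (they depend on agent and valence) and feedback.
cas = [0.2, 0.8, 1.5, 0.8]
feedbacks = [1, -1, 1, 1]   # +1 = reward feedback, -1 = non-reward feedback
f_q = 0.3                   # hypothetical forgetting rate

q = 0.0
for ca, fb in zip(cas, feedbacks):
    q = update_q(q, ca, fb, f_q)

print(q, decayed_sum(cas, feedbacks, f_q), choice_prob(q, 0.0))
```

      Under these assumptions the iterated update and the closed-form sum agree, which is the sense in which each trial's contribution to the log odds decays geometrically with the forgetting rate.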

      We also explain the implications of this perspective for our cross-fitting procedure:

      “To further characterise deviations between behaviour and our Bayesian learning models, we used a “cross-fitting” method. Treating CA parameters as data-features of interest (i.e., feedback dependent changes in choice propensity), our goal was to examine if and how empirical features differ from features extracted from simulations of our Bayesian learning models. Towards that goal, we simulated synthetic data based on Bayesian agents (using participants’ best fitting parameters), but fitted these data using the CA-models, obtaining what we term “Bayesian-CA parameters” (Fig. 2d; Methods). A comparison of these Bayesian-CA parameters with empirical-CA parameters, obtained by fitting CA models to empirical data, allowed us to uncover patterns consistent with, or deviating from, ideal-Bayesian value-based inference. Under the sophisticated logistic-regression interpretation of the CA-model family, the cross-fitting method comprises a comparison between empirical regression coefficients (i.e., empirical CA parameters) and regression coefficients based on simulations of Bayesian models (Bayesian-CA parameters). Using this approach, we found that both the instructed-credibility and free-credibility Bayesian models predicted increased Bayesian-CA parameters as a function of agent credibility (Fig. 3c; see SI 3.1.1.2 Tables S8 and S9). However, an in-depth comparison between Bayesian and empirical CA parameters revealed discrepancies from ideal Bayesian learning, which we describe in the following sections.”

      Recommendations for the authors:

      Reviewer #3 (Recommendations for the authors):

      (1) Keep terms consistent, e.g., follow-up vs. main; hallmark vs. traditional.

      We have now changed the text to keep terms consistent.

      (2) CA model is like a learning rate; but it's based on the raw reward, not the TD error - this seems strange.

      We thank the reviewer for this comment. We understand that the use of a CA model instead of a TD error model may seem unusual at first glance. However, the CA model offers an important advantage: it more easily accommodates what we term "negative learning rates". This means that some participants may treat certain agents (especially the random one) as consistently deceitful, leading them to effectively increase/reduce choice tendencies following negative/positive feedback. A CA model handles this naturally by allowing negative CA parameters as a simple extension of positive ones. In contrast, adapting a TD error model to account for this is more complex. For instance, attempting to introduce a "negative learning rate" makes the RW model behave in a non-stable manner (e.g., Q values become <0 or >1). At the initial stages of our project, we explored different approaches to dealing with this issue and we found the CA model provides the best approach. For these reasons, we decided to proceed with our CA model.

      Additionally, we used the CA model in previous studies (e.g., Moran, Dayan & Dolan (2021)), where we included (in the SI) a detailed discussion of the similarities and differences between credit-assignment and Rescorla-Wagner models.
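      The stability argument can be made concrete with a small sketch. It assumes the standard Rescorla-Wagner form Q ← Q + α(r − Q) and the CA update quoted above; the parameter values and helper names are hypothetical and chosen only for illustration.

```python
def rw_update(q, reward, alpha):
    """Rescorla-Wagner step: Q <- Q + alpha * (reward - Q)."""
    return q + alpha * (reward - q)


def ca_update(q, feedback, ca, forgetting):
    """Credit-assignment step: Q <- (1 - f_Q) * Q + CA * F."""
    return (1.0 - forgetting) * q + ca * feedback


def run(step, q0, n_trials):
    """Apply the same single-trial update n_trials times and keep the path."""
    path = [q0]
    for _ in range(n_trials):
        path.append(step(path[-1]))
    return path


# Hypothetical case: the same bandit receives reward feedback on 20 trials,
# but the learner treats the source as deceitful (negative rate / negative CA).
rw_path = run(lambda q: rw_update(q, 1.0, -0.3), 0.5, 20)
ca_path = run(lambda q: ca_update(q, 1.0, -0.3, 0.2), 0.0, 20)

print(rw_path[:4])   # leaves the [0, 1] range within three updates
print(ca_path[-1])   # stays above the bound CA / f_Q = -1.5
```

      With a negative α the Rescorla-Wagner value is pushed away from the reward by a factor (1 − α) > 1 per trial, so it grows without bound, whereas the CA propensity with forgetting rate f_Q > 0 remains bounded in magnitude by |CA| / f_Q.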

      (3) Why was the follow-up study not pre-registered?

      We appreciate the reviewer's comment regarding preregistration, which we should have done. Unfortunately, this is now “water under the bridge” but going forward we hope to pre-register increasing parts of our work.

      (4) Other work looking at reward stochasticity?

      As noted in point 4 of the main weaknesses, previous work on reward stochasticity primarily focused on explaining the increase/decrease in learning and its mechanistic bases under varying stochasticity levels. In our study, we uniquely characterize several specific learning biases that are modulated by source credibility, a topic not extensively explored within the existing reward stochasticity framework, as far as we know.

      (5) Equation 1 is different from the one in the figure?

      The reviewer is completely correct. The figure provides a simplified visual representation, primarily focusing on the feedback-based update of the Q-value, and for simplicity, it omits the forgetting term present in the full Equation 1. To ensure complete clarity and prevent any misunderstanding, we have now incorporated a more detailed explanation of the model, including the complete Equation 1 and its components, directly within the main text. This comprehensive description will ensure that readers are fully aware of how the model operates.

      “Next, we formulated a family of non-Bayesian computational RL models. Importantly, these models can flexibly express non-Bayesian learning patterns and, as we show in following sections, can serve to identify learning biases deviating from an idealized Bayesian strategy. Here, an assumption is that during feedback, the choice propensity for the chosen bandit (which here is represented by a point estimate, “Q value”, rather than a distribution) either increases or decreases (for positive or negative feedback, respectively) according to a magnitude quantified by the free “Credit-Assignment (CA)” model parameters (47):

      𝑄(𝑐ℎ𝑜𝑠𝑒𝑛) ← (1 – 𝑓<sub>Q</sub>) ∗ 𝑄(𝑐ℎ𝑜𝑠𝑒𝑛) + 𝐶𝐴(𝑎𝑔𝑒𝑛𝑡, 𝑣𝑎𝑙𝑒𝑛𝑐𝑒) ∗ 𝐹

      where F is the feedback received from the agents (coded as 1 for reward feedback and -1 for non-reward feedback), while fQ (∈[0,1]) is the free parameter representing the forgetting rate of the Q-value (Fig. 2a, bottom panel; Fig. S5b; Methods).”

      (6) Please describe/plot the distribution of all fitted parameters in the supplement. I would include the mean and SD in the main text (methods) as well.

      Following the reviewer’s suggestions, we have included in the Supplementary Document tables displaying the mean and SD of fitted parameters from participants for our main models of interest. We have also plotted the distributions of such parameters. Both for the main study:

      (7) "A novel approach within the disinformation literature by exploiting a Reinforcement Learning (RL) experimental framework".

      The idea of applying RL to disinformation is not new. Please tone down novelty claims. It would be nice to cite/discuss some of this work as well.

      https://arxiv.org/abs/2106.05402?utm_source=chatgpt.com https://www.scirp.org/pdf/jbbs_2022110415273931.pdf https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4173312

      We thank the reviewer for pointing us towards relevant literature. We have now toned down the sentence in the introduction and cited the references provided:

      “To address these questions, we adopt a novel approach within the disinformation literature by exploiting a Reinforcement Learning (RL) experimental framework (36). While RL has guided disinformation research in recent years (37–40), our approach is novel in using one of its most popular tasks: the “bandit task”.”

      (8) Figure 3a - The figures should be in the order that they're referenced (3 is referenced before 2).

      We generally try to stick to this important rule but, in this case, we believe that our ordering better serves the narrative and hope the reviewer will excuse this small violation.

      (9) "Additionally, we found a positive feedback-effect for the 3-star agent"

      What is the analysis here? To avoid confusion with the "positive feedback" effect, consider using "positive effect of feedback". The dash wasn't sufficient to avoid confusion in my case.

      We have now updated the terms in the text to avoid confusion.

      (10) The discovery study revealed even stronger results supporting a conclusion that the credibility-CA model was superior to both Bayesian models for most subjects

      This is very subjective, but I'll just mention that my "cherry-picking" flag was raised by this sentence. Are you only mentioning cases where the discovery study was consistent with the main study? Upon a closer read, I think the answer is most likely "no", but you might consider adopting a more systematic (perhaps even explicit) policy on when and how you reference the discovery study to avoid creating this impression in a more casual reader.

      We thank the reviewer for this valuable suggestion. To prevent any impression of "cherry-picking", we have removed specific references to the discovery study from the main body of the text. Instead, all discussions regarding the convergence and divergence of results between the two studies are now in the dedicated section focusing on the discovery study:

      “The discovery study (n=104) used a disinformation task structurally similar to that used in our main study, but with three notable differences: 1) it included 4 feedback agents, with credibilities of 50%, 70%, 85% and 100%, represented by 1, 2, 3, and 4 stars, respectively; 2) each experimental block consisted of a single bandit pair, presented over 16 trials (with 4 trials for each feedback agent); and 3) in certain blocks, unbeknownst to participants, the two bandits within a pair were equally rewarding (see SI section 1.1). Overall, this study's results supported similar conclusions as our main study (see SI section 1.2) with a few differences. We found convergent support for increased learning from more credible sources (SI 1.2.1), superior fit for the CA model over Bayesian models (SI 1.2.2) and increased learning from feedback inferred to be true (SI 1.2.6). Additionally, we found an inflation of positivity bias for low-credibility sources both when measured relative to the overall level of credit assignment (as in our main study) and in absolute terms (unlike in our main study) (Fig. S3; SI 1.2.5). Moreover, choice-perseveration could not predict an amplification of positivity bias for low-credibility sources (see SI 3.6.2). However, we found no evidence for learning based on 50%-credibility feedback when examining either the feedback effect on choice repetition or CA in the credibility-CA model (SI 1.2.3).”

      (11) An in-depth comparison between Bayesian and empirical CA parameters revealed discrepancies from normative Bayesian learning.

      Consider saying where this in-depth comparison can be found (based on my reading, I think you're referring to the next section?).

      We have now modified the sentence for better clarity:

      “However, an in-depth comparison between Bayesian and empirical CA parameters revealed discrepancies from ideal Bayesian learning, which we describe in the following sections.”

      (12) "which essentially provides feedback" Perhaps you meant "random feedback"?

      We have modified the text as suggested by the reviewer.

      (13) Essentially random

      Why "essentially"? Isn't it just literally random?

      We have modified the text as suggested by the reviewer.

      (14) Both Bayesian models predicted an attenuated credit-assignment for the 3-star agent

      Attenuated relative to what? I wouldn't use this word if you mean weaker than what we see in the human data. Instead, I would say people show an exaggerated credit-assignment, since Bayes is the normative baseline.

      We changed the text according to the reviewer’s suggestion:

      “A comparison of empirical and Bayesian credit-assignment parameters revealed a further deviation from ideal Bayesian learning: participants showed an exaggerated credit-assignment for the 3-star agent compared with Bayesian models.”

      (15) "there was no difference between 2-star and 3-star agent contexts (b=0.051, F(1,2419)=0.39, p=0.53)"

      You cannot confirm the null hypothesis! Instead, you can write "The difference between 2-star and 3-star agent contexts was not significant". Although even with this language, you should be careful that your conclusions don't rest on the lack of a difference (the next sentence is somewhat ambiguous on this point).

      Additionally, the reported b coefs do not match the figure, which if anything, suggests a larger drop from 0.75 (2-star) to 1 (3-star). Is this a mixed vs fixed effects thing? It would be helpful to provide an explanation here.

      We thank the reviewer for this question. When we previously submitted our manuscript, we thought that finding enhanced credit-assignment for fully credible feedback following potential disinformation from a DIFFERENT context would constitute a striking demonstration of our “contrast effect”. However, upon reexamining this finding we found out we had a coding error (affecting how trials were filtered). We have now rerun and corrected this analysis. We have assessed the contrast effect for both "same-context" trials (where the contextual trial featured the same bandit pair as the learning trial) and "different-context" trials (where the contextual trial featured a different bandit pair). Our re-analysis reveals a selective significant contrast effect in the same-context condition, but no significant effect in the different-context condition. We have updated the main text to reflect these corrected findings and provide a clearer explanation of the analysis:

      “A comparison of empirical and Bayesian credit-assignment parameters revealed a further deviation from ideal Bayesian learning: participants showed an exaggerated credit-assignment for the 3-star agent compared with Bayesian models [Wilcoxon signed-rank test, instructed-credibility Bayesian model (median difference=0.74, z=11.14); free-credibility Bayesian model (median difference=0.62, z=10.71), all p’s<0.001] (Fig. 3a). One explanation for enhanced learning for the 3-star agent is a contrast effect, whereby credible information looms larger against a backdrop of non-credible information. To test this hypothesis, we examined whether the impact of feedback from the 3-star agent is modulated by the credibility of the agent in the trial immediately preceding it. More specifically, we reasoned that the impact of a 3-star agent would be amplified by a “low credibility context” (i.e., when it is preceded by a low credibility trial). In a binomial mixed effects model, we regressed choice-repetition on feedback valence from the last trial featuring the same bandit pair (i.e., the learning trial) and the feedback agent on the trial immediately preceding that last trial (i.e., the contextual credibility; see Methods for model-specification). This analysis included only learning trials featuring the 3-star agent, and context trials featuring the same bandit pair as the learning trial (Fig. 4a). We found that feedback valence interacted with contextual credibility (F(2,2086)=11.47, p<0.001) such that the feedback-effect (from the 3-star agent) decreased as a function of the preceding context-credibility (3-star context vs. 2-star context: b=-0.29, F(1,2086)=4.06, p=0.044; 2-star context vs. 1-star context: b=-0.41, t(2086)=-2.94, p=0.003; and 3-star context vs. 1-star context: b=-0.69, t(2086)=-4.74, p<0.001) (Fig. 4b). This contrast effect was not predicted by simulations of our main models of interest (Fig. 4c). No effect was found when focussing on contextual trials featuring a bandit pair different than the one in the learning trial (see SI 3.5). Thus, these results support an interpretation that credible feedback exerts a greater impact on participants’ learning when it follows non-credible feedback, in the same learning context.”

      We have modified the discussion accordingly as well:

      “A striking finding in our study was that for a fully credible feedback agent, credit assignment was exaggerated (i.e., higher than predicted by our Bayesian models). Furthermore, the effect of fully credible feedback on choice was further boosted when it was preceded by a low-credibility context related to current learning. We interpret this in terms of a “contrast effect”, whereby veridical information looms larger against a backdrop of disinformation (21). One upshot is that exaggerated learning might entail a risk of jumping to premature conclusions based on limited credible evidence (e.g., a strong conclusion that a vaccine produces significant side-effect risks based on weak credible information, following non-credible information about the same vaccine). An intriguing possibility, which could be tested in future studies, is that participants strategically amplify the extent of learning from credible feedback to dilute the impact of learning from non-credible feedback. For example, a person scrolling through a social media feed, encountering copious amounts of disinformation, might amplify the weight they assign to credible feedback in order to dilute effects of ‘fake news’. Ironically, these results also suggest that public campaigns might be more effective when embedding their messages in low-credibility contexts, which may boost their impact.”

      And we have included some additional analyses in the SI document:

      “3.5 Contrast effects for contexts featuring a different bandit

      Given that we observed a contrast effect when both the learning and the immediately preceding “context trial” involved the same pair of bandits, we next investigated whether this effect persisted when the context trial featured a different bandit pair, a situation where the context would be irrelevant to the current learning. Again, we used a binomial mixed effects model, regressing choice-repetition on feedback valence in the learning trial and the feedback agent in the context trial. This analysis included only learning trials featuring the 3-star agent, and context trials featuring a different bandit pair than the learning trial (Fig. S22a). We found no significant evidence of an interaction between feedback valence and contextual credibility (F(2,2364)=0.21, p=0.81) (Fig. S22b). This null result was consistent with the range of outcomes predicted by our main computational models (Fig. S22c).

      We aimed to formally compare the influence of two types of contextual trials: those featuring the same bandit pair as the learning trial versus those featuring a different pair. To achieve this, we extended our mixed-effects model by incorporating a new predictor variable, "CONTEXT_TYPE", which coded whether the contextual trial involved the same bandit pair (coded as -0.5) or a different bandit pair (+0.5) compared to the learning trial. The Wilkinson notation for this expanded mixed-effects model is:

      𝑅𝐸𝑃𝐸𝐴𝑇 ~ 𝐶𝑂𝑁𝑇𝐸𝑋𝑇_𝑇𝑌𝑃𝐸 ∗ 𝐹𝐸𝐸𝐷𝐵𝐴𝐶𝐾 ∗ (𝐶𝑂𝑁𝑇𝐸𝑋𝑇<sub>2-star</sub> + 𝐶𝑂𝑁𝑇𝐸𝑋𝑇<sub>3-star</sub>) + 𝐵𝐸𝑇𝑇𝐸𝑅 + (1|𝑝𝑎𝑟𝑡𝑖𝑐𝑖𝑝𝑎𝑛𝑡)

      This expanded model revealed a significant three-way interaction between feedback valence, contextual credibility, and context type (F(2,4451) = 7.71, p<0.001). Interpreting this interaction, we found a 2-way interaction between context-source and feedback valence when the context was the same (F(2,4451) = 12.03, p<0.001), but not when context was different (F(2,4451) = 0.23, p = 0.79). Further interpreting the double feedback-valence * context-source interaction (for the same context) we obtained the same conclusions as reported in the main text.”

      (16) "Strikingly, model-simulations (Methods) showed this pattern is not predicted by any of our other models"

      Why doesn't the Bayesian model predict this?

      Thanks for the comment. Overall, Bayesian models do predict a slight truth inference effect (see Figure 6d). However, these effects are not as strong as the ones observed in participants, suggesting that our results go beyond what would be predicted by a Bayesian model.

      Conceptually, it's important to note that the Bayesian model can infer (after controlling for source credibility and feedback valence) whether feedback is truthful based solely on prior beliefs about the chosen bandit. Using this inferred truth to amplify the weight of truthful feedback would effectively amount to “bootstrapping on one’s own beliefs.” This is most clearly illustrated with the 50% agent: if one believes that a chosen bandit yields rewards 70% of the time, then positive feedback is more likely to be truthful than negative feedback. However, a Bayesian observer would also recognize that, given the agent’s overall unreliability, such feedback should be ignored regardless.
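      The 50%-agent example can be worked through numerically. The sketch below is a simplification, not the authors' Bayesian model: it assumes a point prior on the reward probability (rather than a full belief distribution) and assumes that an agent with credibility c reports the true outcome with probability c and the opposite outcome otherwise. The function names are hypothetical.

```python
def posterior_truthful(prior_reward, credibility, feedback_positive):
    """P(feedback is truthful | feedback valence), by Bayes' rule."""
    if feedback_positive:
        truthful = prior_reward * credibility              # reward, reported honestly
        lying = (1 - prior_reward) * (1 - credibility)     # no reward, reported falsely
    else:
        truthful = (1 - prior_reward) * credibility        # no reward, reported honestly
        lying = prior_reward * (1 - credibility)           # reward, reported falsely
    return truthful / (truthful + lying)


def posterior_reward(prior_reward, credibility, feedback_positive):
    """P(bandit actually rewarded | feedback valence), by Bayes' rule."""
    like_reward = credibility if feedback_positive else 1 - credibility
    like_no_reward = 1 - credibility if feedback_positive else credibility
    numerator = prior_reward * like_reward
    return numerator / (numerator + (1 - prior_reward) * like_no_reward)


# Belief that the chosen bandit pays out 70% of the time, 50%-credibility agent.
p, c = 0.7, 0.5
print(posterior_truthful(p, c, True), posterior_truthful(p, c, False))
print(posterior_reward(p, c, True), posterior_reward(p, c, False))
```

      Under these assumptions, positive feedback from the 50% agent is judged more likely truthful (0.7) than negative feedback (0.3), yet the belief about the outcome stays at the prior (0.7) for either valence, which is why an ideal observer gains nothing from such feedback.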

      (17) "A striking finding in our study was that for a fully credible feedback agent, credit assignment was exaggerated (i.e., higher than predicted by a Bayesian strategy)".

      "Since we did not find any significant interactions between BETTER and the other regressors, we decided to omit it from the model formulation".

      Was this decision made after seeing the data? If so, please report the original analysis as well.

      We have included the BETTER regressor again, and we have re-run the analyses. We now report the results of such regression. We have also changed the methods section accordingly:

      “We used a different mixed-effects binomial regression model to test whether value learning from the 3-star agent was modulated by contextual credibility. We focused this analysis on instances where the previous trial with the same bandit pair featured the 3-star agent. We regressed the variable REPEAT, which indicated whether the current trial repeated the choice from the previous trial featuring the same bandit-pair (repeated choice=1, non-repeated choice=0). We included the following regressors: FEEDBACK coding the valence of feedback in the previous trial with the same bandit pair (positive=0.5, negative=-0.5), CONTEXT2-star indicating whether the trial immediately preceding the previous trial with the same bandit pair (context trial) featured the 2-star agent (feedback from 2-star agent=1, otherwise=0), and CONTEXT3-star indicating whether the trial immediately preceding the previous trial with the same bandit pair featured the 3-star agent. We also included a regressor (BETTER) coding whether the bandit chosen in the learning trial was the better (mostly rewarding) or the worse (mostly unrewarding) bandit within the pair. We included in this analysis only current trials where the context trial featured a different bandit pair. The model in Wilkinson’s notation was:

      𝑅𝐸𝑃𝐸𝐴𝑇 ~ 𝐹𝐸𝐸𝐷𝐵𝐴𝐶𝐾 ∗ (𝐶𝑂𝑁𝑇𝐸𝑋𝑇<sub>2-star</sub> + 𝐶𝑂𝑁𝑇𝐸𝑋𝑇<sub>3-star</sub>) + 𝐵𝐸𝑇𝑇𝐸𝑅 + (1|𝑝𝑎𝑟𝑡𝑖𝑐𝑖𝑝𝑎𝑛𝑡) (13)

      In figure 4c, we independently calculated the repeat probability difference for the better (mostly rewarding) and worse (mostly non-rewarding) bandits and averaged across them. This calculation was done at the participant level, and the results were then averaged across participants.”
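      For readers less familiar with Wilkinson notation, the fixed-effects part of Equation 13 expands into main effects plus the FEEDBACK-by-context interactions. The sketch below builds one row of that design under stated assumptions: the ±0.5 coding of BETTER and the 1/0 coding of the 3-star context are not given in the text and are assumed here by analogy with the other regressors, the column names are illustrative, and the random intercept per participant is left out.

```python
def design_row(feedback_positive, context_stars, chose_better):
    """Fixed-effects predictors for one trial of
    REPEAT ~ FEEDBACK * (CONTEXT_2star + CONTEXT_3star) + BETTER."""
    feedback = 0.5 if feedback_positive else -0.5        # coding stated in the text
    context2 = 1.0 if context_stars == 2 else 0.0        # coding stated in the text
    context3 = 1.0 if context_stars == 3 else 0.0        # assumed, by analogy
    better = 0.5 if chose_better else -0.5               # assumed coding
    return {
        "Intercept": 1.0,
        "FEEDBACK": feedback,
        "CONTEXT_2star": context2,
        "CONTEXT_3star": context3,
        "FEEDBACK:CONTEXT_2star": feedback * context2,
        "FEEDBACK:CONTEXT_3star": feedback * context3,
        "BETTER": better,
    }


# A 1-star context is the reference level: both context columns are zero,
# so FEEDBACK alone carries the feedback effect for that context.
print(design_row(True, 1, True))
print(design_row(False, 3, False))
```

      With this coding, the FEEDBACK coefficient is the feedback effect in the 1-star context, and each interaction coefficient is the change in that effect for the 2-star or 3-star context.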

    1. Shellfish reefs, particularly mussels, can form large areas of habitat that are vital to their infaunal communities (Cole and McQuaid, 2010), but past research has shown that as calcifying organisms, they are the most vulnerable to warming and acidification (Kroeker et al., 2013a; Parker et al., 2013). On temperate Australian rocky shores, habitats created by the native mussel Trichomya hirsuta, and to a lesser extent, the invasive mussel Mytilus galloprovincialis support a local diversity of annelids, crustaceans, molluscs, and echinoderms (People, 2006; Cole, 2010). Eastern Australia is a climate change “hot-spot” with sea surface temperatures in this region increasing three times faster than the global average (Wernberg et al., 2011; Hobday and Pecl, 2014), and oceans are acidifying worldwide (Collins et al., 2013). The invasive M. galloprovincialis is relatively tolerant to environmental change (Hiebenthal et al., 2013); whereas little is known about the tolerance of T. hirsuta. As the oceans warm and acidify, M. galloprovincialis may have the capacity to replace T. hirsuta as the dominant biogenic habitat on the Australian rocky shores. Any changes in the biogenic mussel habitat could alter the infaunal communities, with downstream consequences for dependent organisms. Such consequences will have an impact on the natural communities and the success of current and future shellfish reef restoration projects (Pereira et al., 2019).

      If natives are replaced by hardier shellfish, do we think organisms will adapt to consume the new shellfish? Perhaps softer shelled mussels move into the territory; will these areas be more susceptible to storm surges and wave energy? The new species may temporarily sound good but could be quickly destroyed by storm systems. This may enable the new species to spread out further and possibly benefit, or lead to the softer shelled mussels' demise. Could the stronger storm systems associated with climate change put more stress on these oyster beds?

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      This work by Govorunova et al. identified three naturally blue-shifted channelrhodopsins (ChRs) from ancyromonads, namely AnsACR, FtACR, and NlCCR. The phylogenetic analysis places the ancyromonad ChRs in a distinct branch, highlighting their unique evolutionary origin and potential for novel applications in optogenetics. Further characterization revealed the spectral sensitivity, ionic selectivity, and kinetics of the newly discovered AnsACR, FtACR, and NlCCR. This study also offers valuable insights into the molecular mechanism underlying the function of these ChRs, including the roles of specific residues in the retinal-binding pocket. Finally, this study validated the functionality of these ChRs in both mouse brain slices (for AnsACR and FtACR) and in vivo in Caenorhabditis elegans (for AnsACR), demonstrating the versatility of these tools across different experimental systems.

      In summary, this work provides a potentially valuable addition to the optogenetic toolkit by identifying and characterizing novel blue-shifted ChRs with unique properties.

      Strengths:

      This study provides a thorough characterization of the biophysical properties of the ChRs and demonstrates the versatility of these tools in different ex vivo and in vivo experimental systems. The mutagenesis experiments also revealed the roles of key residues in the photoactive site that can affect the spectral and kinetic properties of the channel.

      We thank the Reviewer for his/her positive evaluation of our work.

      Weaknesses:

      While the novel ChRs identified in this work are spectrally blue-shifted, there still seems to be some spectral overlap with other optogenetic tools. The authors should provide more evidence to support the claim that they can be used for multiplex optogenetics and help potential end-users assess if they can be used together with other commonly applied ChRs. Additionally, further engineering or combination with other tools may be required to achieve truly orthogonal control in multiplexed experiments.

      To demonstrate the usefulness of ancyromonad ChRs for multiplex optogenetics as a proof of principle, we co-expressed AnsACR with the red-shifted cation-conducting ChR Chrimson and measured the net photocurrent generated by this combination as a function of wavelength. We found that it is hyperpolarizing in the blue region of the spectrum and depolarizing in the red region. In the revision, we added a new panel (Figure 1D) showing these results and the following paragraph to the main text:

      “To test the possibility of using AnsACR in multiplex optogenetics, we co-expressed it with the red-shifted CCR Chrimson (Klapoetke et al., 2014) fused to an EYFP tag in HEK293 cells. We measured the action spectrum of the net photocurrents with 4 mM Cl<sup>-</sup> in the pipette, matching the conditions in the neuronal cytoplasm (Doyon, Vinay et al. 2016). Figure 1D, black shows that the direction of photocurrents was hyperpolarizing upon illumination with λ<500 nm and depolarizing at longer wavelengths. A shoulder near 520 nm revealed a FRET contribution from EYFP (Govorunova, Sineshchekov et al. 2020), which was also observed upon expression of the Chrimson construct alone (Figure 1D, red)”.

      In the C. elegans experiments, partial recovery of pharyngeal pumping was observed after prolonged illumination, indicating potential adaptation. This suggests that the effectiveness of these ChRs may be limited by cellular adaptation mechanisms, which could be a drawback in long-term experiments. A thorough discussion of this challenge in the application of optogenetics tools would prove very valuable to the readership.

      We added the following paragraph to the revised Discussion:

      “One possible explanation of the partial recovery of pharyngeal pumping that we observed after 15-s illumination, even at the highest tested irradiance, is continued attenuation of photocurrent during prolonged illumination (desensitization). However, the rate of AnsACR desensitization (Figure 1 – figure supplement 4A and Figure 1 – figure supplement 5A) is much faster than the rate of the pumping recovery, reducing the likelihood that desensitization is driving this phenomenon. Another possible reason for the observed adaptation is an increase in the cytoplasmic Cl<sup>-</sup> concentration owing to AnsACR activity and hence a breakdown of the Cl<sup>-</sup> gradient on the neuronal membrane. The C. elegans pharynx is innervated by 20 neurons, 10 of which are cholinergic (Pereira, Kratsios et al. 2015). A pair of MC neurons is the most important for regulation of pharyngeal pumping, but other pharyngeal cholinergic neurons, including I1, M2, and M4, also play a role (Trojanowski, Padovan-Merhar et al. 2014). Moreover, the pharyngeal muscles generate autonomous contractions in the presence of acetylcholine tonically released from the pharyngeal neurons (Trojanowski, Raizen et al. 2016). Given this complexity, further elucidation of pharyngeal pumping adaptation mechanisms is beyond the scope of this study.”

      Reviewer #2 (Public review):

      Summary:

      Govorunova et al present three new anion opsins that have potential applications in silencing neurons. They identify new opsins by scanning numerous databases for sequence homology to known opsins, focusing on anion opsins. The three opsins identified are uncommonly fast, potent, and are able to silence neuronal activity. The authors characterize numerous parameters of the opsins.

      Strengths:

      This paper follows the tradition of the Spudich lab, presenting and rigorously characterizing potentially valuable opsins. Furthermore, they explore several mutations of the identified opsin that may make these opsins even more useful for the broader community. The opsins AnsACR and FtACR are particularly notable, having extraordinarily fast onset kinetics that could have utility in many domains. Furthermore, the authors show that AnsACR is usable in multiphoton experiments, with a peak photocurrent at a commonly used wavelength. Overall, the authors' detailed measurements and characterization make for an important resource, both presenting new opsins that may be important for future experiments, and providing characterizations to expand our understanding of opsin biophysics in general.

      We thank the Reviewer for his/her positive evaluation of our work.

      Weaknesses:

      First, while the authors frequently reference GtACR1, a well-used anion opsin, there is no side-by-side data comparing these new opsins to the existing state-of-the-art. Such comparisons are very useful to adopt new opsins.

      GtACR1 exhibits the peak sensitivity at 515 nm and therefore is poorly suited for combination with red-shifted CCRs or fluorescent sensors, unlike blue-light-absorbing ancyromonad ACRs. Nevertheless, we conducted side-by-side comparison of ancyromonad ChRs, GtACR1 and GtACR2, the latter of which has the spectral maximum at 470 nm. The results are shown in the new Figures 1E and F, and the new multipanel Figure 1 – figure supplement 4 added in the revision. We also added the following text, describing these results, to the revised Results section:

      “Figures 1E and F show the dependence of the peak photocurrent amplitude and reciprocal peak time, respectively, on the photon flux density for ancyromonad ChRs and GtACRs. The current amplitude saturated earlier than the time-to-peak for all tested ChRs. Figure 1 – figure supplement 4A-E shows normalized photocurrent traces recorded at different photon densities. Quantitation of desensitization at the end of 1-s illumination revealed a complex light dependence (Figure 1 – figure supplement 4F). Figure 1 – figure supplement 5 shows normalized photocurrent traces recorded in response to a 5-s light pulse of the maximal available intensity and the magnitude of desensitization at its end.”

      Next, multiphoton optogenetics is a promising emerging field in neuroscience, and I appreciate that the authors began to evaluate this approach with these opsins. However, a few additional comparisons are needed to establish the user viability of this approach, principally the photocurrent evoked using the 2P process, for given power densities. Comparison across the presented opsins and GtACR1 would allow readers to assess if these opsins are meaningfully activated by 2P.

      We carried out additional 2P experiments in ancyromonad ChRs, GtACR1 and GtACR2 and added their results to a new main-text Figure 6 and Figure 6 – figure supplement 1. We added the new section describing these results, “Two-photon excitation”, to the main text in the revision:

      “To determine the 2P activation range of AnsACR, FtACR, and NlCCR, we conducted raster scanning using a conventional 2P laser, varying the excitation wavelength between 800 and 1,080 nm (Figure 6 – figure supplement 1). All three ChRs generated detectable photocurrents with action spectra showing maximal responses at ~925 nm for AnsACR, 945 nm for FtACR, and 890 nm for NlCCR (Figure 6A). These wavelengths fall within the excitation range of common Ti:Sapphire lasers, which are widely used in neuroscience laboratories and can be tuned between ~700 nm and 1,020-1,300 nm. To assess desensitization, cells expressing AnsACR, FtACR, or NlCCR were illuminated at the respective peak wavelength of each ChR at 15 mW for 5 seconds. GtACR1 and GtACR2, previously used in 2P experiments (Forli, Vecchia et al. 2018, Mardinly, Oldenburg et al. 2018), were included for comparison. The normalized photocurrent traces recorded under these conditions are shown in Figure 6B-F. The absolute amplitudes of 2P photocurrents at the peak time and at the end of illumination are shown in Figure 6G and H, respectively. All five tested variants exhibited comparable levels of desensitization at the end of illumination (Figure 6I).”

      Reviewer #3 (Public review):

      Summary:

      The authors aimed to develop Channelrhodopsins (ChRs), light-gated ion channels, with high potency and blue action spectra for use in multicolor (multiplex) optogenetics applications. To achieve this, they performed a bioinformatics analysis to identify ChR homologues in several protist species, focusing on ChRs from ancyromonads, which exhibited the highest photocurrents and the most blue-shifted action spectra among the tested candidates. Within the ancyromonad clade, the authors identified two new anion-conducting ChRs and one cation-conducting ChR. These were characterized in detail using a combination of manual and automated patch-clamp electrophysiology, absorption spectroscopy, and flash photolysis. The authors also explored sequence features that may explain the blue-shifted action spectra and differences in ion selectivity among closely related ChRs.

      Strengths:

      A key strength of this study is the high-quality experimental data, which were obtained using well-established techniques such as manual patch-clamp and absorption spectroscopy, complemented by modern automated patch-clamp approaches. These data convincingly support most of the claims. The newly characterized ChRs expand the optogenetics toolkit and will be of significant interest to researchers working with microbial rhodopsins, those developing new optogenetic tools, as well as neuro- and cardioscientists employing optogenetic methods.

      We thank the Reviewer for his/her positive evaluation of our work.

      Weaknesses:

      This study does not exhibit major methodological weaknesses. The primary limitation of the study is that it includes only a limited number of comparisons to known ChRs, which makes it difficult to assess whether these newly discovered tools offer significant advantages over currently available options.

      We conducted a side-by-side comparison of ancyromonad ChRs and GtACRs, which are widely used for optical inhibition of neuronal activity. The results are shown in the new Figures 1E and F, and the new multipanel Figure 1 – figure supplement 4 and Figure 1 – figure supplement 5 added in the revision. We also added the following text, describing these results, to the revised Results section:

      “Figures 1E and F show the dependence of the peak photocurrent amplitude and reciprocal peak time, respectively, on the photon flux density for ancyromonad ChRs and GtACRs. The current amplitude saturated earlier than the time-to-peak for all tested ChRs. Figure 1 – figure supplement 4A-E shows normalized photocurrent traces recorded at different photon densities. Quantitation of desensitization at the end of 1-s illumination revealed a complex light dependence (Figure 1 – figure supplement 4F). Figure 1 – figure supplement 5 shows normalized photocurrent traces recorded in response to a 5-s light pulse of the maximal available intensity and the magnitude of desensitization at its end.”

      Additionally, although the study aims to present ChRs suitable for multiplex optogenetics, the new ChRs were not tested in combination with other tools. A key requirement for multiplexed applications is not just spectral separation of the blue-shifted ChR from the red-shifted tool of interest but also sufficient sensitivity and potency under low blue-light conditions to avoid cross-activation of the respective red-shifted tool. Future work directly comparing these new ChRs with existing tools in optogenetic applications and further evaluating their multiplexing potential would help clarify their impact.

      As a proof of principle, we co-expressed AnsACR with the red-shifted CCR Chrimson and demonstrated that the net photocurrent generated by this combination is hyperpolarizing in the blue region of the spectrum and depolarizing in the red region. In the revision, we added a new panel (Figure 1D) showing these results and the following paragraph to the main text:

      “To test the possibility of using AnsACR in multiplex optogenetics, we co-expressed it with the red-shifted CCR Chrimson (Klapoetke et al., 2014) fused to an EYFP tag in HEK293 cells. We measured the action spectrum of the net photocurrents with 4 mM Cl<sup>-</sup> in the pipette, matching the conditions in the neuronal cytoplasm (Doyon, Vinay et al. 2016). Figure 1D, black shows that the direction of photocurrents was hyperpolarizing upon illumination with λ<500 nm and depolarizing at longer wavelengths. A shoulder near 520 nm revealed a FRET contribution from EYFP (Govorunova, Sineshchekov et al. 2020), which was also observed upon expression of the Chrimson construct alone (Figure 1D, red)”.

      Reviewing Editor Comments:

      The reviewers suggest that direct comparison to GtACR1 is the most important step to make this work more useful to the community.

      We followed the Reviewers’ recommendations and carried out a side-by-side comparison of ancyromonad ChRs with GtACR1 and GtACR2 (Figure 1E and F, Figure 1 – figure supplement 4, Figure 1 – figure supplement 5, and Figure 6). Note, however, that GtACR1’s spectral maximum is at 515 nm, which makes it poorly suited for blue light excitation. Also, ChRs are known to perform very differently in different cell types and upon expression of their genes in different vector backbones, so our results cannot be generalized to all experimental systems. Each ChR user needs to select the most appropriate tool for his/her purpose by testing several candidates in his/her own experimental setting.

      Reviewer #1 (Recommendations for the authors):

      (1) The figure legend for Figure 2D-I appears to be incomplete. Please provide a detailed explanation of the panels.

      In the revision, we have expanded the legend of Figure 2 to explain all individual panels.

      (2) The meaning of the Vr shift (Y-axis in Figure 2H-I) should be clarified in the main text to aid reader understanding.

      In the revision, we added the phrase “which indicated higher relative permeability to NO<sub>3</sub><sup>-</sup> than to Cl<sup>-</sup>” to explain the meaning of the V<sub>r</sub> shift upon replacement of Cl<sup>-</sup> with NO<sub>3</sub><sup>-</sup>.
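
To illustrate how a reversal-potential shift maps onto relative permeability, the following minimal sketch applies the Goldman-Hodgkin-Katz voltage equation. It is not the authors' analysis. It assumes that external Cl<sup>-</sup> is fully replaced by an equal concentration of the test anion, that the internal solution is unchanged, and that only monovalent anions carry the current. The function name, the temperature, and the example shift of -10 mV are hypothetical values chosen for illustration.

```python
import math

R = 8.314462618   # gas constant, J mol^-1 K^-1
F = 96485.33212   # Faraday constant, C mol^-1


def permeability_ratio(delta_vr_mV, temp_C=25.0):
    """Return P_X / P_Cl from the reversal-potential shift.

    delta_vr_mV is Vr measured in the test anion minus Vr measured in Cl-.
    Assumes equimolar external replacement and monovalent anions (z = -1),
    for which the GHK equation gives delta_Vr = -(RT/F) * ln(P_X / P_Cl).
    """
    rt_over_f_mV = 1000.0 * R * (temp_C + 273.15) / F
    return math.exp(-delta_vr_mV / rt_over_f_mV)


# Hypothetical example value, not taken from the manuscript.
example_shift_mV = -10.0
ratio = permeability_ratio(example_shift_mV)
print(f"Shift of {example_shift_mV} mV -> P_X/P_Cl = {ratio:.2f}")
```

Under this sign convention, a shift to more negative voltages corresponds to a ratio above 1, that is, the test anion is more permeant than Cl<sup>-</sup>.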

      (3) Adding statistical analysis for the peak and end photocurrent values in Figure 2D-F would strengthen the claim that there is minimal change in relative permeability during illumination.

      In the revision, we added the V<sub>r</sub> values for the peak photocurrent to Figure 2H-I, which already contained the V<sub>r</sub> values for the end photocurrent, and carried out a statistical analysis of their comparison. The following sentence was added to the text in the revision:

      “The V<sub>r</sub> values of the peak current and that at the end of illumination were not significantly different by the two-tailed Wilcoxon signed-rank test (Fig. 2G), indicating no change in the relative permeability during illumination.”

      (4) Figure 4H and I seem out of place in Figure 4, as the title suggests a focus on wild-type proteins and AnsACR mutants. The authors could consider moving these panels to Figure 3 for better alignment with the content.

      As noted below, we changed the panel order in Figure 4 upon the Reviewer’s request. In particular, former Figure 4I is Figure 4C in the revision, and former Figure 4H is now panel C in Figure 3 – figure supplement 1 in the revision. We rearranged the corresponding section of the text (highlighted yellow in the manuscript).

      (5) The characterization section could be strengthened by including data on the pH sensitivity of FtACR, which is currently missing from the main figures.

      Upon the Reviewer’s request, we carried out pH titration of FtACR absorbance and added the results as Figure 4B in the revision.

      (6) The logic in Figure 4A-G appears somewhat disjointed. For example, Figure 4A shows pH sensitivity for WT AnsACR and the G86E mutant, while Figure 4 B-D shifts to WT AnsACR and the D226N mutant, and Figure 4E returns to the G86E mutant. Reorganizing or clarifying the flow would improve readability.

      We followed the Reviewer’s advice and changed the panel order in Figure 4. In the revised version, the upper row (panels A-C) shows the pH titration data of the three WTs, the middle row (panels D-F) shows analysis of the AnsACR_D226N mutant, and the lower row (panels G-I) shows analysis of the AnsACR_G86E mutant. We also rearranged the description of these panels in the text accordingly.

      (7) In Figure 5A, "NIACR" should likely be corrected to "NlCCR".

      We corrected the typo in the revision.

      (8) The statistical significance in Figure 6C and D is somewhat confusing. Clarifying which groups are being compared and using consistent symbols would improve interpretability.

      In the revision, we improved the figure panels and legend to clarify that the comparisons are between the dark and light stimulation groups within the same current injection.

      (9) The authors pointed out that at rest or when a small negative current was injected, the neurons expressing Cl- permeable ChRs could generate a single action potential at the beginning of photostimulation, as has been reported before. The authors could help by further discussing if and how this phenomenon would affect the applicability of such tools.

      We mentioned in the revised Discussion section that activation of ACRs in the axons could depolarize the axons and trigger synaptic transmission at the onset of light stimulation, and that this undesired excitatory effect needs to be taken into consideration when using ACRs.

      Reviewer #2 (Recommendations for the authors):

      Govorunova et al present three new anion opsins that have potential applications in silencing neurons. This paper follows the tradition of the Spudich lab, presenting and rigorously characterizing potentially valuable opsins. Furthermore, they explore several mutations of the identified opsin that may make these opsins even more useful for the broader community. In general, I feel positively about this manuscript. It presents new potentially useful opsins and provides characterization that would enable its use. I have a few recommendations below, mostly centered around side-by-side comparisons to existing opsins.

      (1) My primary concern is that while there is a reference to GtACR1, a highly used opsin first described by this team, they do not present any of this data side by side.

      When evaluating opsins to use, it is important to compare them to the existing state of the art. As a potential user, I need to know where these opsins differ. Citing other papers does not solve this as, even within the same lab, subtle methodological differences or data plotting decisions can obscure important differences.

      As we explained in the response to the public comments, we carried out side-by-side comparison of ancyromonad ChRs and GtACRs as requested by the Reviewer. The results are shown in the new Figures 1E and F, and the new multipanel Figure 1 – figure supplement 4 and Figure 1 – figure supplement 5, added in the revision. However, we would like to emphasize a limited usefulness of such comparative analysis, as ChRs are known to perform very differently in different cell types and upon expression of their genes in different vector backbones, so our results cannot be generalized for all experimental systems. Each ChR user needs to select the most appropriate tool for his/her purpose by testing several candidates in his/her own experimental setting.

      (2) Multiphoton optogenetics is an emerging field of optogenetics, and it is admirable that the authors address it here. The authors should present more 2p characterization, so that it can be established if these new opsins are viable for use with 2P methods, the way GtACR1 is. The following would be very useful for 2P characterization:

      Photocurrents for a given power density, compared to GtACR1 and GtACR2.

      The new Figure 6 (B-F) added in the revision shows photocurrent traces recorded from the three ancyromonad ChRs and two GtACRs upon 2P excitation of a given power density.

      Comparing NlCCR and FtACR's wavelength specificity and photocurrent. If these opsins are too weak to create reasonable 2P spectra, this difference should be discussed.

      The new Figure 6A shows the 2P action spectra of all three ancyromonad ChRs.

      A trace and calculated photocurrent kinetics to compare 1P and 2P. This need not be the flash-based absorption characterization of Figure 3, but a side-by-side photocurrent as in Figure 2.

      As mentioned above, photocurrent traces recorded from ancyromonad ChRs and GtACRs upon 2P excitation are shown in the new Figure 6 (B-F). However, direct comparison of the 2P data with the 1P data is not possible, as we used laser scanning illumination for the former and wide-field illumination for the latter.

      Characterization of desensitization. As the authors mention, many opsins undergo desensitization, presenting the ratio of peak photocurrent vs that at multiple time points (probably up to a few seconds) would provide evidence for how effectively these constructs could be used in different scenarios.

      We conducted a detailed analysis of desensitization under both 1P and 2P excitation. The new Figure 1 – figure supplement 4 and Figure 1 – figure supplement 5 show the data obtained under 1P excitation, and the new Figure 6 shows the data for 2P conditions.

      I have to admit, that by the end of the paper, I was getting confused as to which of the three original constructs had which property, and how that was changing with each mutation. I would suggest that a table summarizing each opsin and mutation with its onset and offset kinetics, peak wavelength, photocurrent, and ion selectivity would greatly increase the ability to select and use opsins in the future.

      In the revision, we added a table of the spectroscopic properties of all tested mutants as Supplementary File 2. This study did not aim to analyze other parameters listed by the Reviewer. We added the following sentence referring to this table to the main text:

      “Supplementary File 2 contains the λ values of the half-maximal amplitude of the long-wavelength slope of the spectrum, which can be estimated more accurately from the action spectra than the λ of the maximum.”

      It may be out of the scope of this manuscript, but if a soma localization sequence can be shown to remove the 'axonal spiking' (as described in line 441), this would be a significant addition to the paper.

      Our previous study (Messier et al., 2018, doi: 10.7554/eLife.38506) showed that a soma localization sequence can reduce, but not eliminate, the axonal spiking. We plan to test these new ACRs with the trafficking motifs in the future.

      NlCCR appears to have the best photocurrents of all tested opsins in this paper. It seems odd that it was omitted from the mouse cortical neurons experiments.

      We have not included analysis of NlCCR behavior in neurons because we are preparing a separate manuscript on this ChR.

      Figure 6 would benefit from more gradation in the light powers used to silence and would benefit from comparison to GtACR. I suggest using a fixed current with a series of illumination intensities to see which of the three opsins (or GtACR) is most effective at silencing. At present, it looks binary, and a user cannot evaluate if any of these opsins would be better than what is already available.

      In the revision, we added the data comparing the light sensitivity of AnsACR and FtACR with previously identified GtACR1 and GtACR2 (new Figure 1E and F) to help users compare these ACRs. Although they are less sensitive to light compared to GtACR1 and GtACR2, they could still be activated by commercially available light sources if the expression levels are similar. Less sensitive ACRs may show less unwanted activation when used with other optogenetic tools.

      Reviewer #3 (Recommendations for the authors):

      Suggested Improvements to Experiments, Data, or Analyses:

      (1) Line 25: "significantly exceeding those by previously known tools" and Line 408: "NlCCR is the most blue-shifted among ancyromonad ChRs and generates larger photocurrents than the earlier known CCRs with a similar absorption maximum." As noted in the public review, this statement applies only to a very specific subgroup of ChRs with spectral maxima below 450 nm. If the goal was to claim that NlCCR is a superior tool among a broader range of blue-light-activated ChRs, direct comparisons with state-of-the-art ChRs such as ChR2 T159C (Berndt et al., 2011), CatCh (Kleinlogel et al., 2014), CoChR (Klapoetke et al., 2014), CoChR-3M (Ganjawala et al., 2019), or XXM 2.0 (Ding et al., 2022) would be beneficial. If the goal was to demonstrate superiority among tools with spectra below 450 nm, I suggest explicitly stating this in the paper.

      The Reviewer correctly inferred that we emphasized the superiority of NlCCR among tools with similar spectral maxima, not all blue-light-activated ChRs available for neuronal photoexcitation, most of which exhibit absorption maxima at longer wavelengths. To clarify this, we added “with similar spectral maxima” to the sentence in the original Line 25. The sentence in Line 408 already contains this clarification: “with a similar absorption maximum”.

      (2) Lines 111-113: "The absorption spectra of the purified proteins were slightly blue-shifted from the respective photocurrent action spectra (Figure 1D), likely due to the presence of non-electrogenic cis-retinal-bound forms." I would be skeptical of this statement. The spectral shifts in NlCCR and AnsACR are small and may fall within the range of experimental error. The shift in FtACR is more apparent; however, if two forms coexist in purified protein, this should be reflected as two Gaussian peaks in the absorption spectrum (or at least as a broader total peak reflecting two states with close maxima and similar populations). On the contrary, the action spectrum appears to have two peaks, one potentially below 465 nm. Generally, neither spectrum appears significantly broader than a typical microbial rhodopsin spectrum. This question could be clarified by quantifying the widths of the absorption and action spectra or by overlaying them on the same axis. In my opinion, the two spectra seem very similar, and just the appearance of the "bump" in the action spectrum shifts the apparent maximum of the action spectrum to the red. If there were two states, then they should both be electrogenic, and the slight difference in spectra might be explained by something else (e.g. by a slight difference in the quantum yields of the two states).

      As the Reviewer suggested, in the revision we added a new figure (Figure 1 – figure supplement 2), showing the overlay of the absorption and action spectra of each ancyromonad ChR. This figure shows that the absorption spectra are wider than the action spectra (especially in AnsACR and FtACR), which confirms our interpretation (contribution of the non-electrogenic blue-shifted cis-retinal-bound forms to the absorption spectrum). Note that the presence of such forms explaining a blue shift of the absorption spectrum has been experimentally verified in HcKCR1 (doi: 10.1016/j.cell.2023.08.009; 10.1038/s41467-025-56491-9). Therefore, we revised the text as follows:

      “The absorption spectra of the purified proteins (Figure 1C) were slightly blue-shifted from the respective photocurrent action spectra (Figure 1 – figure supplement 3), likely due to the presence of non-electrogenic cis-retinal-bound forms. The presence of such forms, explaining the discrepancy between the absorption and the action spectra, was verified by HPLC in KCRs (Tajima et al. 2023, Morizumi et al., 2025).”

      (3) Lines 135-136: "The SyncroPatch enables unbiased estimation of the photocurrent amplitude because the cells are drawn into the wells without considering their tag fluorescence." While SyncroPatch does allow unbiased selection of patched cells, it does not account for the fraction of transfected cells. Without a method to exclude non-transfected cells, which are always present in transient transfections, the comparison of photocurrents may be affected by the proportion of untransfected cells, which could vary between constructs. To clarify whether the statistically significant difference in the Kolmogorov-Smirnov test could indicate that the fraction of transfected cells after 48-72h differs between constructs, I suggest analyzing only transfected cells or reporting fractions of transfected cells by each construct.

      The Reviewer correctly states that non-transfected cells are always present in transiently transfected cell populations. However, his/her suggestion to “exclude non-transfected cells” is not feasible in the absence of a criterion for such exclusion. As is evident from our data, transient transfection results in a continuum of the amplitude values, and it is not possible to distinguish a small photocurrent from no photocurrent, considering the noise level. We would like, however, to emphasize that not excluding any cells provides an estimate of the overall potency of each ChR variant, which depends on both the fraction of transfected cells and their photocurrents. This approach mimics the conditions of in vivo experiments, when non-expressing cells also cannot be excluded.

      (4) Line 176: "AnsACR and FtACR photocurrents exhibited biphasic rise." The fastest characteristic time is very close to the typical resolution of a patch-clamp experiment (RC = 50 μs for a 10 pF cell with a 5 MΩ series resistance). Thus, I am skeptical that the faster time constant of the biphasic opening represents a protein-specific characteristic time. It may not be fully resolved by patch-clamp and could simply result from low-pass filtering of a specific cell. I suggest clarifying this for the reader.

      The Reviewer is right that the patch clamp setup acts as a lowpass filter. Earlier, we directly measured its time resolution (~15 μs) by recording the ultrafast (occurring on the ps time scale) charge movements related to the trans-cis isomerization (doi: 10.1111/php.12558). However, the lowpass filter of the setup can only slow the entire signal, but cannot lead to the appearance of a separate kinetic component (i.e. a monophasic process cannot become biphasic). Therefore, we believe that the biphasic photocurrent rise reflects biphasic channel opening rather than a measurement artifact. Two phases in the channel opening have also been detected in GtACR1 (doi: 10.1073/pnas.1513602112) and CrChR2 (10.1073/pnas.1818707116).
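
The arithmetic behind the Reviewer's estimate can be checked with a minimal sketch. It assumes an idealized single-pole RC circuit formed by the cell capacitance and the series resistance. The function names are hypothetical, and the 10 pF and 5 MΩ values are the Reviewer's illustrative numbers rather than measurements from this study.

```python
import math


def rc_time_constant_us(capacitance_pF, series_resistance_MOhm):
    """Time constant of an ideal RC circuit, in microseconds.

    pF x MOhm = 1e-12 F x 1e6 Ohm = 1e-6 s, so the product of the two
    inputs is already expressed in microseconds.
    """
    return capacitance_pF * series_resistance_MOhm


def cutoff_frequency_Hz(tau_us):
    """-3 dB cutoff of a single-pole low-pass filter with time constant tau."""
    return 1.0 / (2.0 * math.pi * tau_us * 1e-6)


# Reviewer's illustrative values: 10 pF cell, 5 MOhm series resistance.
tau_us = rc_time_constant_us(10.0, 5.0)
print(f"RC time constant: {tau_us:.0f} us")
print(f"Equivalent cutoff frequency: {cutoff_frequency_Hz(tau_us):.0f} Hz")
```

The same helper can be applied to the ~15 μs resolution quoted in the response to compare the two estimates on a frequency scale.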

      (5) Line 516: "The forward LED current was 900 mA." It would be more informative to report the light intensity rather than the forward current, as many readers may not be familiar with the specific light output of the used LED modules at this forward current.

      We have added the light intensity value in the revision:

      “The forward LED current was 900 mA (which corresponded to the irradiance of ~2 mW mm<sup>-2</sup>)…”

      (6) Lines 402-403: "The NlCCR ... contains a neutral residue in the counterion position (Asp85 in BR), which is typical of all ACRs. Yet, NlCCR does not conduct anions, instead showing permeability to Na+." This is not atypical for CCRs and has been demonstrated in previous works of the authors (CtCCR in Govorunova et al. 2021, ChvCCR1 in Govorunova et al. 2022). What is unique is the absence of negatively charged residues in TM2, as noted later in the current study. However, the absence of negatively charged residues in TM2 appears to be rare for ACRs as well. Not as a strong point of criticism, but to enhance clarity, I suggest analyzing the frequency of carboxylate residues in TM2 of ACRs to determine whether the unique finding is relevant to ion selectivity or to another property.

      The Reviewer is correct that some CCRs lack a carboxylate residue in the D85 position, so this feature alone cannot be considered as a differentiating criterion. However, the complete absence of glutamates in TM2 is not rare in ACRs and is found, for example, in HfACR1 and CarACR2. We have discussed this issue in our earlier review (doi: 10.3389/fncel.2021.800313) and do not think that repeating this discussion in this manuscript is appropriate.

      Recommendations for Writing and Presentation:

      (1) Some figures contain incomplete or missing labels:

      Figure 2: Panels D to I lack labels.

      In the revision, we have expanded the legend of Figure 2 to explain all individual panels.

      Figure 3 - Figure Supplement 1: Missing explanations for each panel.

      In the revision, we changed the order of panels and explained all individual panels in the legend.

      Figure 5 - Figure Supplement 1: Missing explanations for each panel.

      No further explanation for individual panels in this Figure is needed because all panels show the action spectra of various mutants, the names of which are provided in the panels themselves. Repeating this information in the figure legend would be redundant.

      (2) In Figure 2, "sem" is written in lowercase, whereas "SEM" is capitalized in other figures. Standardizing the format would improve consistency.

      In the revision, we changed the SEM abbreviation to uppercase in all instances.

      (3) Line 20: "spectrally separated molecules must be found in nature." There is no proof that they cannot be developed synthetically; rather, it is just difficult. I suggest softening this statement, as the findings of this study, together with others, will probably allow designing molecules with specified spectral properties in the future.

      In the revision, we changed the cited sentence to the following:

      “Multiplex optogenetic applications require spectrally separated molecules, which are difficult to engineer without disrupting channel function”.

      (4) Line 216-219: "Acidification increased the amplitude of the fast current ~10-fold (Figure 4F) and shifted its Vr ~100 mV (Figure 3 - figure supplement 1D), as expected of passive proton transport. The number of charges transferred during the fast peak current was >2,000 times smaller than during the channel opening, from which we concluded that the fast current reflects the movement of the RSB proton." The claim about passive transport of the RSB proton should be clarified, as typically, passive transport is not limited to exactly one proton per photocycle, and the authors observe the increase in the fast photocurrents upon acidification.

      We thank the Reviewer for pointing out the confusing character of our description. To clarify the matter, we added a new photocurrent trace to Figure 4I in the revision recorded from AnsACR_G86E at 0 mV and pH 7.4. We have rewritten the corresponding section of Results as follows:

      “Its rise and decay τ corresponded to the rise and decay τ of the fast positive current recorded from AnsACR_G86E at 0 mV and neutral pH, superimposed on the fast negative current reflecting the chromophore isomerization (Figure 4I, upper black trace). We interpret this positive current as an intramolecular proton transfer to the mutagenetically introduced primary acceptor (Glu86), which was suppressed by negative voltage (Figure 4I, lower black trace). Acidification increased the amplitude of the fast negative current ~10-fold (Figure 4I, black arrow) and shifted its V<sub>r</sub> ~100 mV to more depolarized values (Figure 4 – figure supplement 2A). This can be explained by passive inward movement of the RSB proton along the large electrochemical gradient.”

      Minor Corrections:

      (1) Line 204: Missing bracket in "phases in the WT (Figure 4D."

      The quoted sentence was deleted during the revision.

      (2) Line 288: Typo-"This Ala is conserved" should probably be "This Met is conserved."

      We mean here the Ala four residues downstream from the first Ala. To avoid confusion, we changed the cited sentence to the following:

      “The Ala corresponding to BR’s Gly122 is also found in AnsACR and NlCCR (Figure 5A)…”

      (3) Lines 702-704: Missing Addgene plasmid IDs in "(plasmids #XXX and #YYY, respectively)."

      In the revision, we added the missing plasmid IDs.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      There is growing appreciation for the importance of luminal (apical) ECM in tube development, but such matrices are much less well understood than basal ECMs. Here the authors provide insights into the aECM that shapes the Drosophila salivary gland (SG) tube and the importance of PAPSS-dependent sulfation in its organization and function.

      The first part of the paper focuses on careful phenotypic characterization of papss mutants, using multiple markers and TEM. This revealed reduced markers of sulfation and defects in both apical and basal ECM organization, Golgi (but not ER) morphology, number and localization of other endosomal compartments, plus increased cell death. The authors focus on the fact that papss mutants have an irregular SG lumen diameter, with both narrowed regions and bulged regions. They address the pleiotropy, showing that preventing the cell death and resultant gaps in the tube did not rescue the SG luminal shape defects and discussing similarities and differences between the papss mutant phenotype and those caused by more general trafficking defects. The analysis uses a papss nonsense mutant from an EMS screen - I appreciate the rigorous approach the authors took to analyze transheterozygotes (as well as homozygotes) plus rescued animals in order to rule out effects of linked mutations. Importantly, the rescue experiments also demonstrated that sulfation enzymatic activity is important.

      The 2nd part of the paper focuses on the SG aECM, showing that Dpy and Pio ZP protein fusions localize abnormally in papss mutants and that these ZP mutants (and Np protease mutants) have similar SG lumen shaping defects to the papss mutants. A key conclusion is that SG lumen defects correlate with loss of a Pio+Dpy-dependent filamentous structure in the lumen. These data suggest that ZP protein misregulation could explain this part of the papss phenotype.

      Overall, the text is very well written and clear. Figures are clearly labeled. The methods involve rigorous genetic approaches, microscopy, and quantifications/statistics and are documented appropriately. The findings are convincing.

      Significance:

      This study will be of interest to researchers studying developmental morphogenesis in general and specifically tube biology or the aECM. It should be particularly of interest to those studying sulfation or ZP proteins (which are broadly present in aECMs across organisms, including humans).

      This study adds to the literature demonstrating the importance of luminal matrix in shaping tubular organs and greatly advances understanding of the luminal matrix in the Drosophila salivary gland, an important model of tubular organ development and one that has key matrix differences (such as no chitin) compared to other highly studied Drosophila tubes like the trachea.

      The detailed description of the defects resulting from papss loss suggests that there are multiple different sulfated targets, with a subset specifically relevant to aECM biology. A limitation is that specific sulfated substrates are not identified here (e.g. are these the ZP proteins themselves or other matrix glycoproteins or lipids?); therefore, it's not clear how direct or indirect the effects of papss are on ZP proteins. However, this is clearly a direction for future work and does not detract from the excellent beginning made here.

      Comments on revised version:

      Overall, I am pleased with the authors' revisions in response to my original comments and those of the other reviewers.

      Reviewer #2 (Public review):

      Summary

      This study provides new insights into organ morphogenesis using the Drosophila salivary gland (SG) as a model. The authors identify a requirement for sulfation in regulating lumen expansion, which correlates with several effects at the cellular level, including regulation of intracellular trafficking and the organization of Golgi, the aECM and the apical membrane. In addition, the authors show that the ZP proteins Dumpy (Dpy) and Pio form an aECM regulating lumen expansion. Previous reports already pointed to a role for Papss in sulfation in SG and the presence of Dpy and Pio in the SG. Now this work extends these previous analyses and provides more detailed descriptions that may be relevant to the fields of morphogenesis and cell biology (with particular focus on ECM research and tubulogenesis). This study nicely presents valuable information regarding the requirements of sulfation and the aECM in SG development.

      Strengths

      -The results supporting a role for sulfation in SG development are strong. In addition, the results supporting the involvement of Dpy and Pio in the aECM of the SG, their role in lumen expansion, and their interactions, are also strong.

      -The authors have done an excellent job of revising and clarifying the many different issues raised by the reviewers, particularly with the addition of new experiments and quantifications. I consider that the manuscript has improved considerably.

      -The authors generated a catalytically inactive Papss enzyme, which is not able to rescue the defects in Papss mutants, in contrast to wild type Papss. This result clearly indicates that the sulfation activity of Papss is required for SG development.

      Weaknesses

      -The main concern is the lack of a clear connection between sulfation and the phenotypes observed at the cellular level, and, importantly, the lack of connection between sulfation and the Pio-Dpy matrix. Indeed, the mechanism(s) by which sulfation affects lumen expansion are not elucidated and no targets of this modification are identified or investigated. A direct (or instructive) role for sulfation in aECM organization is not clearly supported by the results, and the connection between sulfation and Pio/Dpy roles seems correlative rather than causative. As presented, the mechanisms by which sulfation regulates SG lumen expansion remain elusive in this study.

      -In my opinion the authors overestimate their findings with several conclusions, as exemplified in the abstract:

      "In the absence of Papss, Pio is gradually lost in the aECM, while the Dpy-positive aECM structure is condensed and dissociates from the apical membrane, leading to a thin lumen. Mutations in dpy or pio, or in Notopleural, which encodes a matriptase that cleaves Pio to form the luminal Pio pool, result in a SG lumen with alternating bulges and constrictions, with the loss of pio leading to the loss of Dpy in the lumen. Our findings underscore the essential role of sulfation in organizing the aECM during tubular organ formation and highlight the mechanical support provided by ZP domain proteins in maintaining luminal diameter."

      The findings supporting the conclusions that sulfation organizes the aECM and that the absence of Papss leads to a thin lumen due to defects in Dpy/Pio are not strong. The authors certainly show that Papss is required for proper Pio and Dpy accumulation. They also show that Pio is required for Dpy accumulation, and that Pio and Dpy form an aECM required for lumen expansion. However, the absence of Pio and Dpy does not fully recapitulate Papss mutant defects (thin lumen). I wonder whether other hypotheses and models could account for the observed results. For instance, Papss could affect secretion, in which case sulfation would have an indirect role in aECM organization. This study does not address the mechanical properties of Dpy in normal and mutant salivary glands.

      -Minor issues relate to the genotype/phenotype analysis. It is surprising that the authors detect only mild effects on sulfation in Papss mutants using an anti-sulfoTyr antibody, as Papss is the only PAPS synthetase. Generating germ line clones (which is a feasible experiment) would have helped to prove that this minor effect is due to the contribution of maternal product. The loss of function allele used in this study seems problematic, as it produces effects in heterozygous conditions that are difficult to interpret. Cleaning the chromosome or using an alternative loss of function condition (another allele, RNAi, etc.) would have helped to present a more reliable explanation.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Overall, I am pleased with the authors' revisions in response to my original comments and those of the other reviewers. The addition of the sulfation(-) mutant to Fig. 1 is particularly nice. I have just a few additional suggestions for text changes to improve clarity/precision.

      (1) The current title of this manuscript is quite broad, making it sound like a review article. I recommend adding sulfation and salivary gland to the title to convey the main points more clearly. e.g. Sulfation affects apical extracellular matrix organization during development of the Drosophila salivary gland tube.

      Thank you for the suggestion. We agree and have changed the title of the paper as suggested.

      (2) Figure 1B shows very striking enrichment of papss expression in the salivary gland compared to other tubes like the trachea that also contain Pio and Dpy. To me, this implies that the key substrate(s) of Papss are likely to be unique, or at least more highly enriched, in the salivary gland aECM compared to the tracheal aECM (e.g. probably not Pio or Dpy themselves). I suggest that the authors address the implications of this apparent SG specificity in the discussion (paragraph beginning on p. 21, line 559).

      Yes, we agree that there may be other key substrates of Papss in the SG, such as mucins, which play an important role in organizing the aECM and expanding the lumen. We have included a discussion.

      (3) p. 15, lines 374-376 "The Pio protein is known to be cleaved, at one cleavage site after the ZP domain by the furin protease and at another cleavage site within the ZP domain by the matriptase Notopleural (Np) (Drees et al., 2019; Drees et al., 2023; Figure 5B)." As far as I can see, the Drees papers show that Pio is cleaved somewhere in the vicinity of a consensus furin cleavage site, but do not actually establish that the cleavage happens at this exact site or is done by a furin protease (this is just an assumption). Please word more carefully, e.g. "at one cleavage site after the ZP domain, possibly by a furin protease".

      Thank you for pointing this out. We have edited the text.

      Reviewer #2 (Recommendations for the authors):

      Throughout the paper, I find the description of the lumen phenotypes and their interpretation a bit confusing.

      Papss mutants produce SGs that are either "thin" or show "irregular lumen with bulges". Do the authors think that these are two different manifestations of the same effect? Or do they think that there are different causes behind them?

      The thin lumen phenotype appears to occur when the Pio-Dpy matrix is significantly condensed. When this matrix is less condensed in one region of the lumen than in other regions, the lumen appears irregular with bulges.

      Are the defects in Grasp65 mutants categorized as "irregular lumen with bulges" similar to those in Papss mutants? Why don't these mutants show a "thin lumen" defect?

      Grasp65 mutant phenotypes are milder than those of Papss mutants. Multiple mutations in several Golgi components that more significantly disrupt Golgi structures and function may cause more severe defects in lumen expansion and shape.

      How do the defects described for Pio ("multiple constrictions with a slight expansion between constrictions") and Dpy mutants ("lumen with multiple bulges and constrictions") relate to the "irregular lumen with bulges" in Papss mutants?

      pio and dpy mutants show more stereotypical phenotypes, while Papss mutants exhibit more irregular and random phenotypes. The irregular lumen phenotypes in Papss mutants are associated with a condensed Pio-Dpy matrix.

    1. Author response:

      (1) General Statements

      Our manuscript studies mechanisms of planar polarity establishment in vivo in the Drosophila pupal wing. Specifically we seek to understand mechanisms of ‘cell-scale signalling’ that is responsible for segregating core pathway planar polarity proteins to opposite cell edges. This is an understudied question, in part because it is difficult to address experimentally.

      We use conditional and restrictive expression tools to spatiotemporally manipulate core protein activity, combined with quantitative measurement of core protein distribution, polarity and stability. Our results provide evidence for a robust cell-scale signal, while arguing against mechanisms that depend on depletion of a limited pool of a core protein or polarised transport of core proteins on microtubules. Furthermore, we show that polarity propagation across a tissue is hard, highlighting the strong intrinsic capacity of individual cells to establish and maintain planar polarity.

      The original manuscript received three fair and thorough peer-reviews, which raised many important points. In response, we decided to embark on a full revision that attempts to answer all of the points. We have included new data to support our conclusions in Supplemental Figures 1, 2 and 5.

      Additionally in response to the reviewers we have revised the manuscript title, which is now ‘Characterisation of cell-scale signalling by the core planar polarity pathway during Drosophila wing development’.

      (2) Point-by-point description of the revisions

      We thank all of the reviewers for their thorough and thoughtful review of our manuscript. They raise many helpful points which have been extremely useful in assisting us to revise the manuscript.

      In response we have carried out a major revision of the manuscript, making numerous changes and additions to the text and also adding new experimental data. Specific changes are listed after our detailed response to each comment.

      Reviewer #1:

      Summary

      The authors use inducible Fz::mKate2-sfGFP to explore "cell-scale signaling" in PCP. They reach several conclusions. First, they conclude that cell-scale signaling does not depend on limiting pools of core components (other than Fz). Second, they conclude that cell-scale signaling does not depend on microtubule orientation, and third, they conclude that cell-scale signaling is strong relative to cell to cell coupling of polarity. 

      There are some interesting inferences that can be drawn from the manuscript, but there are also some significant challenges in interpreting the results and conclusions from the work as presented. I suggest that the authors 1) define "cell-scale signaling," as the precise meaning must be inferred, 2) reconsider some premises upon which some conclusions depend, 3) perform an essential assay validation, and 4) explain some other puzzling inconsistencies.

      Major points

      The exact meaning of cell-scale signaling is not defined, but I infer that the authors use this term to describe how what happens on one side of a cell affects another side. The remainder of my critique depends on this understanding of the intended meaning.

      As the reviewer points out, it is important that the meaning of the term ‘cell-scale signalling’ is clear to the reader and in response to their comment we have had another go at defining it explicitly in the Introduction to the manuscript.

      Specifically, we use the term ‘cell-scale signalling’ to describe possible intracellular mechanisms acting on core protein segregation to opposite cell membranes during core pathway dependent planar polarisation. For example, this could be a signal from distal complexes at one side of the cell leading to segregation of proximal complexes to the opposite cell edge, or vice versa. See also our response to Reviewer #2 regarding the distinction between ‘molecular-scale’ and ‘cell-scale’ signalling. 

      Changes to manuscript: Revised definition of ‘cell-scale signalling’ in Introduction.

      The authors state that any tissue wide directional information comes from pre-existing polarity and its modification by cell flow, such that the de novo signaling paradigm "bypasses" these events and should therefore not be responsive to any further global cues. It is my understanding that this is not a universally accepted model, and indeed, the authors' data seem to suggest otherwise. For example, the image in Fig 5B shows that de novo induction restores polarity orientation to a predominantly proximal to distal orientation. If no global cue is active, how is this orientation explained?

      We assume that the reviewer’s point is that it is not universally accepted that de novo induction after hinge contraction leads to uncoupling from global cues (rather than that it is not accepted that hinge contraction remodels radial polarity to a proximodistal pattern). We are (we believe) the only lab that has used de novo induction as a tool, and we’re not aware of any debate in the literature about whether this bypasses global cues. Nevertheless, we accept that it is hard to prove there is no influence of global cues, when the nature of those cues and the time at which they act remain unclear. Below we summarise the reasons why we believe there are no significant effects of global cues in our experiments that would influence the interpretation of our results.

      First, our reading of the literature supports a broad consensus that an early radial core planar polarity pattern is realigned by cell flow produced by hinge contraction beginning at around 16h APF (e.g. Aigouy et al., 2010; Strutt and Strutt, 2015; Aw and Devenport, 2017; Butler and Wallingford, 2017; Tan and Strutt, 2025). Taken at face value, this suggests that there are ‘radial’ cues present prior to hinge contraction, maybe coming from the wing margin – arguably these radial cues could be Ft-Ds or Wnts or both, given they are expressed in patterns consistent with such a role (notwithstanding the published evidence arguing against roles for either of these cues). It then appears that hinge contraction supersedes these cues to convert a radial pattern to a proximodistal pattern – whether the radial cues that affect the core pathway earlier remain active after hinge contraction is unclear, although both Ft-Ds and Wnts appear to maintain their ‘radial’ patterns beyond the beginning of hinge contraction (e.g. Merkel et al., 2014; Ewen-Campen et al., 2020; Yu et al., 2020).

      We think that the reviewer is proposing the presence of a proximodistal cue that is active in the proximal region of the wing that we use for our experiments shown e.g. in Fig.5, and that this cue orients core polarity here (but not elsewhere in the wing) in a time window after 18h APF. Ft-Ds and Wnts do not seem to be plausible candidates as they are still in ‘radial’ patterns. This leaves either an unknown proximodistal cue (a gradient of some unknown signalling molecule?), or possibly some ability of hinge contraction to align proximodistal polarity specifically in this wing region but not elsewhere. We cannot definitively rule out either of these possibilities, but neither do we think there is sufficient evidence to justify invoking their existence to explain our observations.

      In particular, the reason that we don’t think there is a proximodistal cue in the proximal part of the wing after 18h APF, is that work from our lab shows that induction of Fz or Stbm expression at times around or after the start of hinge contraction (i.e. >16 h APF) results in increasing levels of trichome swirling with polarity not being coordinated with the tissue axis either proximally or distally (Strutt and Strutt, 2002; Strutt and Strutt 2007). Our simplest interpretation for this is that induction at these stages fails to establish the early radial pattern of core pathway polarity and hence hinge contraction cannot reorient radial to proximodistal. If hinge contraction alone could specify proximodistal polarity in the absence of the earlier radial polarity, then we would not expect to see swirling over much of the proximal wing (where the forces from hinge contraction are strongest (Etournay et al., 2015)).

      In this manuscript, our earliest de novo experiments begin with Fz induction at 18h APF (de novo 10h), then at 20h APF (de novo 8h) and at 22h APF (de novo 6h). The image in Fig. 5B, referred to by the reviewer, is of a wing where Fz is induced de novo at 22 h APF. In these wings, as expected, the core proteins localise asymmetrically in stereotypical swirling patterns throughout the wing surface (see Fig. 2M and also Strutt and Strutt, 2002; Strutt and Strutt 2007), but – usefully for our experiments – they broadly localise along the proximal-distal axis in the region analysed in Fig. 5B. Given the strong swirling in surrounding regions when inducing at >20h APF, we feel reasonably confident in assuming that the pattern is not due to a proximodistal cue present in the proximal wing.

      We appreciate that the original manuscript did not show images including the trichome pattern in adjacent regions, so this point would not have been clear, but we now include these in Supplementary Fig. 5. We have also added a note in the legend to Fig. 5B to clarify that the proximodistal pattern seen is local to this wing region. We apologise for this oversight and the confusion caused and appreciate the feedback.

      The 6 hr condition, that has only partial polarity magnitude, is quite disordered. Do the patterns at 8 and 10 hrs become more proximally-distally oriented? It is stated that they all show swirls, but please provide adult wing images, and the corresponding orientation outputs from QuantifyPolarity to help validate the notion that the global cues are indeed bypassed by this paradigm.

      In all three ‘normal’ de novo conditions (6h, 8h and 10h), regardless of the time of induction, the polarity orientation patterns of Fz-mKate2 in pupal and adult wings are very similar in the experimentally analysed region (Fig. S5B-E). The strong local hair swirling agrees with the previously published data (Strutt and Strutt, 2002; Strutt and Strutt 2007). Overall, we don’t see any evidence that the 10h de novo induction results in more proximodistally coordinated polarity than the 8h or 6h conditions. This is consistent with our contention that there is no global cue present at these stages, which presumably would have a stronger effect when core pathway activity was induced at earlier stages.

      Changes to manuscript: Added additional explanation of the ‘de novo induction’ paradigm and why we believe the resulting polarity patterns are unlikely to be influenced by any global signals in Introduction and Results section ‘Induced core protein relocalisation…’. Added quantification of polarity in the experiment region proximal to the anterior cross-vein in pupal wings (Fig.S5E-E’’’) and zoomed-out images of the surrounding region in adult wings showing that the polarity pattern does not become more proximodistal when induction time is longer, and also that there is not overall proximodistal polarity in proximal regions of the wing (Fig.S5B-D), arguing against an unknown proximodistal polarity cue at these stages of development.

      In the de novo paradigm, polarization is initiated immediately or shortly after heat shock induction. However, the results should be differently interpreted if the level of available Fz protein does not rise rapidly and then stabilize before the 6 hr time point, and instead continues to rise throughout the experiment. Western blots of the Fz::mKate2-sfGFP at time points after induction should be performed to demonstrate steady state prior to measurements. Otherwise, polarity magnitude could simply reflect the total available pool of Fz at different times after induction. Interpreting stability is complex, and could depend on the same issue, as well as the amount of recycling that may occur. Prior work from this lab using FRAP suggested that turnover occurs, and could result from recycling as well as replenishment from newly synthesized protein. 

      The reviewer raises an important point, which we agree could confound our experimental interpretations. As suggested we have now carried out western blotting and quantitation for Fz::mKate2-sfGFP levels and added these data to Fig.S1 (Fig. S1C,D). Quantified Fz is not significantly different between the three de novo polarity induction timings and not significantly different compared to constitutive Fz::mKate2-sfGFP expression (although there is a trend towards increasing Fz::mKate2-sfGFP protein levels with increasing induction times). These data are consistent with Fz::mKate2-sfGFP being at steady state in our experiments and that levels are sufficient to achieve normal polarity (as constitutive Fz::mKate2-sfGFP does so). Therefore it is unlikely that differing protein levels explain the differing polarity magnitudes at the different induction times. Interestingly, Fz::mKate2-sfGFP levels are lower than endogenous Fz levels, possibly due to lower expression or increased turnover/reduced recycling.

      Changes to manuscript: Added western blot analysis of Fz::mKate2-sfGFP expression under 10h, 8h and 6h induction conditions vs endogenous Fz expression and constitutive Fz::mKate2-sfGFP expression (Fig.S1C-D) and discussed in Results section ‘Planar polarity establishment is…’.

      From the Fig 3 results, the authors claim that limiting pools of core proteins do not explain cell-scale signaling, a result expected based on the lack of phenotypes in heterozygotes, but of course they do not test the possibility that Fz is limiting. They do note that some other contributing protein could be. 

      Previously published results from our lab (Strutt et al., 2016 Cell Reports; Supplemental Fig. S6E) show that in a heterozygous fz mutant background, Fz protein levels are not affected by halving the gene dosage when compared to wt, suggesting that Fz is most likely produced in excess and is not normally limiting, but that protein that cannot form complexes may be rapidly degraded. We have now added this information to the text.

      Changes to manuscript: Added explanation in text that Fz levels had previously been shown to not be dosage sensitive in Results section ‘Planar polarity establishment is…’ and also added a caveat to the Discussion about not directly testing Fz.

      In Fig 3, it is unclear why the authors chose to test dsh1/+ rather than dsh[null]/+. In any case, the statistically significant effect of Dsh dose reduction is puzzling, and might indicate that the other interpretation is correct. Ideally, a range including larger and smaller reductions would be tested. As is, I don't think limiting Dsh is ruled out. 

      Concerning the choice of dsh allele, we appreciate the query of the reviewer regarding use of dsh[1] instead of a null, as there might be a concern that dsh[1] would give a less strong phenotype. The answer is that over more than two decades we and others have never found any evidence that dsh[1] does not act as a ‘null’ for planar polarity in the pupal wing, and furthermore use of dsh[1] preserves function in Wg signalling – and we would prefer to rule out any phenotypic effects due to any potential cross-talk between the two pathways that might be seen using a complete null. To expand on this point, dsh[1] mutant protein is never seen at cell junctions (Axelrod 2001; Shimada et al., 2001; our own work), and by every criterion we have used, planar polarity is completely disrupted in hemizygous or homozygous mutants e.g. see quantifications of polarity in (Warrington et al., 2017 Curr Biol).

      In terms of the broader point, whether we can rule out Dsh being limiting, we were very careful to be clear that we did not see evidence for Dsh (or other core proteins) being limiting in terms of ‘rates of core pathway de novo polarisation’. When the reviewer says ‘the statistically significant effect of Dsh dose reduction is puzzling’ we believe they are referring to the data in Fig. 3J, showing a small but significantly different reduction in stable Fz in de novo 6h conditions (also seen in 8h de novo conditions, Fig. S3I). As Dsh is known to stabilise Fz in complexes (Strutt et al., 2011 Dev Cell; Warrington et al., 2017 Curr Biol), in itself this result is not wholly surprising. Nevertheless, while this shows that halving Dsh levels does modestly reduce Fz stability, it does not alter our conclusion that halving Dsh levels does not affect Fz polarisation rate under either 6h or 8h de novo conditions.

      Unfortunately, we do not have available to us a practical way of achieving consistent intermediate reductions in Dsh levels (e.g. a series of verified transgenes expressing at different levels). Levels of all the core proteins could be dialled down using transgenes, to see when the system breaks, and indeed we have previously published that lower levels of polarity are seen if Fmi levels are <<50% or if animals are transheterozygous for pk, stbm, dgo or dsh, pk, stbm, dgo simultaneously (Strutt et al., 2016 Cell Reports). However, it seems to be a trivial result that eventually the ability to polarise is lost if insufficient core proteins are present at the junctions. For this reason we have focused on a simple set of experiments reducing gene dosage singly by 50% under two de novo induction conditions, and have been careful to state our results cautiously. The assays we carried out were a great deal of work even for just the 5 heterozygous conditions tested.

      We believe that the experiments shown effectively make the point that there is no strong dosage sensitivity – and it remains our contention that if protein levels were the key to setting up cell-scale polarity, then a 50% reduction would be expected to show an effect on the rate of polarisation. We further note that as Fz::mKate2-sfGFP levels are lower than endogenous Fz levels (see above), the system might be expected to be sensitised to further dosage reductions, and despite this we failed to see an effect on rate of polarisation.

      We note that Reviewer #3 made a similar point about whether we can rule out dosage sensitivity on the basis of 50% reductions in protein level. To address the comments of both reviewers we have now added some further narrative and caveats in the text.

      In a similar vein, Reviewer #2 requested data on whether dosage reduction altered protein levels by the expected amount. We have now added further explanation/references and western blot data to address this.

      Changes to manuscript: Added more explanation of our choice of dsh[1] as an appropriate mutant allele to use in Results section ‘Planar polarity establishment is…’. Added some narrative and caveats regarding whether lowering levels more than 50% would add to our findings in the Discussion. Revised conclusions to be more cautious including altering section title to read ‘Planar polarity establishment is not highly sensitive to variation in protein levels of core complex components’.

      Also added westerns and text/references showing that for the tested proteins there is a reduction in protein levels upon removal of one gene dosage in Results section ‘Planar polarity establishment is…’ and Fig.S2.

      The data in Fig 5 are somewhat internally inconsistent, and inconsistent with the authors' interpretation. In both repolarization conditions, the authors claim that repolarization extends only to row 1, and row 1 is statistically different from non-repolarized row 1, but so too is row 3. Row 2 is not. This makes no sense, and suggests either that the statistical tests are inappropriate and/or the data is too sparse to be meaningful. 

      As we’re sure the reviewer appreciates, this was an extremely complex experiment to perform and analyse. We spent a lot of time trying to find the best way to illustrate the results (finally settling on a 2D vector representation of polarity) and how to show the paired statistical comparisons between different groups. Moreover, in the end we were only able to detect generally quite modest (statistically significant) changes in cell polarity under the experimental conditions.

      However, we note that failure to see large and consistent changes in polarity is exactly the expected result if it is hard to repolarise from a boundary – and this is of course the conclusion that we draw. Conversely, if repolarisation were easy, which was our expectation at least under de novo conditions without existing polarity, then we would have expected large and highly statistically significant changes in polarity across multiple cell rows. Hence we stand by our conclusion that ‘it is hard to repolarise from a boundary of Fz overexpression in both control and de novo polarity conditions’.

      Overall, we were trying to establish three points:

      (1) to demonstrate that repolarisation occurs from a boundary of overexpression i.e. from boundary 0 to row 0

      (2) to establish whether a wave of repolarisation occurs across rows 1, 2 and 3

      (3) to determine whether it is easier to repolarise in the de novo condition than in the control (already polarised) condition

      Taking each in turn:

      (1) To detect repolarisation from a boundary relative to the control condition, we have to compare row 0 in the repolarisation condition (Fig.5G,K) vs the control condition (Fig.5F,J). This comparison shows significant repolarisation (p=0.0014). From here on, row 0 in the repolarisation condition is our reference for repolarisation occurring.

      (2) To determine if there is a wave of repolarisation in the repolarisation condition we have to compare row 0 vs rows 1 to 3 in the repolarisation condition (Fig.5K). Row 1 is not significantly different from row 0, but rows 2 and 3 are different and the vectors show obviously lower polarity than row 0. Hence no wave of repolarisation is detected over rows 1 to 3.

      (3) To determine if it is easier to repolarise in the de novo condition, our reference for establishment of a repolarisation pattern is the polarisation condition in rows 0 to 3. So, we compare the repolarisation condition vs repolarisation in the de novo condition, row 0 vs row 0, row 1 vs row 1, row 2 vs row 2 and row 3 vs row 3 – in each case no significant difference in polarity is detected, supporting our conclusion that it is not easier to repolarise in the de novo condition.

      We agree that the variations in row 3 are puzzling, but there is no evidence that this is due to propagation of polarity from row 0, and so in terms of our three questions, it does not alter our conclusions.

      Changes to manuscript: We have extensively revised the text describing the results in Fig.5 to hopefully make the reasons for our conclusions clearer and also be more cautious in our conclusions in Results section ‘Induced core protein relocalisation…’. 

      For the related boundary intensity data in Fig 6, the authors need to describe exactly how boundaries were chosen or excluded from the analysis. Ideally, all boundaries would be classified as either medio-lateral (meaning anterior-posterior) or proximal-distal depending on angle.

      We thank the reviewer for pointing out that this was not clear.

      All boundaries were classified following their orientation compared to the Fz over-expression boundary using hh-GAL4 expressed in the wing posterior compartment. Horizontal junctions were defined as parallel to the Fz over-expression boundary (between 0 and 45 degrees) and mediolateral junctions as junctions linking two horizontal boundaries (between 45 and 90 degrees).
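The angle rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the analysis code used for the manuscript: the function names, the representation of a junction by its two endpoint coordinates, and the assignment of a junction at exactly 45 degrees to the mediolateral class are all hypothetical choices.

```python
import math

def junction_angle_deg(p1, p2, boundary_angle_deg=0.0):
    """Acute angle (0-90 degrees) between a junction and the reference boundary.

    p1, p2: (x, y) endpoints of the junction (hypothetical input format).
    boundary_angle_deg: orientation of the Fz over-expression boundary in the image.
    """
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    angle = math.degrees(math.atan2(dy, dx)) - boundary_angle_deg
    angle = abs(angle) % 180.0        # a junction is an undirected line
    return min(angle, 180.0 - angle)  # fold into the range 0-90

def classify_junction(p1, p2, boundary_angle_deg=0.0, threshold_deg=45.0):
    """Label a junction 'horizontal' (below the threshold) or 'mediolateral'."""
    angle = junction_angle_deg(p1, p2, boundary_angle_deg)
    # Assumption: a junction at exactly the threshold counts as mediolateral.
    return "horizontal" if angle < threshold_deg else "mediolateral"
```

For example, `classify_junction((0, 0), (1, 0))` labels a junction lying along the boundary as horizontal, while `classify_junction((0, 0), (0, 1))` labels a perpendicular junction as mediolateral.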

      Changes to manuscript: The boundary classification detailed above has been added in the Materials and Methods.

      If the authors believe their Fig 5 and 6 analyses, how do they explain that hairs are reoriented well beyond the region in which the core proteins are reoriented? This would be a dramatic finding, because as far as I know, when core proteins are polarized, prehair orientation always follows the core protein distribution. Surprisingly, the authors do not so much as comment about this. The authors should age their wings just a bit more to see whether the prehair pattern looks more like the adult hair pattern or like that predicted by their protein orientation results.

      Again the reviewer makes an interesting point, and we agree that this is something that we should have more directly addressed in the manuscript.

      There are three reasons why we might expect adult trichomes to show a different effect from the measured core protein polarity pattern seen in our experiments:

      (i) we are assaying core protein polarity at 28h APF, but trichomes emerge at >32h APF, so there is still time for polarity to propagate a bit further from the boundary. We now have added data showing that by the point of trichome initiation, the wave of polarisation extends 3-4 cell rows (Fig.S5A).

      (ii) it has long been known that a strong localisation of core proteins at a cell edge is not required for polarisation of trichome polarity from a boundary. For instance, in Strutt & Strutt 2007 we show clones of cells overexpressing Fz causing propagation through pk[pk-sple] mutant tissue where there is no detectable core protein polarity. We were following up prior observations of Adler et al., 2000 in the wing and Lawrence et al., 2004 in the abdomen.

      (iii) there is evidence to suggest that the polarity of adult trichomes is locally coupled, possibly mechanically. This point is hard to prove without live imaging taking in both initial core protein localisation, the site of actin-rich trichome initiation and then the final orientation of the much larger microtubule-filled trichome, and we’re not aware that such data exist. However, Wong & Adler 1993 (JCB) showed that over a number of hours trichomes become much larger and move towards the centre of the cell, presumably becoming decoupled from any core protein cue. The images in Guild … & Tilney, 2005 (MBoC) are also interesting to look at in this regard. Finally, septate junction proteins have been implicated in local alignment of trichomes, independently of the core pathway (Venema … & Auld, 2004 Dev Biol).

      Changes to manuscript: Added new data in Fig.S5A showing where trichomes initiate under 6h de novo induction conditions, for comparison to core protein localisation and adult trichome data in Fig.5. Added some text explaining why adult trichome repolarisation might be stronger than the observed effects on core protein localisation in Discussion. 

      Minor points

      As the authors know, there is a model in the literature that suggests microtubule trafficking provides a global cue to orient PCP. The authors' repolarization data in Fig 4 make a reasonably convincing case against a role for microtubules in cell-scale signaling, but do not rule out a role as a global cue. The authors should be careful of language such as "...MTs and core proteins being oriented independently of each other" that would appear to possibly also refer to a role as a global cue.

      Thank you for pointing out that this was not clear. We have now modified the text to hopefully address this.

      Changes to manuscript: Text updated in Results section ‘Microtubules do not provide…’.

      Significance:

      There are two negative conclusions and one positive conclusion made by the authors. Provided the above points are addressed, the negative conclusions, that core proteins are not limiting and that microtubules are not involved in cell-scale signaling are solid. The positive conclusion is more nebulous - the authors say that cell-scale signaling is strong relative to cell-cell signaling - but how strong is strong? Strong relative to their prior expectations? I'm not sure how to interpret such a conclusion. Overall, we learn something from these results, though it fails to reveal anything about mechanism. These results will be of some interest to those studying PCP.

      The reviewer raises an interesting point, which is how do you compare the strength of two different processes, even if both processes affect the same outcome (in this case cell polarity). Repolarisation from a boundary has not been carefully studied at the level of core protein localisation in any previous study to our knowledge – this is one of the important novel aspects of this study. Hence there is not a baseline for defining strong repolarisation. Similarly, there has been no investigation of the nature of ‘cell-scale signalling’. This was a considerable challenge for us in writing the manuscript, and we have done our best to find appropriate language that hopefully conveys our message adequately. Minimally our work may provide a baseline for helping to define the ‘strengths’ of these processes in future studies.

      One of our main points is that we can generate an artificial boundary of Fz expression, where Fz levels are at least several fold higher than in the neighbouring cell (e.g. compare Fig.4N’ and O’) and only two rows of cells show a significant change in polarity relative to controls. Even when the tissue next to the overexpression domain is still in the process of generating polarity (de novo condition) then the boundary has little effect on polarity in neighbouring cell rows. This was a result that surprised us, and we tried to convey that by using language to suggest cell-scale signalling was stronger than cell-cell signalling i.e. stronger in terms of the ability to define the final direction of polarity.

      Changes to manuscript: In the revised manuscript we have reviewed our use of language and now avoid saying ‘strong’ but instead use terms such as ‘effective’ and ‘robust’ in e.g. Results section ‘Induced core protein relocalisation…’, the Discussion and we have also changed the title of the manuscript to avoid claiming a ‘strong’ signal.

      Reviewer #2:

      Overview

      This paper aims to dissect the relative importance of the various cues that establish PCP in the wing disc of Drosophila, which remains a prominent and relevant model for PCP. The authors suggest that one must consider cues at three scales (molecular, cell and tissue) and specifically design tests for the importance of cell-level cues, which they call non-local cell scale signalling. They develop clever experimental approaches that allow them to track complex stability and also to induce polarity at experimentally defined times. In a first set of experiments, they restore PCP after the global cues have disappeared (de novo polarisation) and conclude from the results that another (cell scale) cue must exist. In another set of experiments, they show that de novo repolarization is robust to the dosage of various components of core PCP, leading them to conclude that there must be an underlying cell scale polarity, which, apparently, has nothing to do with microtubule or cell shape polarity. They then describe nice evidence that de novo polarisation is relatively short range both in a polarised and unpolarised field. They conclude that there is a strong cell-intrinsic polarity that remains to be characterised.

      Critique

      The experiments described in this paper are of high quality with a sophisticated level of design and analysis. However, there needs to be some recalibration of the extent of the conclusions that can be drawn (see below). Moreover, a limitation of this paper is that, despite the quality of their data, they cannot give a molecular hint about the nature of their proposed cell-scale signal. Below are two key points that the authors may want to clarify.

      (1) The first set of repolarisation experiments is performed after the global cell rearrangements that have been shown to act as a global signal. However, this approach does not exclude the possible contribution of an unknown diffusible global signal.

      A similar point was raised by Reviewer 1. For the convenience of this reviewer, we’ll summarise the arguments against such an unknown cue again below. More broadly, both reviewers asking a similar question indicates that we have failed to lay out the evidence in sufficient detail. In our defence, we have used the same ‘de novo’ paradigm in three previous publications (Strutt and Strutt 2002, 2007; Brittle et al 2022) without attracting (overt) controversy. We have now added text to the Introduction and Results that goes into more detail, as well as more experimental evidence (Fig.S5).

      Firstly, it is worth noting that the global cues acting in the wing are poorly understood, with mostly negative evidence against particular cues accruing in recent years. This makes it a hard subject to succinctly discuss. Secondly, we accept that it is hard to prove there is no influence of global cues, when the nature of those cues and the time at which they act remain unclear. Below we summarise the reasons why we believe there are no significant effects of global cues in our experiments that would influence the interpretation of our results.

      First, our reading of the literature supports a broad consensus that an early radial core planar polarity pattern is realigned by cell flow produced by hinge contraction beginning at around 16h APF (e.g. Aigouy et al., 2010; Strutt and Strutt, 2015; Aw and Devenport, 2017; Butler and Wallingford, 2017; Tan and Strutt, 2025). Taken at face value, this suggests that there are ‘radial’ cues present prior to hinge contraction, maybe coming from the wing margin – arguably these radial cues could be Ft-Ds or Wnts or both, given they are expressed in patterns consistent with such a role (notwithstanding the published evidence arguing against roles for either of these cues). It then appears that hinge contraction supersedes these cues to convert a radial pattern to a proximodistal pattern – whether the radial cues that affect the core pathway earlier remain active after hinge contraction is unclear, although both Ft-Ds and Wnts appear to maintain their ‘radial’ patterns beyond the beginning of hinge contraction (e.g. Merkel et al., 2014; Ewen-Campen et al., 2020; Yu et al., 2020).

      We think that the reviewers are proposing the presence of a proximodistal cue that is active in the proximal region of the wing that we use for our experiments shown e.g. in Fig.5, and that this cue orients core polarity here (but not elsewhere in the wing) in a time window after 18h APF. Ft-Ds and Wnts do not seem to be plausible candidates as they are still in ‘radial’ patterns. This leaves either an unknown proximodistal cue (a gradient of some unknown signalling molecule?), or possibly some ability of hinge contraction to align proximodistal polarity specifically in this wing region but not elsewhere. We cannot definitively rule out either of these possibilities, but neither do we think there is sufficient evidence to justify invoking their existence to explain our observations.

      In particular, the reason that we don’t think there is a proximodistal cue in the proximal part of the wing after 18h APF, is that work from our lab shows that induction of Fz or Stbm expression at times around or after the start of hinge contraction (i.e. >16 h APF) results in increasing levels of trichome swirling with polarity not being coordinated with the tissue axis either proximally or distally (Strutt and Strutt, 2002; Strutt and Strutt 2007). Our simplest interpretation of this is that induction at these stages fails to result in the early radial pattern of core pathway polarity being established and hence a failure of hinge contraction to reorient radial to proximodistal. If hinge contraction alone could specify proximodistal polarity in the absence of the earlier radial polarity, then we would not expect to see swirling over much of the proximal wing (where the forces from hinge contraction are strongest, Etournay et al., 2015).

      In this manuscript, our earliest de novo experiments begin at 18h APF (de novo 10h), then at 20h APF (de novo 8h) and at 22h APF (de novo 6h). The image in Fig. 5B referred to by Reviewer 1, is of a wing where Fz is induced de novo at 22 h APF. In these wings, as expected, the core proteins localise asymmetrically in stereotypical swirling patterns throughout the wing surface (see Fig. 2M and also Strutt and Strutt, 2002; Strutt and Strutt 2007), but – usefully for our experiments – they broadly localise along the proximal-distal axis in the region analysed in Fig. 5B. Given the strong swirling in surrounding regions when inducing at >20h APF, we feel reasonably confident in assuming that the pattern is not due to a proximodistal cue present in the proximal wing. We appreciate that the original manuscript did not show images including the trichome pattern in adjacent regions, so this point would not have been clear, but we now include these in Supplementary Fig.S5. We have also added a note in the legend to Fig. 5B to clarify that the proximodistal pattern seen is local to this wing region.

      Changes to manuscript: Text extended in Introduction and Results to better explain why we believe the de novo conditions that we use most likely result in a polarity pattern that is not significantly influenced by ‘global cues’. We now show zoomed-out images of the area surrounding the experimental region (proximal to the anterior cross-vein) in adult wings, showing that the polarity pattern does not become more proximodistal when induction time is longer, and also that there is not overall proximodistal polarity in proximal regions of the wing, arguing against an unknown proximodistal polarity cue at these stages of development (Fig.S5B-E’’’).

      (2) The putative non-local cell scale signal must be more precisely defined (maybe also given a better name). It is not clear to me that one can separate cell-scale from molecular-scale signal.

      Local signals can redistribute within a cell (or membrane) so local signals are also cell-scale. Without a clear definition, it is difficult to interpret the results of the gene dosage experiments. The link between gene dosage and cell-scale signal is not rigorously stated. Related to this, the concluding statement of the introduction is too cryptic.

      We thank the reviewer for raising this, as again a similar comment was made by Reviewer 1, so we are clearly falling short in defining the term. We have now had another attempt in the Introduction.

      To more specifically answer the point made by the reviewer regarding molecular vs cellular, we are essentially being guided here by the prior computational modelling work, as at the biological level the details are still being worked out. A specific class of previous models only allowed ‘signals’ between core proteins to act ‘locally’, meaning within a cell junction, and within the models there was no explicit mechanism by which proteins on other junctions could ‘detect’ the polarity of a neighbouring junction (e.g. Amonlirdviman et al., 2005; Le Garrec et al., 2006; Fischer et al., 2013). Other models implicitly or explicitly encode a mechanism by which cell junctions can be influenced by the polarity of other junctions (e.g. Meinhardt, 2007; Burak and Shraiman, 2009; Abley et al., 2013; Shadkhoo and Mani, 2019), for instance by diffusion of a factor produced by localisation of particular planar polarity proteins.

      We agree with the reviewer that a cell-scale signal will depend on ‘molecules’ and thus could be called ‘molecular-scale’, but here by ‘molecular-scale’ we mean signals that act at the range of the sizes of molecules i.e. nanometers, rather than cell-scale signals that act at the size of cells i.e. micrometers. A caveat to our definition is that we implicitly include interactions that occur locally on cell junctions (<1 µm range) within ‘molecular-scale’, but this is a shorter range than ‘cellular-scale’ which requires signals acting over the diameter of a cell (3-5 µm). Nevertheless, we think the concept of ‘molecular-scale’ vs ‘cell-scale’ is a helpful one in this context, and have attempted to address the issue through a more careful definition of the terms.

      Changes to manuscript: Text revised in Introduction and legend to Fig.1 to more carefully define ‘cell-scale signalling’ and to distinguish it from ‘molecular-scale signalling’. Final sentence of Introduction also altered so we no longer cryptically speculate on the nature of the cell-scale signal but leave this to the Discussion.

      Minor comments. 

      Some of the (clever) genetic manipulation may need more details in the text. For example:

      - Need to specify if the hs-flp approach induces expression throughout the tissue.

      We apologise for the lack of clarity. In all the experiments, the hs-FLP transgene is present in all cells, and heat-shock results in ubiquitous expression. 

      Changes to manuscript: We have clarified this in the Results and Materials and Methods.

      - Need to specify in the text that in the unpolarised condition the tissue is both dsh and fz mutant.

      The reviewer is of course correct and we have updated this point in the text. The full genotype for the unpolarised condition is: w dsh[1] hsFLP22/y;; Act>>fz-mKate2sfGFP, fz[P21]/fz[P21] (see Table S1). So this line is mutant for dsh and fz with induced expression of Fz-mKate2sfGFP.

      Changes to manuscript: We have clarified this in the relevant part of the Results.

      - Need to specify in the text that the experiment illustrated in Fig 5 is with hh-gal4. 

      As noted by the reviewer, we continued to use the same hh-GAL4 repolarisation paradigm as in Fig.4 and this information was in the legend to Fig.5. However, we agree it is helpful to be explicit about this in the main text.

      Changes to manuscript: We have added this to this section of the Results.

      - Need to address a possible shortcoming of the hh experiment, that the AP boundary is a region of high tension.

      It is true that the AP boundary is under high tension in the wing disc (e.g. Landsberg et al., 2009). But we are not aware of any evidence that this higher tension persists into the pupal wing. In separate studies we have labelled for Myosin II in pupal wings (Trinidad et al 2025 Curr Biol; Tan & Strutt 2025 Nature Comms), and as far as we have noticed have not seen preferentially higher levels on the AP boundary. We think if tension were higher, the cell boundaries would appear straighter than in surrounding cells (as seen in the wing disc) and this is not evident in our images.

      - Need to dispel the possibility that there is residual polarisation (e.g. of other components) in fz1 mutant (I assume this is the case).

      We use the null allele fz[P21] throughout this work, and we and others have consistently reported a complete loss of polarisation of other core proteins or downstream components in this background. The caveat to this is that core proteins that persist at cell junctions always appear at least slightly punctate in mutant backgrounds for other core proteins, and so any automated detection algorithm will always find evidence of individual cell polarity above a baseline level of uniform distribution. Hence we tend to use lack of local coordination of polarity (variance of cell polarity angle) as an additional measure of loss of polarisation, in addition to direct measures of average cell polarity. (We discuss this in the QuantifyPolarity manuscript Tan et al 2021 e.g. Fig.S6).
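To illustrate what a variance-of-angle measure of local coordination can look like, here is a minimal sketch. It is not the QuantifyPolarity implementation: it assumes that cell polarity angles are axial (an angle and the same angle plus 180 degrees describe the same axis) and uses the standard angle-doubling form of circular variance. The function name and the input format (a list of angles in degrees for a group of neighbouring cells) are hypothetical.

```python
import math

def axial_circular_variance(angles_deg):
    """Circular variance of axial angles: 0 = perfectly aligned, 1 = no coordination."""
    if not angles_deg:
        raise ValueError("need at least one angle")
    # Double each angle so that theta and theta + 180 map to the same direction.
    doubled = [math.radians(2.0 * a) for a in angles_deg]
    mean_cos = sum(math.cos(t) for t in doubled) / len(doubled)
    mean_sin = sum(math.sin(t) for t in doubled) / len(doubled)
    # Length of the mean resultant vector, between 0 and 1.
    resultant = math.hypot(mean_cos, mean_sin)
    return 1.0 - resultant
```

Under these assumptions, a group of cells with individually detectable but uncoordinated polarity gives a value close to 1, whereas a well-coordinated group gives a value close to 0, whatever the magnitude of polarity in each cell.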

      Changes to manuscript: We now include in the Materials and Methods section ‘Fly genetics…’ a much more extensive explanation of the evidence for specific mutant alleles being ‘null’ for planar polarity function (including dsh1 as raised by Reviewer 1), specifically that they result in no detectable planar polarisation of either other core proteins or downstream effectors, and added appropriate references.

      - Need to provide evidence that 50% gene dosage commensurately affects protein level.

      This is a good suggestion. In the case of Stbm, we have already published a western blot showing that a reduction in gene dosage results in reduced protein levels (Strutt et al 2016, Fig.S6). We have now performed western blots to quantify protein levels upon reduction of fmi, pk and dgo levels (we actually used EGFP-dgo for the latter, as we don’t have antibodies that can detect endogenous Dgo on western blots).

      Changes to manuscript: When presenting the dosage reduction experiments, we now refer back to Strutt et al., 2016 explicitly for Stbm, and have added western blot data for Fmi, Pk and EGFP-Dgo in new Fig.S2.

      - I am surprised that the relationship with microtubule polarity was never investigated. Is this true? 

      We agree this is a point that needed further clarification, as Reviewer 1 made a related point regarding the two possible roles for microtubules, one being as a mediator of a global cue upstream of the core pathway, and the second (which we investigate in this manuscript) as a mediator of a cell-scale signal downstream of the core pathway.

      Both the Uemura and Axelrod groups have published on potential upstream function as a global cue mediator in the Drosophila wing (e.g. Shimada et al., 2006; Harumoto et al., 2010; Matis et al., 2014).

      Both groups have also looked at whether core pathway components could affect orientation of microtubules (Harumoto et al., 2010; Olofsson et al., 2014; Sharp and Axelrod 2016). Notably Harumoto et al., 2010 observed that in 24h APF wings, loss of Fz or Stbm did not alter microtubule polarity from a proximodistal orientation consistent with the microtubules aligning along the long cell axis in the absence of other cues. However, this did not rule out an instructive effect of Fz or Stbm on microtubule polarity during core pathway cell-scale signalling. The Axelrod lab manuscripts saw interesting effects of Pk protein isoforms on microtubule polarity, albeit not throughout the entire wing, which hinted at a potential role in cell-scale signalling. Taken together this prior work was the motivation for our directed experiments to specifically test whether the core pathway might generate cell-scale polarity by instructing microtubule polarity.

      Changes to manuscript: We have revised the Results section ‘Microtubules do not…’ to make a clearer distinction regarding possible ‘upstream’ and ‘downstream’ roles of microtubules in Drosophila core pathway planar polarity and the motivation for our experiments investigating the latter.

      - The authors suggest that polarity does not propagate as a wave. And yet the range measured in adult is longer than in the pupal wing. Explain. 

      Again an excellent point, also made by Reviewer 1, which we have now addressed explicitly in the manuscript. For the convenience of this reviewer, we lay out the reasons why we think the propagation of polarity seen in the adult is further than seen for core protein localisation.

      There are three reasons why we might expect adult trichomes to show a different effect from the measured core protein polarity pattern seen in our experiments:

      (i) we are assaying core protein polarity at 28h APF, but trichomes emerge at >32h APF, so there is still time for polarity to propagate a bit further from the boundary. We now have added data showing that by the point of trichome initiation, the wave of polarisation extends 3-4 cell rows (Fig.S5A).  

      (ii) it has long been known that a strong localisation of core proteins at a cell edge is not required for polarisation of trichome polarity from a boundary. For instance, in Strutt & Strutt 2007 we show clones of cells overexpressing Fz causing propagation through pk[pk-sple] mutant tissue where there is no detectable core protein polarity. We were following up prior observations of Adler et al 2000 in the wing and Lawrence et al 2004 in the abdomen.

      (iii) there is evidence to suggest that the polarity of adult trichomes is locally coupled, possibly mechanically. This point is hard to prove without live imaging taking in both initial core protein localisation, the site of actin-rich trichome initiation and then the final orientation of the much larger microtubule-filled trichome, and we’re not aware that such data exist. However, Wong & Adler 1993 (JCB) showed that over a number of hours trichomes become much larger and move towards the centre of the cell, presumably becoming decoupled from any core protein cue. The images in Guild … & Tilney, 2005 (MBoC) are also interesting to look at in this regard. Finally, septate junction proteins have been implicated in local alignment of trichomes, independently of the core pathway (Venema … & Auld, 2004 Dev Biol).

      Changes to manuscript: Added new data in Fig.S5A showing where trichomes initiate under 6h de novo induction conditions, for comparison to core protein localisation and adult trichome data in Fig.5. Added some text explaining why adult trichome repolarisation might be stronger than the observed effects on core protein localisation in Discussion. 

      - The discussion states that the cell-intrinsic system remains to be fully characterised, implying that it has been partially characterised. What do we know about it? 

      As the reviewer probably realises, we were attempting to side-step a long speculative discussion about the various hints and ideas in the literature by grouping them under the umbrella of ‘remaining to be fully characterised’. We would argue that this current manuscript is the first to attempt to systematically investigate the nature of ‘cell-scale signalling’. The lack of prior work is probably due to two factors (i) pioneering theoretical work showed that a sufficiently strong global signal coupled with ‘local’ (i.e. confined to one cell junction) protein interactions was sufficient to polarise cells without the need to invoke the existence of a cell-scale signal; (ii) there is no easy way to identify cell-scale signals as their loss results in loss of polarity which will also occur if other (i.e. more locally acting) core pathway functions are compromised.

      The main investigation of the potential for cell-scale signalling has been another set of theory studies (Burak and Shraiman 2009; Abley et al., 2013; Shadkhoo and Mani 2019) which have considered the possibility of diffusible signals. In our present work we have further considered the possibility of a ‘depletion’ model, based on the pioneering theory work of Hans Meinhardt, and as discussed above the possibility that microtubules could mediate a cell-scale signal.

      Changes to manuscript: We have revised the Discussion to hopefully be clearer about the current state of knowledge.

      Reviewer #3:

      The manuscript by Carayon and Strutt addresses the role of cell-scale signaling during the establishment of planar cell polarity (PCP) in the Drosophila pupal wing. The authors induce locally the expression of a tagged core PCP protein, Frizzled, and observe and analyze the de novo establishment of planar cell polarity. Using this system, the authors show that PCP can be established within several hours, that PCP is robust towards variation in core PCP protein levels, that PCP proteins do not orient microtubules, and that PCP is robust towards 'extrinsic' repolarization. The authors conclude that the polarization at the cell-scale is strongly intrinsic and only weakly affected by the polarity of neighboring cells. 

      Major comments

      The data are clearly presented and the manuscript is well written. The conclusions are well supported by the data. 

      (1) The authors use a system to de novo establish PCP, which has the advantage of excluding global cues orienting PCP and thus to focus on the cell-intrinsic mechanisms. At the same time, the system has the limitation that it is unclear to what extent de novo PCP establishment reflects 'normal' cell scale PCP establishment, in particular because the Gal4/UAS expression system that is used to induce Fz expression will likely result in much higher Fz levels compared with the endogenous levels. The authors should briefly discuss this limitation. 

      We apologise if this wasn’t clear. We only used GAL4/UAS overexpression when we were generating an artificial boundary of Fz expression with hh-GAL4 to induce repolarisation. The de novo induction system involves Fz::mKate2-sfGFP being expressed directly under an Act5C promoter without use of GAL4/UAS. In response to a comment from Reviewer 1 we have now carried out western blot analysis which shows that Fz::mKate2-sfGFP levels under Act5C are actually lower than endogenous Fz levels. As we achieve normal levels of polarity, similar to those in wild-type conditions when measured using QuantifyPolarity, we therefore assume that Fz levels are not limiting under these conditions. However, we note that lower than normal levels of Fz might sensitise the system to perturbation, which in fact would be advantageous in our study, as it might for instance have been expected to more readily reveal dosage sensitivity of other components.

      Changes to manuscript: We now describe the levels of expression achieved using the de novo induction system (Fig.S1C-D) and discuss possible consequences in the relevant Results sections and Discussion.

      (2) Fig. 3. The authors use heterozygous mutant backgrounds to test the robustness of de novo PCP establishment towards (partial) depletion in core PCP proteins. The authors conclude that de novo polarization is 'extremely robust to variation in protein level'. Since the authors (presumably) lowered protein levels by 50%, this conclusion appears to be somewhat overstated. The authors should tune down their conclusion. 

      Reviewer 1 makes a similar point about whether we can argue that the lack of sensitivity to a 50% reduction in protein levels actually rules out the depletion model. To address the comments of both reviewers we have now added some further narrative and caveats in the text.

      We nevertheless believe that the experiments shown effectively make the point that there is no strong dosage sensitivity – and it remains our contention that if protein levels were the key to setting up cell-scale polarity, then a 50% reduction would be expected to show an effect on the rate of polarisation. We further note that as Fz::mKate2-sfGFP levels are lower than endogenous Fz levels, the system might be expected to be sensitised to further dosage reductions, and despite this we fail to see an effect on rate of polarisation.

      In a similar vein, Reviewer 2 requested data on whether dosage reduction altered protein levels by the expected amount. We have now added further explanation/references and western blot data to address this.

      Changes to manuscript: Added some narrative and caveats regarding whether lowering levels more than 50% would add to our findings in the Discussion. Revised conclusions to be more cautious including altering section title to read ‘Planar polarity establishment is not highly sensitive to variation in protein levels of core complex components’.

      Also added westerns and text/references showing that for the tested proteins there is a reduction in protein levels upon removal of one gene dosage in Results section ‘Planar polarity establishment is…’ and Fig.S2.

      Minor comments 

      (1) Page 3. The authors mention and reference that they used the PCA method to quantify cell polarity magnification and magnitude. It would help the unfamiliar reader, if the authors would briefly describe the principle of this method. 

      Changes to manuscript: More details have been added in Materials & Methods.

      Significance:

      The manuscript contributes to our understanding of how planar cell polarity is established. It extends previous work by the authors (Strutt and Strutt, 2002,2007) that already showed that induction of core PCP pathway activity by itself is sufficient to induce de novo PCP. This manuscript further explores the underlying mechanisms. The authors test whether de novo PCP establishment depends on an 'inhibitory signal', as previously postulated (Meinhardt, 2007), but do not find evidence. They also test whether core PCP proteins help to orient microtubules (which could enhance cell intrinsic polarization of core PCP proteins), but, again, do not find evidence, corroborating previous work (Harumoto et al, 2010). The most significant finding of this manuscript, perhaps, is the observation that local de novo PCP establishment does not propagate far through the tissue. A limitation of the study is that the mechanisms establishing intrinsic cell scale polarity remain unknown. The work will likely be of interest to specialists in the field of PCP.

    1. L ‘„»I2'8

      If we can assume by the sign-off that he began his book in 1938, and then published in 1944, that would place us in Switzerland during Nazi Germany and the beginning of WWII. I wonder if any of his writings in this book were influenced by current events and if he considered war strategy as a form of play. It is easy for us to think of war as play, but for those who lived through it, it may have seemed like an outrageous statement.

    1. It's not: Can schools save more of our students? Because I think we have the answer to that -- and it's yes they can, if we save our schools first. We can start by caring about the education of other people's children ...

      Tying the amount of money we have lost as a nation to the lack of attention paid to the education system was an interesting point. The financial loss could sway people who previously did not care about other people's children (and their education). Due to the current state of the country, it may be difficult to get people to "start caring about other people's children" in terms of improving the condition of our current educational system, but the financial implications and lost earning potential could sway stakeholders to invest in educational reform.

    1. This is because our expectations are often based on previous experience and patterns we have observed and internalized, which allows our brains to go on “autopilot” sometimes and fill in things that are missing or overlook extra things.

      This sentence is very relatable. It highlights how our brains rely on past experiences and familiar patterns to make sense of what’s around us, sometimes without us even realizing it. The idea of going on “autopilot,” as stated in the text, is something I experience often. For example, there are times when I’m sitting in my living room and I think I see someone walking past my big front window. But when I actually look outside, there’s nobody there. This has happened multiple times, and I’ve always wondered why. Now, I think it’s because the walkway to the front door is right outside that window, so my brain may be expecting someone to come up to the door.

    1. We anticipate that layers that account for this depth order, e.g. through convolutions or possibly self-attention (as used in spatio-temporal graphs (e.g. Guo et al. 2019, Su et al. 2020)), will often be complementary to other layers acting on the topology (encoded in the phylogenetic graph), e.g. through graph convolutions.

      Related to the pooling operator, I think large gains may come from the use of 1) edge weights in your GCN layers so that not all neighbors are treated equally by the message passing mechanism, and 2) alternative MPNN layer types, including use of the graph attention mechanism (i.e. GAT) or graph transformers, which use the attention mechanism to learn which neighbors are more "important." I suspect that even with simple mean-pooling, these alternative layer types will be much more performant and generalizable (e.g. from CRBD to BiSSE). In effect, the GCN layers (particularly without edge weights) are more akin to the CRBD in that they assume uniform, homogeneous contribution by all neighbors to feature updates.
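
      As a sketch of the distinction drawn here, a uniform mean-aggregation layer, an edge-weighted variant, and a GAT-style attention layer could be written as below. The notation is assumed for illustration and is not taken from the annotated paper: h_j are node features, N(i) is the set of neighbors of node i, W is a learned weight matrix, σ is a nonlinearity, w_ij are fixed edge weights, and α_ij are learned attention coefficients.

```latex
\begin{align}
h_i' &= \sigma\Big( W \frac{1}{|N(i)|} \sum_{j \in N(i)} h_j \Big)
  && \text{(uniform: all neighbors contribute equally)} \\
h_i' &= \sigma\Big( W \frac{\sum_{j \in N(i)} w_{ij}\, h_j}{\sum_{j \in N(i)} w_{ij}} \Big)
  && \text{(fixed edge weights)} \\
h_i' &= \sigma\Big( \sum_{j \in N(i)} \alpha_{ij}\, W h_j \Big), \quad
\alpha_{ij} = \operatorname{softmax}_j\Big( \mathrm{LeakyReLU}\big( a^{\top} [\, W h_i \,\|\, W h_j \,] \big) \Big)
  && \text{(learned attention)}
\end{align}
```

      In the first form every neighbor enters with the same coefficient, which is the homogeneity assumption referred to above; the other two forms relax it, either through supplied weights or through coefficients learned from the features.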

    1. And some have suggested we may have been thinking about agriculture wrong. It now seems likely that agriculture began in a very gradual process that goes back much farther than we had imagined.

      I find it interesting how our understanding of the agricultural revolution has changed over the years. We as humans tend to think about history, and really a lot of things, in a chronological order. We’ve learned over the years that it isn’t always the case, especially in our understanding of pre-written eras.

    1. Author response:

      Reviewer 1:

      (1) Line 65 "(Figure 1A). Inactivation causes a change in the leg's rest position; however, in preliminary experiments, the body rotation did not have a large effect on the rest positions of the leg following inactivation. This result is consistent with the one already reported for stick insects and shows that passive forces within the leg are much larger than the gravitational force on a leg and dominate limb position [1]." This is the direct replication of the previous work by Hooper et al 2009 and therefore authors should ideally show the data for this condition (no weight attached).

      We did not present these data (the effect of inactivation on the leg’s rest position in an unweighted leg) because the effect was already reported in the case of stick insects. However, we understand the reviewer’s point that it is important to present the data showing this replication. We will do so in the revised version.

      (2) The authors use vglut-gal4, a very broad driver for inactivating motor neurons. The driver labels all glutamatergic neurons, including brain descending neurons and nerve cord interneurons, in addition to motor neurons. Additionally, the strength of inactivation might differ in different neurons (including motor neurons) depending on the expression levels of the opsins. As a result, in this condition, the authors might not be removing all active forces. This is a major caveat that authors do not address. They explore that they are not potentially silencing all inputs to muscles by using an additional octopaminergic driver, but this doesn't address the points mentioned above. At the very least, the authors should try using other motor neuron drivers, as well as other neuronal silencers. This driver is so broad that authors couldn't even use it for physiology experiments. Additionally, the authors could silence VGlut-labeled motor neurons and record muscle activity (potentially using GCaMP as has been done in several recent papers cited by the authors, Azevedo et al, 2020) as a much more direct readout.

      This reviewer critique relates to the use of vglut-gal4, a broad driver, to inactivate motor neurons (MNs). The reviewer argues that the use of a broad driver might result in some effects that are not due to MN inactivation. Conversely, it is possible that not all MNs are inactivated. These critiques raise important points that we will address in the revision by 1) performing experiments with other MN drivers as suggested by the reviewer, and 2) performing experiments in flies that are inactivated by freezing. These measurements will provide other estimates of passive forces, allowing us to better triangulate the range of values for the passive forces. Moreover, it appears that one of the reviewer’s main concerns is that the passive forces are overestimated because of the residual active forces. We will discuss this possibility in detail. It is important to note that in the end what we hope to accomplish is to provide a useful estimate of the passive forces. It is unlikely that the passive force will be a precise number like a physical constant as the passive forces likely depend on recent history.

      (3) Figure 4 uses an extremely simplified OpenSim model that makes several assumptions that are known to be false. For example, the Thorax-Coxa joint is assumed to be a ball and socket joint, which it is not. Tibia-tarsus joint is completely ignored and likely makes a major contribution in supporting overall posture, given the importance of the leg "claw" for adhering to substrates. Moreover, there are a couple of recent open-source neuromechanical models that include all these details (NeuromechFly by Lobato-Rios et al, 2022, Nat. Methods, and the fly body model by Vaxenburg et al, 2025, Nature). Leveraging these models to rule in or rule out contributions at other joints that are ignored in the authors' OpenSim model would be very helpful to make their case.

      Our OpenSim model predates the newer mechanical model. In the revised manuscript, we will revisit the model in light of recent developments.

      (4) Figure 5 shows the experimental validation of Figure 4 simulations; however, it suffers from several caveats.

      a) The authors track a single point on the head of the fly to estimate the height of the fly. This has several issues. Firstly, it is not clear how accurate the tracking would be. Secondly, it is not clear how the fly actually "falls" on VGlut silencing; do all flies fall in a similar manner in every trial? Almost certainly, there will be some "pitch" and "roll" in the way the fly falls. These will affect the location of this single-tracked point in a way that doesn't reflect the authors' expectations. Unless the authors track multiple points on the fly and show examples of tracked videos, it is hard to believe this dataset and, hence, any of the resulting interpretations.

      b) As described in the previous point, the "reason" the fly falls on silencing all glutamatergic neurons could be due to silencing all sorts of premotor/interneurons in addition to the silencing of motor neurons.

      c) (line 175) "The first finding is that there was a large variation in the initial height of the fly (Figure 5C), consistent with a recent study of flies walking on a treadmill[20]." The cited paper refers to how height varies during "walking". However, in the current study, the authors are only looking at "standing" (i.e. non-walking) flies. So it is not the correct reference. In my opinion, this could simply reflect poor estimation of the fly's height based on poor tracking or other factors like pitch and roll.

      d) "The rate at which the fly fell to the ground was much smaller in the experimental flies than it was in the simulated flies (Figure 5E). The median rate of falling was 1.3 mm/s compared to 37 mm/s for the simulated flies (Figure 5F). (Line 190) The most likely reason for the longer than expected time for the fly to fall is delays associated with motor neuron inactivation and muscle inactivation." I don't believe this reasoning. There are so many caveats (which I described in the above points) in the model and the experiment, that any of those could be responsible for this massive difference between experiment and modeling. Simply not getting rid of all active forces (inadequate silencing) could be one obvious reason. Other reasons could be that the model is using underestimates of passive forces, as alluded to in point 3.

      (4a) Although we agree that measuring different points on the body would allow us to estimate the moments, we disagree that the height of the fly cannot be evaluated from the measurement of a single point. The measurements have been performed using the same techniques that we used to assess the fly’s height in a different study where we estimated the resolution of our imaging system to be ~20 mm (Chun et al. 2021). We will include these details in the revised manuscript. The videos showing the falling experiments are not currently available or referenced in the manuscript. These will be made available.

      b) We will repeat the “falling” experiment with a more restrictive driver.

      c) We disagree with the reviewer on this point. The system has a resolution of ~20 mm and is sufficient to draw conclusions about the difference in the height of the fly. We will clarify this point in the revised manuscript.

      d) We do not follow the reviewer’s rationale here. The passive forces (along with any residual forces) are the same in the model as in the experiment. Moreover, there will be a delay between light onset, neuronal inactivation and muscle inactivation. These processes are not instantaneous. In Figure 6, we estimate these delays and conclude that they will substantially slow the fall. In the revised manuscript, we will discuss the other reasons for the delay suggested by the reviewer.

      (5) Final figure (Figure 6) focuses on understanding the time course of neuronal silencing. First of all, I'm not entirely sure how relevant this is for the story. It could be an interesting supplemental data. But it seems a bit tangential. Additionally, it also suffers from major caveats.

      a) The authors now use a new genetic driver for which they don't have any behavioral data in any previous figures. So we do not know if any of this data holds true for the previous experiments. The authors perform whole-cell recordings from random unidentified motor neurons labeled by E49-Gal4>GtACR1 to deduce a time constant for behavioral results obtained in the VGlut-Gal4>GtACR1 experiments.

      b) The DMD setup is useful for focal inactivation, however, the appropriate controls and data are not presented. Line 200 "A spot of light on the cell body produces as much of the hyperpolarization as stimulating the entire fly (mean of 11.3 mV vs 13.1 mV across 9 neurons). Conversely, excluding the cell body produces only a small effect on the MN (mean of 2.6 mV)." First of all, the control experiment for showing that DMD is indeed causing focal inactivation would be to gradually move the spot of light away from the labeled soma, i.e. to the neighboring "labelled" soma and show that there is indeed focal inactivation. Instead the authors move it quite a long distance into unlabeled neuropil. Secondly, I still don't get why the authors are doing this experiment. Even if we believe the DMD is functioning perfectly, all this really tells us is that a random subset of motor neurons (maybe 5 or 6 cells, legend is missing this info) labeled by E49-Gal4 is strongly hyperpolarized by its own GtACR1 channel opening, rather than being impacted because of hyperpolarizations in other E49-Gal4 labeled neurons. This has no relevance to the interpretation of any of the VGlut-Gal4 behavioral data. VGlut-Gal4 is much broader and also labels all glutamatergic neurons, most of which are inhibitory interneurons whose silencing could lead to disinhibition of downstream networks.

      (5 a) However, we can address the reviewer critique by recording from the Vglut line while using a MN line to target the recordings to MNs.

      b) Once we use the Vglut driver to perform these recordings, it will help assess how much of the MN inactivation is due to the GtACR expressed in the MN versus other neurons.

      Reviewer 2:

      While (as mentioned above) the study's conclusions are well-supported by the results and modeling, limitations arise because of the assumptions made. For instance, using a linear approximation may not hold at larger joint angles, and future studies would benefit from accounting for nonlinearities. Future studies could also delve into the source of passive forces, which is important for more deeply understanding the anatomical and physical basis of the results in this study. For instance, assessments of muscle or joint properties to correlate stiffness values with physical structure might be an area of future consideration.

      We agree with these comments but believe that these studies represent avenues for future work.

      Reviewer 3:

      (1) Passive torques are measured, but only some short speculative statements, largely based on previous work, are offered on their functional significance; some of these claims are not well supported by experimental evidence or theoretical arguments. Passive forces are judged as "large" compared to the weight force of the limb, but the arguably more relevant force is the force limb muscles can generate, which, even in equilibrium conditions, is already about two orders of magnitude larger. The conclusion that passive forces are dynamically irrelevant seems natural, but contrasts with the assertion that "passive forces [...] will have a strong influence on limb kinematics". As a result, the functional significance of passive joint torques in the fruit fly, if any, remains unclear, and this ambiguity represents a missed opportunity. We now know the magnitude of passive joint torques - do they matter and for what? Are they helpful, for example, to maintain robust neuronal control, or a mechanical constraint that negatively impacts performance, e.g., because they present a sink for muscle work?

      To us, measuring passive forces was the first step to understanding neural/biomechanical control of the limb. In general, we agree with these comments and would like to understand the role of passive forces in the overall control of the limb. A complete discussion of the functional significance of passive forces in the control of the limb is beyond the scope of this study. We would like to note that it is unlikely that the active forces are two orders of magnitude larger during unloaded movement of the limb. However, these issues will have to be settled in future work.

      (2) The work is framed with a scaling argument, but the assumptions that underpin the associated claims are not explicit and can thus not be evaluated. This is problematic because at least some arguments appear to contradict textbook scaling theory or everyday experience. For example, active forces are assumed to scale with limb volume, when every textbook would have them scale with area instead; and the asserted scaling of passive forces involves some hidden assumptions that demand more explicit discussion to alert the reader to associated limitations. Passive forces are said to be important only in small animals, but a quick self-experiment confirms that they are sufficient to stabilize human fingers or ankles against gravity, systems orders of magnitude larger than an insect limb, in seeming contradiction with the alleged dominance of scale. Throughout the manuscript, there are such and similar inaccuracies or ambiguities in the mechanical framing and interpretation, making it hard to fairly evaluate some claims, and rendering others likely incorrect.

      We interpret this comment as making two separate points. The first is that our statement that active forces depend on the third power of limb length, L<sup>3</sup>, is incorrect. We agree and apologize for this oversight. Specifically, on L6-7 we say, “both inertial forces and active forces scale with the mass of the limb which in turn scales with the volume of the limb and therefore depends on the third power of limb length (L<sup>3</sup>)”. Instead, this statement should read “inertial forces scale with the mass of the limb which in turn scales with the volume of the limb and therefore depends on the third power of limb length (L<sup>3</sup>)”. However, this oversight does not affect the scaling argument, as the scaling arguments in the rest of the manuscript involve only inertial forces and not active forces.

      The second point is about the scaling law that governs passive forces. In the current manuscript, we have assumed that the passive forces scale as L<sup>2</sup> based on previous work. The reviewer has pointed out that this assumption might be incorrect or at the very least needs a rationale. We agree with this assessment: passive forces that arise in the muscle are likely to scale as L<sup>2</sup> but passive forces that arise in the joint might not. In the revised manuscript, we will discuss this concern.
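
      To make the scaling argument explicit, a minimal sketch follows. It assumes, as stated above, that inertial forces scale with limb volume and that passive forces scale as L<sup>2</sup>; the symbols F_inertial and F_passive are introduced here only for illustration.

```latex
\begin{align}
F_{\text{inertial}} &\propto m \propto L^{3}, \\
F_{\text{passive}}  &\propto L^{2}, \\
\frac{F_{\text{passive}}}{F_{\text{inertial}}} &\propto \frac{L^{2}}{L^{3}} = L^{-1}.
\end{align}
```

      Under these assumptions the relative contribution of passive forces grows as limb length decreases. If passive forces arising in the joint scale with a different exponent, the ratio changes accordingly.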

      Response to the public comment:

      There was a comment from a reader: “None of our work cited in various places in this preprint (i.e., Zakotnik et al. 2006, Guschlbauer et al. 2007, Page et al. 2008, Hooper et al. 2009, Hooper 2012, Ache and Matheson 2012, Blümel et al. 2012, Ache and Matheson 2013, von Twickel et al. 2019, and Guschlbauer et al. 2022) claims or implies that passive forces could be sufficient to support the weight of an insect or any animal. To claim or suggest otherwise (as done in lines 33-35) is incorrect and sets up a misleading straw man that misrepresents our work. All statements in the preprint regarding our work related to this specific matter need to be removed or edited accordingly. For instance, the investigations, calculations, and interpretations in Hooper et al. 2009 are solely about limbs that are not being used in stance or other loaded tasks (indeed, the article's title specifically refers to "unloaded" leg posture and movements). Trying to use this work to predict whether passive muscle forces alone can support a stick insect against gravity requires considering much more than the oversimplified calculation given in lines 290-292. Other “back of the envelope calculations” (lines 299-300) are likely also insufficient and erroneous. The discussion in lines 289-304 needs to be edited accordingly”

      We thank the reader for their comment. However, we interpret these studies differently. The studies above rightly focused on unloaded legs because it would be difficult to study passive forces in an intact insect without genetic tools. The commenter correctly points out that these studies do not comment on whether passive forces are strong enough to support the weight of the fly. However, we disagree that our arguments based on their results are unreasonable or a straw man. We think that our interpretation of their measurements is correct. Moreover, we were motivated by Yox et al. 1982, who state in so many words: “Stiffness of the muscles in the joints of all the legs might be sufficient to support a resting arthropod. A more rigorous analysis of all supporting limbs and joint angles would be required to prove this hypothesis”. We were inspired by this comment. In the revised manuscript, we will make it clear that the statement made in Line 33 is based on Yox et al. and our interpretation of measurements made by others.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      GENERAL COMMENTS

      We thank the three reviewers for their comments on the paper.

      We are pleased to see that they consider it to be a comprehensive and well-executed study, which clearly establishes a previously overlooked connection between MRTF-SRF signalling and proliferation, and that its conclusions require no further experimentation.

      As review 3 points out, this work has implications for cancer biology, and suggests new research routes to understand the relation between cell adhesion, proliferation, and transformation.

      However, two referees raise significant concerns about its impact.

      Review 1 suggests that the paper lacks impact without exploration of the wider biological significance of our observations, although it considers it to be a good basic cell biology study. It suggests further work extending the findings to tissue- or tumor-based systems. While we consider such studies worthwhile – indeed we are currently pursuing these directions – we consider them beyond the scope of the present paper.

      Review 2 questions the novelty of our findings. We strongly disagree. This is the first study to show that MRTF-SRF signalling is required for the proliferation of both primary and immortalised fibroblasts, and epithelial cells. We show that MRTF inactivation leads cells to enter a quiescence-like state under conditions that would permit efficient cell cycle progression in wildtype cells. The study will alter the field's perspective on the role of MRTF-SRF signalling, previously viewed as concerned with cell adhesion, morphology, and motility.

      Responses to individual reviews (italic) follow in regular text.

      RESPONSE TO INDIVIDUAL REVIEWS (comments in italic, response in regular, changes made)

      Reviewer #1

      (Evidence, reproducibility and clarity (Required)):

      The manuscript by Neilsen et al. presents a thorough and well-structured study showing that Myocardin-related transcription factors (MRTF-A/B), via MRTF-SRF, are essential for the proliferation of both primary and immortalized fibroblasts and epithelial cells. Using a combination of knockouts/rescue experiments, cytoskeletal analysis, and transcriptomics, the authors demonstrate that MRTF-SRF signalling controls actin dynamics and contractility, key drivers of cell cycle progression. Notably, they show that the proliferative arrest caused by MRTF loss is reversible, distinguishing it from classical senescence.

      Major points

      • The link between MRTF-SRF activity, cytoskeletal organisation, and cell proliferation is clearly established. The fact that disrupting contractility phenocopies MRTF loss strengthens the case that the pathway acts through mechanical control.
      • The authors support their conclusions using multiple cell types (MEFs, primary fibroblasts, epithelial cells), a range of complementary assays (RNA-seq, traction force microscopy, adhesion/spreading), and genetic tools (CRISPR, inducible rescue).
      • The ability to restore proliferation by re-expressing MRTF-A argues against true senescence and instead suggests a quiescence-like state driven by cytoskeletal disruption.
      • This work particularly highlights how mechanical inputs feed into transcriptional programs to regulate proliferation, with implications for understanding anchorage-dependent growth.

      Suggestions

      • While the authors argue convincingly against classical senescence, elevated SA-βGal and SASP expression suggest a more nuanced arrest state. It is not really clear what this state is or is not, therefore a deeper discussion of possible hybrid or intermediate states would be helpful - maybe potential additional experiments to include or exclude potential explanations - e.g. how does it differ from G0 exit?

      Our findings show that MRTF inactivation inhibits cell proliferation under conditions that would permit efficient cell cycle progression in wildtype cells, inducing a state with some features associated with classical senescence, and others conventionally associated with reversible cell cycle arrest/quiescence. The reviewer correctly points out that this raises problems with accurately defining the nature of the MRTF-null proliferation defect.

      To our knowledge there are no rigorously defined unambiguous markers for senescence, quiescence, or G0. Indeed, recent studies have shown that senescence and quiescence / G0 states are not as distinct as previously assumed (Anwar et al, 2018; Ashraf et al 2023) as we reviewed in detail in Discussion p27, §2; p28 §3. We therefore do not consider it a productive endeavour to define markers for the MRTF-null state as opposed to defining its mechanistic basis. However, we agree that we should have been clearer about how the phenotypes we observe relate to classical cell arrest states.

      We have therefore revised the presentation of the Results to make it clear which features of the non-proliferative state associated with MRTF inactivation are seen in classical senescence, and which are found in reversible cell cycle exit or quiescence.

      Things done:

      • Results pp16-17 and Fig 1. Figure panels and presentation are reordered to present “senescence” features together before marker expression (panel G is now panel I). Text now explicitly points out that the spectrum of cell cycle markers, specifically p27 upregulation, is not that associated with classical senescence (p16, p21, etc.) but was previously linked to reversible arrest or quiescence. Lines 371-380 have been moved up from the succeeding paragraph; a statement has been added re p27 and reversible cell cycle exit on lines 387-389; and a summary sentence has been added in lines 398-401.
      • Statement added that reversibility distinguishes the MRTF defect from classical senescence p20§1 line 454-455.
      • Note that p27 is associated with reversible arrest included on p20§2 line 460. We also explicitly summarised the features of the phenotype at the start of the Discussion.

      • Sentences added p27§1 lines 626-631.

      • Emphasis that p27 protein upregulation is associated with reversible cell cycle inhibition and quiescence is added on p28 line 668-669.

      • The transcriptomic data are strong, but the paper would benefit from zooming in on specific MRTF-SRF targets (e.g., actin isoforms, adhesion molecules) that directly link cytoskeletal regulation to cell cycle control.*

      We have now clarified presentation of the RNAseq data in Figure 5 and the data summary tables. Figure 5B now identifies which of those genes showing deficits in MRTF-null MEFs were previously identified as direct genomic targets for MRTF-SRF, and that the majority are cytoskeletal.

      • Additional columns added in Table 1 to indicate whether genes are candidate genomic MRTF-SRF targets; Table 2 now shows gene symbol lists as well as Ensembl IDs for GO categories and NCBI Entrez IDs for GSEA categories, respectively.
      • Figure 5B revised to point out cytoskeletal genes that are genomic MRTF-SRF targets in bold, legend clarified p40 lines 920-922.
      • Now noted on p23 lines 527-529 that cytoskeletal genes affected include many direct MRTF-SRF targets.

      Our data confirm that in MEFs, MRTF inactivation affects fibroblast cell morphology, adhesion, spreading, motility and contractility (Figures 5, 6), as seen in many other settings.

      A critical question remains as to whether these effects reflect a limitation in one MRTF target gene or several, and how this defect relates to proliferation.

      Concerning specific MRTF-SRF gene targets:

      Cells lacking cytoplasmic actins are reported to exhibit defective proliferation (now noted in Results p23 lines 529-532). We are currently evaluating whether this defect has similarities with the MRTF-null proliferation phenotype (see Discussion p31, §2).

      Previous findings suggest that defective cytoplasmic actin expression may underlie most MRTF knockout phenotypes (Salvany et al, 2014; Maurice et al, 2024), as previously noted in the Discussion (see p31, §2).

      The myoferlin gene promotes growth of liver cancer cells by inhibiting ERK activation and oncogene-induced senescence. We showed in the original submission that myoferlin expression does not promote proliferation of MRTF-null MEFs (see Figure S5E). Additionally, we now point out that the RNAseq data show that myoferlin expression is not significantly affected in MRTF-null MEFs (new text p23, lines 532-534).

      • It depends on what the target journal would be, but this is a very well-executed mechanistic study that doesn't really have an impact. Extending the discussion to human systems-or tissues where contractility is critical-could broaden the impact and applicability of the findings.*

      We interpret this comment as indicating that our paper does not address the wider biological implications of our findings by extension to studies in tissue or tumour systems.

      As outlined in our response to review 3, our study provides strong evidence that MRTF-SRF will be required for cell proliferation in settings where physical progression through cell cycle transitions requires high contractility, either owing to intrinsic factors or external physical constraints such as tissue stiffness, fibrosis, or tumour microenvironment.

      Discussion now explicitly addresses potential roles for tissue stiffness (pp30§2 lines 717-718, and p32§1 725-727). However, we feel that resolution of this question is beyond the scope of the present paper.

      • As above, the paper briefly mentions transformation, but it would be valuable to elaborate on whether MRTF-SRF acts as a barrier or enabler in tumorigenesis under different conditions. This I feel is the main weakness remaining - e.g. it would be fine with enabling different effects driven by other transcription events in emerging tumour cells (oncogenic in context of RAS, suppressive in context of p53) but I think the manuscript fails to be definitive on this point. Addressing this would make a much stronger and more impactful study. I believe they have an impactful piece of science that outlines how mechanical events impact cell fate decisions, but this is unlikely to be the driver - ie it facilitates cell fate decisions in context of tissue stiffness.*

      We find it difficult to understand the precise points being made here.

      However, transformation has long been known to bypass physical constraints on proliferation such as the requirement for adhesion. Moreover, MRTF-SRF activity is not necessarily required for proliferation of all transformed cells (Hampl et al, 2013; Medjkane et al, 2009; our unpublished data). The relation of our findings to transformation is thus an open question, which we are actively pursuing. Now noted in revised Discussion p32, lines 752-755.

      MRTF-independent proliferation of tumor cells could reflect oncogenic signals substituting for MRTF-dependent ones (eg from focal adhesions), or relief of cytoskeletal constraints on proliferation (adhesion-independent proliferation). In contrast, proliferation of DLC1-deleted cancer cells is dependent on suppression of oncogene-induced senescence by MRTF-SRF signalling (Hampl et al, 2013). These points were already made in Discussion p28, pp30-31.

      Although our current work is focussed on cell transformation, we would respectfully suggest that the in-depth resolution of this complex question is beyond the scope of the present paper.

      See also response to (3) above.

      *Reviewer #1 (Significance (Required)): *

      *Overall *

      This is a well-executed and insightful study that deepens our understanding of how cytoskeletal signals drive proliferation through MRTF-SRF. It broadens the role of this pathway beyond motility and offers new perspectives on mechanotransduction and cellular plasticity. It is weak in its demonstration of biological significance, but if the aim is to present a pure basic cell biology story it is good.

      The vast majority of work with the SRF system has led to the common perception that its role is exclusively with cell motility and adhesive processes, not proliferation. The results presented in the paper, even if limited to cell culture models, are therefore novel.

      Reviewer #2

      (Evidence, reproducibility and clarity (Required)):

      *In this manuscript, Nielsen and colleagues examine the impact of MRTF-A/B and SRF gene inactivation on cell proliferation. They performed an extensive body of work (using multiple cell types and multiple clones) to show that MRTF inactivation causes cell cycle arrest and senescence (mimicking the phenotype of SRF knockout cells) although the changes in the expression of various CDK inhibitors were cell-type specific. *

      *Very interestingly, simultaneous inactivation of all three major CDK inhibitors failed to rescue MRTF knockout cells from their proliferation defect. Expectedly, MRTF knockout cells exhibited defects in actin cytoskeleton, adhesion, and contractility. Interestingly, hyperactivating Rho also failed to rescue MRTF knockout cells from proliferation defect. The main conclusion of the paper was derived from experiments which showed that inhibition of either ROCK or myosin caused wild-type cells to behave like MRTF knockout cells rather than demonstration of any molecular perturbation that could reverse the proliferation defect of MRTF knockout cells. *

      While the experimental studies are thorough and rigorous, a vast majority of the core findings related to the loss-of-function of MRTF that are reported herein (i.e. defects in cell proliferation, elevation of CDK inhibitors, migration, actin cytoskeleton, contractility) are not conceptually new and have been previously reported in other cell systems by several investigators including this research group.

      This is the first study showing that MRTF-SRF signalling is required for the proliferation of both primary and immortalised fibroblasts, and epithelial cells. We show that the MRTF-SRF non-proliferative state combines features of both classical senescence and reversible cell cycle exit / quiescence.

      The vast majority of previous work with the SRF system has led to the common perception that its role is exclusively related to cell motility and adhesive processes and not proliferation (see Olson and Nordheim 2010). Where proliferation has been examined directly, both others' and our own previous studies of the MRTFs in immune cells and cancer cell lines have revealed no direct role in proliferation (Schratt et al, 2001; Medjkane et al, 2009; Maurice et al, 2024).

      The results presented here are therefore novel.

      In the reviewer's opinion, since the authors have not been able to identify a molecular strategy to reverse the proliferation phenotype of MRTF knockout cells, the underlying mechanisms of MRTF-dependent regulation of cell proliferation remain largely unanswered.

      Indeed, our attempts to rescue the phenotype (knockouts of the CKIs, and overexpression of different downregulated factors) did not restore proliferation. We therefore now aim to attack the problem (i) through overexpression screens, and (ii) by identifying differences between MRTF-SRF dependent and -independent (eg transformed) cells. However, these are new projects that are beyond the scope of a revised paper.

      Other comments: Majority of the immunoblot data have not been quantified.

      P16 data in Fig 1G vs Fig S1A are not similar (although the authors mention that the findings are similar)

      We have addressed these issues by reorganising and quantifying the immunoblotting data as follows:

      • Figure S1A has been moved to new Figure 1I, replacing the limited analysis shown in old Figure 1G. This is more comprehensive, and displays data from all three WT and Mrtfab-/- pools.
      • Figure 1I data is quantified. Marker expression in each Mrtfab-/- pool is evaluated relative to its mean expression in the three WT pools treated in parallel.
      • A new Figure S1A shows mean marker expression across the three Mrtfab-/- pools, drawn from 5 independent analyses (not all markers included in each analysis). Different analyses of marker expression may exhibit variation, resulting from differences in handling, culture medium, plating density, relative confluence, etc. However, Mrtfab-/- cells exhibit markedly increased p27 and TLR2 expression, while expression of the other markers tested, including p16, consistently decreases.
      • Spearman comparisons among the WT and Mrtfab-/- pools show that relative marker expression is indeed well correlated between the pools of each genotype. Note on quantitation added in Methods p10 lines 209-213.
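
      The quantification described in the bullets above (expression in each knockout pool relative to the mean of the WT pools, followed by a Spearman comparison between pools) can be sketched as follows. This is only an illustrative sketch, not the authors' analysis code: the function names and the example band intensities are hypothetical, and the rank correlation is implemented in plain Python so that the example has no dependencies.

```python
from statistics import mean


def relative_expression(ko_values, wt_values):
    """Express each knockout-pool measurement relative to the mean of the WT pools."""
    wt_mean = mean(wt_values)
    return [v / wt_mean for v in ko_values]


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical band intensities for one marker: three WT pools, three knockout pools.
wt_pools = [1.0, 2.0, 3.0]
ko_pools = [4.0, 6.0, 8.0]
print(relative_expression(ko_pools, wt_pools))

# Hypothetical relative expression of four markers in two knockout pools.
pool_a = [3.1, 0.4, 0.7, 2.2]
pool_b = [2.8, 0.5, 0.9, 2.0]
print(spearman(pool_a, pool_b))
```

      A rank-based measure is a natural fit here because it only asks whether markers are ordered consistently between pools, so differences in absolute signal between blots matter less.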

      Figure 1I moved from former Figure S1A, to replace former Figure 1G. New legend now includes quantitation, and reference to Spearman correlations, p44 lines 834-841.

      New Figure S1A displays data from multiple independent experiments with all 3 Mrtfab-/- pools. New legend, p44 lines 997-1002.

      Figure S1B legend notes correlation between relative marker expression in untreated WT and Mrtfab-/- cells, p44, lines 1005-1008.

      Results text rewritten p17 lines 383-391; no reference to “similar”.

      *Reviewer #2 (Significance (Required)): *

      *This study aims to investigate a fundamental biological question of how an actin-regulated transcription machinery regulates cell proliferation and is therefore of broad significance. Strengths and limitations of this study are described above. *

      Reviewer #3

      *(Evidence, reproducibility and clarity (Required)): *

      Summary

      *The manuscript by Nielsen et al. (Treisman lab) entitled "MRTF-dependent cytoskeletal dynamics drive efficient cell cycle progression" investigates the effects on cell proliferation elicited upon cellular depletion of the transcription factors MRTF-A and MRTF-B. The MRTFs are actin-dependent co-factors of SRF, which direct the transcription of SRF target genes. The MRTF-SRF regulatory circuit defines both the functioning and the control of actin-driven cytoskeletal dynamics. *

      *The work presented identifies essential molecular links that interconnect cytoskeleton-dependent cellular activities (cell-cell adhesion, cell-substrate contact, cell spreading) and cell proliferation. *

      *General assessment on used methodology. *

      *The presented comprehensive body of work is performed competently; it includes all relevant and necessary state-of-the-art technologies. *

      Reviewer #3 (Significance (Required)):

      Advance

      Previously published evidence by others (including the Treisman group) had indicated that SRF does not seem essential for the proliferation of some cell types (i.e., embryonic (stem) cells, activation-dependent immune cells, etc.). In regard to this, the authors discuss in the current manuscript: "Although further work is needed to elucidate the basis for these context-dependent differences, our data show that MRTF-SRF signalling is likely to play a more general role in proliferation than previously thought." The current manuscript already delineates this "general role": MRTF-SRF signalling impinges on cell proliferation whenever proliferative activities are dependent upon cytoskeletal dynamics.

      We of course support the view that it is MRTF-SRF's role in cytoskeletal dynamics, especially contractility, that is a limiting factor for cell cycle progression in our cells; however, this may not be the case for other cell types or settings, such as adhesion-independent or transformed cells, and/or stiff tissue environments.

      We have stated this view more strongly, modifying the abstract and discussion, and rewording the sentence quoted above.

      The major point is that MRTF-SRF-dependent proliferation may be more common than previously thought, the field having focussed on its role in cytoskeletal dynamics rather than proliferation.

      Abstract lines 48-49; Discussion p28, lines 668-669; pp30-31, lines 713-714, 725-727. See also last para pp31/32, added lines 752-755.

      *The work has implications for cancer biology. It offers new directions to investigate the regulation of proliferative activities of anchorage-independent tumor cells. **

      Audience *

      *The insights generated serve the wide interests of a large and diverse group of cell and tumor biologists. *

      *Reviewers field of expertise (keywords). *

      Cytoskeletal dynamics, transcriptional con*

  2. Aug 2025
    1. Author response:

      Point-by-point description of the revisions

      Reviewer #1 (Evidence, reproducibility, and clarity):

      The work by Pinon et al describes the generation of a microvascular model to study Neisseria meningitidis interactions with blood vessels. The model uses a novel and relatively high throughput fabrication method that allows full control over the geometry of the vessels. The model is well characterized. The authors then study different aspects of Neisseria-endothelial interactions and benchmark the bacterial infection model against the best disease model available, a human skin xenograft mouse model, which is one of the great strengths of the paper. The authors show that Neisseria binds to the 3D model in a geometry similar to that in the animal xenograft model, induces an increase in permeability shortly after bacterial perfusion, and induces endothelial cytoskeleton rearrangements. Finally, the authors show neutrophil recruitment to bacterial microcolonies and phagocytosis of Neisseria. The article is overall well written, and it is a great advancement in the bioengineering and sepsis infection field, and I only have a few major comments and some minor.

      Major comments:

      Infection-on-chip. I would recommend the authors to change the terminology of "infection on chip" to better reflect their work. The term is vague and it decreases novelty, as there are multiple infection-on-chip models that recapitulate other infections (recently reviewed in https://doi.org/10.1038/s41564-024-01645-6) including Ebola, SARS-CoV-2, Plasmodium and Candida. Maybe the term "sepsis on chip" would be more specific and exemplify better the work and novelty. Also, I would suggest that the authors carefully take a look at the text and consider when they use VoC or the current term IoC, as of now sometimes they are used interchangeably, with VoC being used occasionally in bacteria perfused experiments.

      We thank Reviewer #1 for this suggestion. Indeed, we have chosen to replace the term "Infection-on-Chip" by "infected Vessel-on-chip" to avoid any confusion in the title and the text. Also, we have removed all the terms "IoC" which referred to "Infection-on-Chip" and replaced with "VoC" for "Vessel-on-Chip". We think these terms will improve the clarity of the main text.

      Author response image 1.

      F-actin (red) and ezrin (yellow) staining after 3h of infection with N. meningitidis (green) in 2D (top) and 3D (bottom) vessel-on-chip models.

      Fig 3 and Supplementary 3: Permeability. The authors suggest that early 3h infection with Neisseria does not show an increase in vascular permeability in the animal model, contrary to their findings in the 3D in vitro model. However, they show a non-significant increase in permeability of 70 kDa Dextran in the animal xenograft early infection. This seems to suggest that if the experiment had been done with a lower molecular weight tracer, significant increases in permeability could have been detected. I would suggest doing this experiment, which could capture early events in vascular disruption.

      Comparing permeability under healthy and infected conditions using Dextran smaller than 70 kDa is challenging. Previous research (1) has shown that molecules below 70 kDa already diffuse freely in healthy tissue. Given this high baseline diffusion, we believe that no significant difference would be observed before and after N. meningitidis infection, and these experiments were not carried out. As discussed in the manuscript, bacteria-induced permeability in mouse occurs at later time points, 16h post infection, as shown previously (2). As also discussed in the manuscript, this difference between the xenograft model and the chip likely reflects the absence in the chip of various cell types present in the tissue parenchyma.

      The authors show the formation of actin of a honeycomb structure beneath the bacterial microcolonies. This only occurred in 65% of the microcolonies. Is this result similar to in vitro 2D endothelial cultures in static and under flow? Also, the group has shown in the past positive staining of other cytoskeletal proteins, such as ezrin in the ERM complex. Does this also occur in the 3D system?

      We thank the Reviewer #1 for this suggestion.

      • According to this recommendation, we imaged monolayers of endothelial cells in the flat regions of the chip (the two lateral channels) using the same microscopy conditions (i.e., Obj. 40X N.A. 1.05) that were used to detect honeycomb structures in the 3D vessels in vitro. We showed that more than 56% of infected cells present these honeycomb structures in 2D, which is 13% less than in 3D; this difference is not significant given the distributions of both populations. Thus, we conclude that under both in vitro conditions, 2D and 3D, the proportion of infected cells exhibiting cortical plaques is similar. We have added the graph and the confocal images in Figure S4B and lines 418-419 of the revised manuscript.

      • We recently performed staining of ezrin in the chip and imaged both the 3D and 2D regions. Although ezrin staining was visible in 3D (Fig. 1 of this response), it was not as obvious as other markers under these infected conditions and we did not include it in the main text. Interpretation of this result is not straightforward because, for instance, the substrate of the cells is different, and it would require further studies on the behaviour of ERM proteins in these different contexts.

      One of the most novel things of the manuscript is the use of a relatively quick photoablation system. I would suggest that the authors add a more extensive description of the protocol in methods. Could this technique be applied in other laboratories? If this is a major limitation, it should be listed in the discussion.

      Following the Reviewer’s comment, we introduced more detailed explanations regarding the photoablation:

      • L157-163 (Results): "Briefly, the chosen design is digitalized into a list of positions to ablate. A pulsed UV-LASER beam is injected into the microscope and shaped to cover the back aperture of the objective. The laser is then focused on each position that needs ablation. After introducing endothelial cells (HUVEC) in the carved regions,…"

      • L512-516 (Discussion): "The speed capabilities drastically improve with the pulsing repetition rate. Given that our laser source emits pulses at 10kHz, as compared to other photoablation lasers with repetitions around 100 Hz, our solution could potentially gain a factor of 100."

      • L1082-1087 (Materials and Methods): "…, and imported in a python code. The control of the various elements is embedded and checked for this specific set of hardware. The code is available upon request." Adding these three paragraphs gives more details on how photoablation works thus improving the manuscript.
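
      The factor of 100 quoted above follows directly from the ratio of the two repetition rates. A minimal arithmetic sketch is given below, under the assumption (ours, not stated in the manuscript) that total ablation time is limited by pulse delivery alone, ignoring stage movement and any other overhead. The number of positions and the pulses per position are hypothetical values chosen only for illustration.

```python
def ablation_time_s(n_positions, pulses_per_position, repetition_rate_hz):
    """Time to deliver all pulses, assuming pulse delivery is the only limiting step."""
    return n_positions * pulses_per_position / repetition_rate_hz


def speed_gain(fast_rate_hz, slow_rate_hz):
    """Ratio of ablation speeds for two pulse repetition rates."""
    return fast_rate_hz / slow_rate_hz


# Repetition rates quoted in the response: 10 kHz versus about 100 Hz.
print(speed_gain(10_000, 100))  # prints 100.0

# Hypothetical design: 50,000 positions, 10 pulses each.
print(ablation_time_s(50_000, 10, 10_000))  # prints 50.0 (seconds)
print(ablation_time_s(50_000, 10, 100))     # prints 5000.0 (seconds)
```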

      Minor comments:

      Supplementary Fig 2. The reference to subpanels H and I is swapped.

      The references to subpanels H and I have been corrected in the revised version.

      Line 203: I would suggest to delete this sentence. Although a strength of the submitted paper is the direct comparison of the VoC model with the animal model to better replicate Neisseria infection, a direct comparison with animal permeability is not needed in all vascular engineering papers, as vascular permeability measurements in animals have been well established in the past.

      The sentence "While previously developed VoC platforms aimed at replicating physiological permeability properties, they often lack direct comparisons with in vivo values." has been removed from the revised text.

      Fig 3: Bacteria binding experiments. I would suggest the addition of more methodological information in the main results text to guarantee a good interpretation of the experiment. First, it would be better that wall shear stress rather than flow rate is described in the main text, as flow rate is dependent on the geometry of the vessel being used. Second, how long was the perfusion of Neisseria in the binding experiment performed to quantify colony doubling or elongation? As per figure 1C, I would guess 100 min, but it would be better if this information is directly given to the readers.

      We thank Reviewer #1 for these two suggestions that will improve the text clarity (e.g., L316). (i) Indeed, we now express the flow rate in terms of shear stress. (ii) Also, we have normalized the quantification of the colony doubling time according to the first time-point where a single bacterium is attached to the vessel wall. Thus, early-adhering bacteria are represented by a longer curve and late-adhering bacteria by a shorter one. In total, the experiment lasted for 3 hours (modifications appear in L318 and L321-326).
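
      One possible way to carry out the normalization described in (ii) is sketched below: each colony's time axis is re-zeroed at the first time-point with an attached bacterium, and the doubling time is then estimated from a least-squares fit of log2(count) against time. This is an assumption about how such an analysis could be done, not the authors' actual pipeline; the function names and the example counts are hypothetical.

```python
import math


def align_to_first_attachment(times_min, counts):
    """Drop time-points before the first attached bacterium and re-zero the time axis."""
    start = next(i for i, c in enumerate(counts) if c >= 1)
    t0 = times_min[start]
    return [t - t0 for t in times_min[start:]], counts[start:]


def doubling_time_min(times_min, counts):
    """Doubling time as the inverse slope of a least-squares fit of log2(count) vs time."""
    ys = [math.log2(c) for c in counts]
    n = len(ys)
    mean_t = sum(times_min) / n
    mean_y = sum(ys) / n
    slope = (
        sum((t - mean_t) * (y - mean_y) for t, y in zip(times_min, ys))
        / sum((t - mean_t) ** 2 for t in times_min)
    )
    return 1.0 / slope


# Hypothetical colony: nothing attached at t = 0, first bacterium seen at 30 min.
times = [0, 30, 60, 90, 120]
counts = [0, 1, 2, 4, 8]
aligned_times, aligned_counts = align_to_first_attachment(times, counts)
print(aligned_times, aligned_counts)
print(doubling_time_min(aligned_times, aligned_counts))
```

      With this alignment, a colony that attaches early contributes a longer curve than one that attaches late, as described in the response.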

      Fig 4: The honeycomb structure is not visible in the 3D rendering of panel D. I would recommend to show the actin staining in the absence of Neisseria staining as well.

      According to this suggestion, a zoom of the 3D rendering of the cortical plaque without colony has been added to Figure 4 of the revised manuscript.

      Line 421: E-selectin is referred as CD62E in this sentence. I would suggest to use the same terminology everywhere.

      We have replaced the "CD62E" term with "E-selectin" to improve clarity.

      Line 508: "This difference is most likely associated with the presence of other cell types in the in vivo tissues and the onset of intravascular coagulation". Do the authors refer to the presence of perivascular cells, pericytes or fibroblasts? If so, it could be good to mention them, as well as those future iterations of the model could include the presence of these cell types.

      By "other cell types", we refer to pericytes (3), fibroblasts (4), and perivascular macrophages (5), which surround endothelial cells and contribute to vessel stability. The main text was modified to include this information (Lines 548 and 555-570) and their potential roles during infection are discussed.

      Discussion: The discussion covers very well the advantages of the model over in vitro 2D endothelial models and the animal xenograft but fails to include limitations. This would include the choice of HUVEC cells, an umbilical vein cell line to study microcirculation, the lack of perivascular cells or limitations on the fabrication technique regarding application in other labs (if any).

      We thank Reviewer #1 for this suggestion. Indeed, our manuscript may lack explaining limitations, and adding them to the text will help improve it:

      • The perspectives of our model include introducing perivascular cells surrounding the vessel and fibroblasts into the collagen gel as discussed previously and added in the discussion part (L555-570).

      • Our choice for HUVEC cells focused on recapitulating the characteristics of venules that respect key features such as the overexpression of CD62E and adhesion of neutrophils during inflammation. Using microvascular endothelial cells originating from different tissues would be very interesting. This possibility is now mentioned in the discussion lines 567-568.

      • Photoablation is a homemade fabrication technique that can be implemented in any lab harboring an epifluorescence microscope. This method has been more detailed in the revised manuscript (L1085-1087).

      Line 576: The authors state that the model could be applied to other systemic infections but failed to mention that some infections have already been modelled in 3D bioengineered vascular models (examples found in https://doi.org/10.1038/s41564-024-01645-6). This includes a capillary photoablated vascular model to study malaria (DOI: 10.1126/sciadv.aay724).

      These two important references have been introduced in the main text (L84, 647, 648).

      Line 1213: Are the 6M neutrophil solution in 10ul under flow. Also, I would suggest to rewrite this sentence in the following line "After, the flow has been then added to the system at 0.7-1 µl/min."

      We now specified that neutrophils are circulated in the chip under flow conditions, lines 1321-1322.

      Significance

      The manuscript is comprehensive, complete and represents the first bioengineered model of sepsis. One of the major strengths is the careful characterization and benchmarking against the animal xenograft model. Its main limitation is the brief description of the photoablation methodology, and more clarity is needed in the description of bacteria perfusion experiments, given their complexity. The manuscript will be of interest for the general infection community and to the tissue engineering community if more details on fabrication methods are included. My expertise is on infection bioengineered models.

      Reviewer #2 (Evidence, reproducibility, and clarity):

      Summary:

      The authors develop a Vessel-on-Chip model, which has geometrical and physical properties similar to the murine vessels used in the study of systemic infections. The vessel was created via highly controllable laser photoablation in a collagen matrix, subsequent seeding of human endothelial cells and flow perfusion to induce mechanical cues. This vessel could be infected with Neisseria meningitidis, as a model of systemic infection. In this model, microcolony formation and dynamics, and effects on the host were very similar to those described for the human skin xenograft mouse, which is the current gold standard for these studies, and were consistent with observations made in patients. The model could also recapitulate the neutrophil response upon N. meningitidis systemic infection.

      Major comments:

      I have no major comments. The claims and the conclusions are supported by the data, the methods are properly presented and the data is analyzed adequately. Furthermore, I would like to propose an optional experiment that could improve the manuscript. In the discussion it is stated that the vascular geometry might contribute to bacterial colonization in areas of lower velocity. It would be interesting to recapitulate this experimentally. It is of course optional but it would be of great interest, since this is something that can only be proven in the organ-on-chip (where flow speed can be tuned) and not as much in animal models. Besides, it would increase impact, demonstrating the superiority of the chip in this area rather than proving to be equal to current models.

      We have conducted additional experiments on infection in different vascular geometries and have now added these results to Figure 3/S3 and lines 288-305. We compared shear stress levels, as determined by COMSOL simulation, with experimentally determined bacterial adhesion sites. In the conditions used, the range of shear generated by the tested geometries does not appear to change the efficiency of bacterial adhesion. These results are consistent with a previous study from our group, which showed that in this range of shear stresses the effect on adhesion is limited (6). Furthermore, qualitative observations in the animal model indicate that bacteria do not have an obvious preference in terms of binding site.

      Minor comments:

      I have a series of suggestions which, in my opinion, would improve the discussion. They are further elaborated in the following section, in the context of the limitations.

      • How to recapitulate the vessels in the context of a specific organ or tissue? If the pathogen is often found in the luminal space of other organs after disseminating from the blood, how can this process be recapitulated with this model, if at all?

      For reasons that are not fully understood, postmortem histological studies reveal bacteria only inside blood vessels but rarely if ever in the organ parenchyma. The presence of intravascular bacteria could nevertheless alter cells in the tissue parenchyma. The notable exception is the brain, where bacteria exit the vascular lumen to access the cerebrospinal fluid. The chip we describe is fully adapted to develop a blood brain barrier model and more specific organ environments. This implies the addition of more cell types in the hydrogel. A paragraph on this topic has been added (Lines 548 and 552-570).

      • Similarly, could other immune responses related to systemic infection be recapitulated? The authors could discuss the potential of including other immune cells that might be found in the interstitial space, for example.

      This important discussion point has been added to the manuscript (L623-636). As suggested by Reviewer #2, other immune cells respond to N. meningitidis and can be explored using our model. For instance, macrophages and dendritic cells are activated upon N. meningitidis infection, eliminate the bacteria through phagocytosis, and produce pro-inflammatory cytokines and chemokines, potentially activating lymphocytes (7). Such an immune response, although complex, would be interesting to study in our model, as skin-xenograft mice are deprived of B and T lymphocytes to ensure acceptance of human skin grafts.

      • A minor correction: in line 467 it should probably be "aspects" instead of "aspect", and the authors could consider rephrasing that sentence slightly for increased clarity.

      We have corrected the sentence with "we demonstrated that our VoC strongly replicates key aspects of the in vivo human skin xenograft mouse model, the gold standard for studying meningococcal disease under physiological conditions." in lines 499-503.

      Strengths and limitations

      The most important strength of this manuscript is the technology they developed to build this model, which is impressive and very innovative. The Vessel-on-Chip can be tuned to acquire complex shapes and, according to the authors, the process has been optimized to produce models very quickly. This is a great advancement compared with the technologies used to produce other equivalent models. This model proves to be equivalent to the most advanced model used to date, but allows microscopy to be performed with higher resolution and ease, which can in turn allow more complex and precise image-based analysis. However, the authors do not seem to present any new mechanistic insights obtained using this model. All the findings obtained in the infection-on-chip demonstrate that the model is equivalent to the human skin xenograft mouse model, and can offer superior resolution for microscopy. However, the advantages of the model do not seem to be exploited to obtain more insights on the pathogenicity mechanisms of N. meningitidis, host-pathogen interactions or potential applications in the discovery of potential treatments. For example, experiments to elucidate the role of certain N. meningitidis genes on infection could enrich the manuscript and prove the superiority of the model. However, I understand these experiments are time-consuming and out of the scope of the current manuscript. In addition, the model lacks the multicellularity that characterizes other similar models. The authors mention that the pathogen can be found in the luminal space of several organs; however, this luminal space has not been recapitulated in the model. Even though this would be a new project, it would be interesting for the authors to hypothesize about the possibilities of combining this model with other organ models. The inclusion of circulating neutrophils is a great asset; however, it would also be interesting to hypothesize about how to recapitulate other immune responses related to systemic infection.

      We thank Reviewer #2 for his/her comment on the strengths and limitations of our work. The difficulty is that our study opens many future research directions and applications. We hope that the work serves as the basis for many future studies, but one can only address a limited set of experiments in a single manuscript.

      • Experiments investigating the role of N. meningitidis genes require significant optimization of the system. Multiplexing is a potential avenue for future development, which would allow the testing of many mutants. The fast photoablation approach is particularly amenable to such adaptation.

      • Cells and bacteria inside the chambers could be isolated and analyzed at the transcriptomic level or by flow cytometry. This would imply optimizing a protocol for collecting cells from the device via collagenase digestion, for instance. This type of approach would also benefit from multiplexing to enhance the number of cells.

      • As mentioned above, the revised manuscript discusses the multicellular capabilities of our model, including the integration of additional immune cells and potential connections to other organ systems. We believe that these approaches are feasible and valuable for studying various aspects of N. meningitidis infection.

      Advance

      The most important advance of this manuscript is technical: the development of a model that proves to be equivalent to the most complex model used to date to study meningococcal systemic infections. The human skin xenograft mouse model requires complex surgical techniques and has the practical and ethical limitations associated with the use of animals. However, the Infection-on-chip model is completely in vitro, can be produced quickly, and allows to precisely tune the vessel’s geometry and to perform higher resolution microscopy. Both models were comparable in terms of the hallmarks defining the disease, suggesting that the presented model can be an effective replacement of the animal use in this area.

      Other vessel-on-chip models can recapitulate an endothelial barrier in a tube-like morphology, but do not recapitulate other complex geometries, that are more physiologically relevant and could impact infection (in addition to other non-infectious diseases). However, in the manuscript it is not clear whether the different morphologies are necessary to study or recapitulate N. meningitidis infection, or if the tubular morphologies achieved in other similar models would suffice.

      Audience

      This manuscript might be of interest for a specialized audience focusing on the development of microphysiological models. The technology presented here can be of great interest to researchers whose main area of interest is the endothelium and the blood vessels, for example, researchers on the study of systemic infections, atherosclerosis, angiogenesis, etc. Thus, the tool presented (vessel-on-chip) can have great applications for a broad audience. However, even when the method might be faster and easier to use than other equivalent methods, it could still be difficult to implement in another laboratory, especially if it lacks expertise in bioengineering. Therefore, the method could be of more interest for laboratories with expertise in bioengineering looking to expand or optimize their toolbox. Alternatively, this paper presents itself as an opportunity to begin collaborations, since the model could be used to test other pathogens or conditions.

      Field of expertise:

      Infection biology, organ-on-chip, fungal pathogens.

      I lack the expertise to evaluate the image-based analysis.

      References

      (1) Gyohei Egawa, Satoshi Nakamizo, Yohei Natsuaki, Hiromi Doi, Yoshiki Miyachi, and Kenji Kabashima. Intravital analysis of vascular permeability in mice using two-photon microscopy. Scientific Reports, 3(1):1932, Jun 2013. ISSN 2045-2322. doi: 10.1038/srep01932.

      (2) Valeria Manriquez, Pierre Nivoit, Tomas Urbina, Hebert Echenique-Rivera, Keira Melican, Marie-Paule Fernandez-Gerlinger, Patricia Flamant, Taliah Schmitt, Patrick Bruneval, Dorian Obino, and Guillaume Duménil. Colonization of dermal arterioles by Neisseria meningitidis provides a safe haven from neutrophils. Nature Communications, 12(1):4547, Jul 2021. ISSN 2041-1723. doi: 10.1038/s41467-021-24797-z.

      (3) Mats Hellström, Holger Gerhardt, Mattias Kalén, Xuri Li, Ulf Eriksson, Hartwig Wolburg, and Christer Betsholtz. Lack of pericytes leads to endothelial hyperplasia and abnormal vascular morphogenesis. Journal of Cell Biology, 153(3):543–554, Apr 2001. ISSN 0021-9525. doi: 10.1083/jcb.153.3.543.

      (4) Arsheen M. Rajan, Roger C. Ma, Katrinka M. Kocha, Dan J. Zhang, and Peng Huang. Dual function of perivascular fibroblasts in vascular stabilization in zebrafish. PLOS Genetics, 16(10):1–31, 10 2020. doi: 10.1371/journal.pgen.1008800.

      (5) Huanhuan He, Julia J. Mack, Esra Güç, Carmen M. Warren, Mario Leonardo Squadrito, Witold W. Kilarski, Caroline Baer, Ryan D. Freshman, Austin I. McDonald, Safiyyah Ziyad, Melody A. Swartz, Michele De Palma, and M. Luisa Iruela-Arispe. Perivascular macrophages limit permeability. Arteriosclerosis, Thrombosis, and Vascular Biology, 36(11):2203–2212, 2016. doi: 10.1161/ATVBAHA.116.307592.

      (6) Emilie Mairey, Auguste Genovesio, Emmanuel Donnadieu, Christine Bernard, Francis Jaubert, Elisabeth Pinard, Jacques Seylaz, Jean-Christophe Olivo-Marin, Xavier Nassif, and Guillaume Dumenil. Cerebral microcirculation shear stress levels determine Neisseria meningitidis attachment sites along the blood–brain barrier. Journal of Experimental Medicine, 203(8):1939–1950, 07 2006. ISSN 0022-1007. doi: 10.1084/jem.20060482.

      (7) Riya Joshi and Sunil D. Saroj. Survival and evasion of Neisseria meningitidis from macrophages. Medicine in Microecology, 17:100087, 2023. ISSN 2590-0978. doi: 10.1016/j.medmic.2023.100087.

    1. Author response:

      The following is the authors’ response to the current reviews

      Reviewer #2 (Public review): 

      This manuscript describes the role of the production of c-di-AMP on the chlamydial developmental cycle. The main findings remain the same. The authors show that overexpression of the dacA-ybbR operon results in increased production of c-di-AMP and early expression of transitionary and late genes. The authors also knocked down the expression of the dacA-ybbR operon and reported a modest reduction in the expression of both hctA and omcB. The authors conclude with a model suggesting the amount of c-di-AMP determines the fate of the RB, continued replication, or EB conversion. 

      Overall, this is a very intriguing study with important implications; however, the data is very preliminary and the model is very rudimentary. The data support the observation that dramatically increased c-di-AMP has an impact on transitionary gene expression and late gene expression, suggesting dysregulation of the developmental cycle. This effect goes away with modest changes in c-di-AMP (deltaTM-DacA vs deltaTM-DacA (D164N)). However, the model's prediction that low levels of c-di-AMP delay EB production is not well supported by the data. If this prediction were true then the growth rate would increase with c-di-AMP reduction and the data does not show this. The levels of c-di-AMP at the lower levels need to be better validated as it seems like only very high levels make a difference for dysregulated late gene expression. However, on the low end it's not clear what levels are needed to have an effect as only DacAopMut and DacAopKD show any effects on the cycle and the c-di-AMP levels are only different at 24 hours.

      These appear to be the same comments the reviewer presented last time, so we will reiterate our prior points here and elsewhere. We do not think, nor do we predict, that low c-di-AMP levels should increase growth rate (as measured by gDNA levels), and this conclusion cannot be drawn from our data. Rather, we predict that the inability to accumulate c-di-AMP should delay production of EBs, and this is what the data show. The reviewer has applied their own subjective (and erroneous) interpretation to the model. The asynchronicity of the normal developmental cycle means RBs continue to replicate as EBs are forming, so gDNA levels cannot be used as the sole metric for determining RB levels. We show that reduced c-di-AMP levels reduce EB levels as well as transcripts associated with late stages of development. The parsimonious interpretation of these data is that low c-di-AMP levels delay progression through the developmental cycle, consistent with our model.

      The data still do not support the overall model.

      We disagree.  We have presented quantified data that include appropriate controls and statistical tests, and the reviewer has not disputed that or pointed to additional experiments that need to be performed.  The reviewer has imposed a subjective interpretation of our model based on their own biases.  A reader is free, of course, to disagree with our model, but a reviewer should not block a manuscript based on such a disagreement if no experimental flaws have been identified. 

      In Figure 1 the authors show at 24 hpi. 

      We also showed data from 16hpi, which is a more relevant timepoint for assessing premature transition to EBs.  In contrast, the 24hpi is more important for assessing developmental effects of reduced c-di-AMP levels.

      DacA overexpression increases cdiAMP to ~4000 pg/ml 

      DacAmut overexpression reduces cdiAMP dramatically to ~256 pg/ml.

      DacATM overexpression increases cdiAMP to ~4000 pg/ml. 

      DacAmutTM overexpression does not seem to change cdiAMP ~1500 pg/ml.

      dacAKD decreases cdiAMP to ~300 pg/ml.

      dacAKDcom increased cdiAMP to ~8000 pg/ml. 

      DacA-ybbRop overexpression increased cdiAMP to ~500,000 pg/ml. 

      DacA-ybbRopmut ~300 pg/ml. 

      However in Figure 2 the data show that overexpression of DacA (cdiAMP ~4000 pg/ml) did not have a different phenotype than over expression of the mutant (cdiAMP ~256 pg/ml). HctA expression down, omcB expression down, euo not much change, replication down, and IFUs down. Additionally, Figure 3 shows no differences in anything measured although cdiAMP levels were again dramatically different. DacATM overexpression (~4000 pg/ml) and DacAmutTM (~1500). This makes it unclear what cdiAMP is doing to the developmental cycle. 

      As we have explained in the text and in response to reviewer comments on previous rounds of review, overexpressing the full-length WT or mutant DacA is detrimental to developmental cycle progression for reasons that have nothing to do with c-di-AMP levels (likely disrupting membrane function), since, as the reviewer notes, the WT DacA deltaTM strain had similar c-di-AMP levels but no negative effects on growth/development. If we had not presented the effects of overexpressing the individual isoforms, then a reviewer would surely have requested such, which is why we present these data even though they don’t seem to support our model.  This is an honest representation of our findings.  The reviewer seems intent on nitpicking a minor datapoint that seems to contradict the rest of the manuscript while ignoring or not carefully reading the rest of the manuscript.

      In Figure 4 the authors knockdown dacA (dacA-KD) and complement the knockdown (dacA-KDcom) 

      dacAKD decreases cdiAMP (~300) while DacA-KDcom increases cdiAMP much above wt (~8000). 

      KD decreased hctA and omcB at 24hpi. Complementation resulted in a moderate increase in hctA at a single time point but not at 24 hpi and had no effect on euo or omcB expression.

      By 24hpi, late gene transcripts are being maximally produced during a normal developmental cycle. It is unclear why the reviewer thinks that these transcripts should be elevated above this level in any of our strains that prematurely transition to EBs. There is no basis in the literature to support such an assumption. As we noted in the text, the dacA-KDcom strain phenocopied the dacAop OE strain, and we showed RNAseq data and EB production curves for the latter that support our conclusions of the effect of increased c-di-AMP levels on developmental progression.

      Importantly, complementation decreased the growth rate.

      Yes, since the c-di-AMP levels breached the “EB threshold” at 16hpi, this causes premature transition to EBs, which do not replicate their gDNA, at an earlier stage of the cycle when fewer organisms are present. Therefore, the gDNA levels are decreased at 24hpi, which is consistent with our model.

      Based on the proposed model, growth rate should increase as the chlamydia should all be RBs and replicating and not exiting the cell cycle to become EBs (not replicating).

      This is a spurious conclusion from the reviewer. As we clearly showed, the dacA-KDcom did not restore a wild-type phenotype and instead mimicked the dacAop OE strain. This was commented on in the text.

      Interestingly reducing cdiAMP levels by over expressing DacAmut (~256 pg/ml) did not have an effect on the cycle but the reduction in cdiAMP by knockdown of dacA (~300 pg/ml) did have a moderate effect on the cycle. 

      This is again a spurious conclusion from the reviewer. The dacAMut and dacA-KD strains are distinct. As noted in the text and above for DacA WT OE, overexpressing the DacAMut similarly disrupts organism morphology, which is different from dacA-KD. These strains should not be directly compared because of this. This point has been previously highlighted in the text (in Results and Discussion).

      For Figure 5 DacA-ybbRop was overexpressed and this increased cdiAMP dramatically ~500,000 pg/ml as compared to wt ~1500. This increased hctA only at an early timepoint and not at 24hpi and again had no effect on omcB or euo.

      As we explained in prior reviews, our RNAseq data more comprehensively assessed transcripts for the dacAop OE strain. These data show convincingly that late gene transcripts (not just hctA and omcB) are elevated earlier in the developmental cycle. Again, it is not clear why the reviewer should expect that late gene transcripts should be higher in these strains than they are during a normal developmental cycle. This is not part of our model and appears to be a bias that the reviewer has imposed that is not supported by the literature.

      Overexpression of the operon with the mutation DacA-ybbRopmut reduced cdiAMP to ~300 pg/ml and this showed a reduction in growth rate similar to dacAmut but a more dramatic decrease in IFUs. 

      As we described in the text, in earlier revisions, and above, the dacAMut OE strain has distinct effects unrelated to c-di-AMP levels and, therefore, should not be compared to other strains in terms of linking its c-di-AMP levels to its phenotype.

      Overall: 

      DacA overexpression increases cdiAMP to ~4000 pg/ml (decreased everything except euo) 

      DacAmut overexpression reduces cdiAMP dramatically (~256 pg/ml). (decreased everything except euo) 

      DacATM overexpression increases cdiAMP to ~4000 pg/ml (no changes noted) 

      DacAmutTM overexpression does not seem to change cdiAMP ~1500 pg/ml (no changes noted) 

      dacAKD decrease cdiAMP to ~300 pg/ml (decreased everything except euo) 

      dacAKDcom increased cdiAMP to ~8000 pg/ml (decreases growth rate, increase hctA a little but not omcB) 

      DacA-ybbRop overexpression increased cdiAMP to ~500,000 pg/ml (decreases growth rate, increase hctA a little but not omcB)

      DacA-ybbRopmut ~300 pg/ml (decreased everything except euo)
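
      For readers who want to compare the strains at a glance, the approximate 24 hpi values quoted in the list above can be expressed as fold changes relative to the ~1500 pg/ml wild-type level that the reviewer cites. This is only an illustrative sketch: the numbers are the reviewer's rounded readings rather than measured data, and the dictionary keys and the fold_change helper are hypothetical names introduced here, not part of the manuscript.

```python
# Approximate c-di-AMP levels (pg/ml) at 24 hpi, copied from the reviewer's
# summary list. These are rounded readings, not raw measurements.
levels_pg_per_ml = {
    "DacA OE": 4000,
    "DacAmut OE": 256,
    "DacATM OE": 4000,
    "DacAmutTM OE": 1500,
    "dacA-KD": 300,
    "dacA-KDcom": 8000,
    "DacA-ybbRop OE": 500_000,
    "DacA-ybbRopmut OE": 300,
}

# Wild-type reference quoted by the reviewer ("as compared to wt ~1500").
WT_LEVEL = 1500


def fold_change(level, reference=WT_LEVEL):
    """Return the level as a multiple of the reference level."""
    return level / reference


fold_changes = {strain: fold_change(v) for strain, v in levels_pg_per_ml.items()}

# Print strains from lowest to highest c-di-AMP relative to wild type.
for strain, fc in sorted(fold_changes.items(), key=lambda kv: kv[1]):
    print(f"{strain:20s} {fc:8.2f}x")
```

      Note that these are 24 hpi values only; the authors' replies argue that the 16 hpi levels are the relevant ones for judging premature transition to EBs, and those are not listed here.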

      Overall, the data show that increasing cdiAMP only has a phenotype if it is dramatically increased, no effect at 4000 pg/ml.

      Yes, this clearly shows there is a threshold - as we hypothesize!  However, these thresholds are more important at the 16hpi timepoint not 24hpi (which the reviewer is referencing) when assessing premature transition to EBs.  We specifically highlighted in our prior revision in Figure 1E this EB threshold to make this point clearer for the reader.  Once the threshold is breached, then the overall c-di-AMP levels become irrelevant as the RBs have begun their transition to EBs.

      Decreasing cdiAMP has a consistent effect, decreased growth rate, IFU, hctA expression and omcB expression. However, if their proposed model was correct and low levels of cdiAMP blocked EB conversion then more chlamydial cells would be RBs (dividing cells) and the growth rate should increase.

      The only effect should be normal gDNA levels, which is what we see in the dacA-KD.  Given the asynchronicity of a normal developmental cycle in which RBs continue to replicate as EBs are still forming, there is no basis to assume gDNA levels should increase under these conditions for the dacA-KD strain at 24hpi.

      Conversely, if cdiAMP levels were dramatically raised then all RBs would all convert and the growth rate would be very low.

      We agree. This is what is reflected by the dacAop OE and dacA-KDcom strains, with reduced gDNA levels at 24hpi since organisms have transitioned to EBs at an earlier time post-infection.

      When cdiAMP was raised to ~4000 pg/ml there was no effect on the growth rate.

      Yes, because it had not breached the EB threshold at 16hpi – consistent with our model!  The reviewer is confusing effects of elevated c-di-AMP at 24hpi when they should be assessed at the 16hpi timepoint for strains overproducing this molecule.

      However, an increase to ~8000 pg/ml resulted in a significant decrease but growth continued.

      If the reviewer is referring to the dacA-KDcom strain, then this is not accurate. gDNA levels were decreased in this strain at 24hpi when the c-di-AMP levels were increased compared to the WT (mCherry OE) control at 16hpi, indicating this strain had breached the “EB threshold” and initiated conversion to EBs at an earlier timepoint post-infection when fewer organisms were present.

      Increasing cdiAMP to ~500,000 pg/ml had less of an impact on the growth rate.

      It is not clear what this conclusion is based on and what the reviewer is comparing to.  This is a subjective assessment not based on our data.

      Overall, the data does not cleanly support the proposed model.

      It is an unfortunate aspect of biology, particularly for obligate intracellular bacteria – a challenging experimental system on which to work, that the data are not always “clean”.  The overall effects of increased c-di-AMP levels on chlamydial developmental cycle progression we have documented support our model, and we think the reader, as always, should make their own assessment.


      The following is the authors’ response to the original reviews.

      Reviewer #2 (Public review): 

      This manuscript describes the role of the production of c-di-AMP on the chlamydial developmental cycle. The main findings remain the same. The authors show that overexpression of the dacA-ybbR operon results in increased production of c-di-AMP and early expression of transitionary and late genes. The authors also knocked down the expression of the dacA-ybbR operon and reported a modest reduction in the expression of both hctA and omcB. The authors conclude with a model suggesting the amount of c-di-AMP determines the fate of the RB, continued replication, or EB conversion. 

      Overall, this is a very intriguing study with important implications; however, the data is very preliminary, and the model is very rudimentary. The data support the observation that dramatically increased c-di-AMP has an impact on transitionary gene expression and late gene expression, suggesting dysregulation of the developmental cycle. This effect goes away with modest changes in c-di-AMP (deltaTM-DacA vs deltaTM-DacA (D164N)). However, the model's prediction that low levels of c-di-AMP delay EB production is not well supported by the data. If this prediction were true then the growth rate would increase with c-di-AMP reduction and the data does not show this.

      Thank you for the comments. We have apparently not adequately communicated our predictions and the model. We do not think, nor do we predict, that low c-di-AMP levels should increase growth rate, and there is no basis in any of our data to support that. Rather, we predict that the inability to accumulate c-di-AMP should delay production of EBs, and this is what the data show. We have clarified this in the text (line 89 paragraph).

      The levels of c-di-AMP at the lower levels need to be better validated as it seems like only very high levels make a difference for dysregulated late gene expression. However, on the low end it's not clear what levels are needed to have an effect as only DacAopMut and DacAopKD show any effects on the cycle and the c-di-AMP levels are only different at 24 hours.

      Our hypothesis is that increasing concentrations of c-di-AMP within a given RB is a signal for it to undergo secondary differentiation to the EB, and the data support this as noted by the reviewers. Again, we stress that low levels of c-di-AMP are irrelevant to the model. We have revised Figure 1E to indicate the level of c-di-AMP in the control strain at the 24hpi timepoint that coincides with increased EB levels. We hope this will further clarify the goals of our study. That a given strain might be below the EB control is not relevant to the model beyond indicating that it has not reached the necessary threshold for triggering secondary differentiation.

      The authors responded to reviewers' critiques by adding the overexpression of DacA without the transmembrane region. This addition does not really help their case. They show that deltaTM-DacA and deltaTM-DacA (D164N) had the same effects on c-di-AMP levels but the figure shows no effects on the developmental cycle.

      As it relates directly to the reviewer’s point, the delta-TM strains did not show the same level of c-di-AMP. It may be that the reviewer misread the graph. The purpose of testing these strains was to show that the negative effects of overexpressing full-length WT DacA were due to its membrane localization. Both the FL and deltaTM-DacA (WT) overexpression had equivalent c-di-AMP levels even though the delta-TM overexpression looked like the mCherry-expressing strain based on the measured parameters. This shows that the c-di-AMP levels were irrelevant to the phenotypes observed when overexpressing these WT isoforms. For the mutant isoforms, the delta-TM looked like the mCherry-expressing control while the FL isoform was negatively impacted for reasons we described in the Discussion (e.g., dominant negative effect). In addition, at 16hpi, neither delta-TM strain had c-di-AMP levels that approached the 24h control as denoted in Figure 1E (dashed line) and in the text, which explains why these strains did not show increased late gene transcripts at an earlier timepoint like the dacAop and dacA-KDcom strains.

      Describing the significance of the findings: 

      The findings are important and point to very exciting new avenues to explore the important questions in chlamydial cell form development. The authors present a model that is not quantified and does not match the data well. 

      We respectfully disagree with this assessment as noted above in response to the reviewer’s critique. All of our data are quantified and support the hypothesis as stated.

      Describing the strength of evidence: 

      The evidence presented is incomplete. The authors do a nice job of showing that overexpression of the dacA-ybbR operon increases c-di-AMP and that knockdown or overexpression of the catalytically dead DacA protein decreases the c-di-AMP levels. However, the effects on the developmental cycle and how they fit the proposed model are less well supported. 

      Overall this is a very intriguing finding that will require more gene expression data, phenotypic characterization of cell forms, and better quantitative models to fully interpret these findings. 

      It is not clear what quantitative models the reviewer would prefer, but, ultimately, it is up to the reader to decide whether they agree or not with the model we present. The data are the data, and we have tried to present them as clearly as possible. We would emphasize that, with the number of strains we have analyzed, we have presented a huge amount of data for a study with an obligate intracellular bacterium. As a comparison, most publications on Chlamydia might use a handful of transformant strains, if any. Given the cost and time associated with performing such studies, it is prohibitive to attempt all the time points that one might like to do, and it is not clear to us that further studies will add to or alter the conclusions of the current manuscript.

      Reviewer #2 (Recommendations for the authors): 

      Minor critiques 

      The graphs have red and blue lines but the figure legends are red and black. It would be better if these matched. 

      Changed.

      For Figure 1C. The labels are not very helpful. It's not clear what is HeLa vs mCherry. I believe it is uninfected vs Chlamydia infected.

      Changed.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This study uses mesoscale simulations to investigate how membrane geometry regulates the multiphase organization of postsynaptic condensates. It reveals that dimensionality shifts the balance between specific and non-specific interactions, thereby reversing domain morphology observed in vitro versus in vivo.

      Strengths:

      The model is grounded in experimental binding affinities, reproduces key experimental observations in 3D and 2D contexts, and offers mechanistic insight into how geometry and molecular features drive phase behavior.

      Weaknesses:

      The model omits other synaptic components that may influence domain organization and does not extensively explore parameter sensitivity or broader physiological variability.

      We thank the reviewer for his/her time and effort on our manuscript. We agree that the contribution of other synaptic components should be addressed. We have included a discussion of the effects of environmental factors such as protein and ion concentrations, as well as other omitted postsynaptic components (SAPAP, Shank, and Homer) on phase morphology. In the middle of the 2nd paragraph of the Discussion, we added:

      “While these in vivo results contain additional scaffold and cytoskeletal elements omitted in our model, such as SAPAP, Shank and Homer, nearly all proteins in the middle and lower layers of the PSD associate directly or indirectly with PSD-95 in the upper PSD layer. Consequently, it is probable that other scaffold proteins contribute to the mobility of AMPAR-containing and NMDAR-containing nanodomains indistinguishably. They may increase the stability of the AMPAR and NMDAR clusters but are unlikely to have a distinct effect to reverse the phase-separation phenomenon.”

      Also, as the reviewer pointed out, we agree that physiological factors such as ion concentration may influence the phase. However, conditions such as ion concentration are implicitly implemented as the specific and nonspecific interactions in this model, which makes it difficult to estimate the effect of each physiological condition individually. We added the potential variability of physiological conditions to the Discussion section as a limitation of this model. To investigate parameter sensitivity in more detail, we performed additional MD simulations with weakened membrane constraints to account for the behavior between 3D and 2D. We added:

      “First, our results did not provide direct insights to physiological conditions, such as ion concentrations. Since such factors are implicitly implemented in our model, it is difficult to estimate these effects individually. This suggests the need for future implementation of environmental factors and validation under a broader range of in vivo-like settings.”

      Reviewer #2 (Public review):

      This is a timely and insightful study aiming to explore the general physical principles for the sub-compartmentalization--or lack thereof--in the phase separation processes underlying the assembly of postsynaptic densities (PSDs), especially the markedly different organizations in three-dimensional (3D) droplets on one hand and the two-dimensional (2D) condensates associated with a cellular membrane on the other. Simulation of a highly simplified model (one bead per protein domain) is carefully executed. Based on a thorough consideration of various control cases, the main conclusion regarding the trade-off between repulsive excluded volume interactions and attractive interactions among protein domains in determining the structures of 3D vs 2D model PSD condensates is quite convincing. The results in this manuscript are novel; however, as it stands, there is substantial room for improvement in the presentation of the background and the findings of this work. In particular,

      (i) conceptual connections with prior works should be better discussed 

      (ii) essential details of the model should be clarified, and

      (iii) the generality and limitations of the authors' approach should be better delineated.

      We appreciate the reviewer for his/her time and effort on our manuscript and for encouraging comments and helpful suggestions. We answered every technical comment the reviewer mentioned below.

      Specifically, the following items should be addressed (with the additional references mentioned below cited and discussed):

      (1) Excluded volume effects are referred to throughout the text by various terms and descriptions such as "repulsive force according to the volume" (e.g., in the Introduction), "nonspecific volume interaction", and "volume effects" in this manuscript. This is somewhat curious and not conducive to clarity, because these terms have alternate meanings or connotations (e.g., in biomolecular modeling, repulsive interactions usually refer to those with longer spatial ranges, such as that between like charges). It will be much clearer if the authors simply refer to excluded volume interactions as excluded volume interactions (or effects).

      Thank you for this comment. We have substituted the words “excluded volume interactions” for words of similar meaning. However, we have left the expression of “non-specific interactions” as they are referring to explicit interactions that are given as force fields in the model, rather than in the general meaning of excluded volume effect.

      (2) In as much as the impact of excluded volume effects on subcompartmentalization of condensates ("multiple phases" in the authors' terminology), it has been demonstrated by both coarse-grained molecular dynamics and field-theoretic simulations that excluded volume is conducive to demixing of molecular species in condensates [Pal et al., Phys Rev E 103:042406 (2021); see especially Figures 4-5 of this reference]. This prior work bears directly on the authors' observation. Its relationship with the present work should be discussed.  

      We appreciate the reviewer’s insightful comment. We have now included a more detailed discussion on excluded volume effect in the revised manuscript, which provides important context for our findings. Furthermore, we have cited the references to support and enrich the discussion, as recommended.

      (3)  In the present model setup, activation of the CaMKII kinase affects only its binding to GluN2Bc. This approach is reasonable and leads to model predictions that are essentially consistent with the experiment. More broadly, however, do the authors expect activation of the CaMKII kinase to lead to phosphorylation of some of the molecular species involved with PSDs? This may be of interest since biomolecular condensates are known to be modulated by phosphorylation [Kim et al., Science 365:825-829 (2019); Lin et al, eLife 13:RP100284 (2025)].  

      We agree that phosphorylation effect on phase separation is an important and interesting aspect to consider. Some experimental results have shown that activation of CaMKII can lead to phosphorylation of various proteins and make PSD condensate more stable by altering their interactions. We included the sentence below in limitations:

      “In this context, we also do not explicitly account for downstream phosphorylation events. Although such proteins are not included in the current components, they will regulate PSD-95, affecting its binding valency, or diffusion coefficient. This is a subject worthy of future research.”

      (4) The forcefield for confinement of AMPAR/TARP and NMDAR/GluN2Bc to 2D should be specified in the main text. Have the authors explored the sensitivity of their 2D findings on the strength of this confinement?

      We thank the reviewer for the helpful recommendation. We have revised the manuscript to include the membrane-mimicking potential in the main text. Furthermore, we also think that exploring how the shape of the 3D/2D condensate phase depends on the strength of confinement is a very interesting point. We have additionally performed MD simulations with smaller/larger membrane constraints and included the results in the supporting information as Figure S5. The following parts are added:

      “We further attempted to mimic intermediate conditions between 3D and 2D systems in two different manners. First, we applied a weaker membrane constraint in 2D system. Even when the strength of membrane constraints is reduced by a factor of 1000, NMDARs are located on the inner side when the CaMKII was active, as well as the result in 2D system (Fig.S5ABC). Second, to weaken further the effect of membrane constraints, we artificially altered the membrane thickness from 5 nm to 50 nm, in addition to reducing the membrane constraints by 1000. As a result, NMDAR clusters move to the bottom and surround AMPAR (Fig.S5DEF). In this artificial intermediate condition, both states in which the NMDARs are outside (corresponding to 3D) and in which the NMDARs are inside (corresponding to 2D) are observed, depending on the strength of the membrane constraint.”

      (5)  Some of the labels in Figure 1 are confusing. In Figure 1A, the structure labeled as AMPAR has the same shape as the structure labeled as TARP in Figure 1B, but TARP is labeled as one of the smaller structures (like small legs) in the lower part of AMPAR in Figure 1A. Does the TARP in Figure 1B correspond to the small structures in the lower part of AMPAR? If so, this should be specified (and better indicated graphically), and in that case, it would be better not to use the same structural drawing for the overall structure and a substructure. The same issue is seen for NMDAR in Figure 1A and GluN2Bc in Figure 1B. 

      (6) In addition to clarifying Figure 1, the authors should clarify the usage of AMPAR vs TARP and NMDAR vs GluN2Bc in other parts of the text as well.

      (7) The physics of the authors' model will be much clearer if they provide an easily accessible graphical description of the relative interaction strengths between different domain-representing spheres (beads) in their model. For this purpose, a representation similar to that given by Feric et al., Cell 165:1686-1697 (2016) (especially Figure 6B in this reference) of the pairwise interactions among the beads in the authors' model should be provided as an additional main-text figure. Different interaction schemes corresponding to inactive and activated CaMKII should be given. In this way, the general principles (beyond the PSD system) governing 3D vs 2D multiple-component condensate organization can be made much more apparent.

      We sincerely appreciate the reviewer’s comments. According to the recommendation, we have changed the diagram in Figure 1B into interaction matrix with each mesoscale molecular representation and the expression in main text to be clearer about AMPAR and TARP, and about the relationship between NMDAR and GluN2Bc. Former diagram of the pairs of specific interaction is moved to supplementary figure. 

      (8) Can the authors' rationalization of the observed difference between 3D and 2D model PSD condensates be captured by an intuitive appreciation of the restriction on favorable interactions by steric hindrance and the reduction in interaction cooperativity in 2D vs 3D?  

      We thank the reviewer for the comment. As pointed out, the multiphase morphology change observed in this study can be attributed to a decrease in coordination number in 2D compared to 3D. We have included the physicochemical rationalization in the discussion.  

      (9) In the authors' model, the propensity to form 2D condensates is quite weak. Is this prediction consistent with the experiment? Real PSDs do form 2D condensates around synapses.  

      We are grateful to the reviewer for highlighting this important point. We agree that the real PSD forms 3D condensates beneath the 2D membrane. Some lower PSD components under the membrane (i.e. SAPAP, Shank, and Homer) are omitted in our system, which may cause a weak condensation. To emphasize this, we have added the following sentence:

      “While these in vivo results contain additional scaffold and cytoskeletal elements omitted in our model, such as SAPAP, Shank and Homer, nearly all proteins in the middle and lower layers of the PSD associate directly or indirectly with PSD-95 in the upper PSD layer. Consequently, it is probable that other scaffold proteins contribute to the mobility of AMPAR-containing and NMDAR-containing nanodomains indistinguishably. They may increase the stability of the AMPAR and NMDAR clusters but are unlikely to have a distinct effect to reverse the phase-separation phenomenon.”

      However, we believe that the clusters formed on the 2D membrane are not a robust “phase” because they do not follow a scaling law. In fact, in our previous study of the PSD system with AMPAR(TARP)₄ and PSD-95, we have already reported that phase separation is less likely to occur in 2D than in 3D. The previous result suggests that phase separation on the membrane may be difficult to achieve, which is consistent with the results of this study.

      (10) More theoretical context should be provided in the Introduction and/or Discussion by drawing connections to pertinent prior works on physical determinants of co-mixing and de-mixing in multiple-component condensates (e.g., amino acid sequence), such as Lin et al., New J Phys 19:115003 (2017) and Lin et al., Biochemistry 57:2499-2508 (2018). 

      (11) In the discussion of the physiological/neurological significance of PSD in the Introduction and/or Discussion, for general interest it is useful to point to a recently studied possible connection between the hydrostatic pressure-induced dissolution of model PSD and high-pressure neurological syndrome [Lin et al., Chem Eur J 26:11024-11031 (2020)].

      We thank the reviewer for the helpful recommendation. We have added the recommended references in each relevant part in introduction, respectively.

      (12) It is more accurate to use "perpendicular to the membrane" rather than "vertical" in the caption for Figure 3E and other such descriptions of the orientation of the CaMKII hexagonal plane in the text.

      We thank you for your comment. We replaced the word “vertical” with “perpendicular" in the main text and caption.

      Reviewer #3 (Public review):

      Summary:

      In this work, Yamada, Brandani, and Takada have developed a mesoscopic model of the interacting proteins in the postsynaptic density. They have performed simulations, based on this model and using the software ReaDDy, to study the phase separation in this system in 2D (on the membrane) and 3D (in the bulk). They have carefully investigated the reasons behind different morphologies observed in each case, and have looked at differences in valency, specific/non-specific interactions, and interfacial tension.

      Strengths:

      The simulation model is developed very carefully, with strong reliance on binding valency and geometry, experimentally measured affinities, and physical considerations like the hydrodynamic radii. The presented analyses are also thorough, and great effort has been put into investigating different scenarios that might explain the observed effects.

      Weaknesses:

      The biggest weakness of the study, in my opinion, has to do with a lack of more in-depth physical insight about phase separation. For example, the authors express surprise about similar interactions between components resulting in different phase separation in 2D and 3D. This is not surprising at all, as in 3D, higher coordination numbers and more available volume translate to lower free energy, which easily explains phase separation. The role of entropy is also significantly missing from the analyses. When interaction strengths are small, entropic effects play major roles. In the introduction, the authors present an oversimplified view of associative and segregative phase transitions based on the attractive and repulsive interactions, and I'm afraid that this view, in which all the observed morphologies should have clear pairwise enthalpic explanations, diffuses throughout the analysis. Meanwhile, I believe the authors correctly identify some relevant effects, where they consider specific/nonspecific interactions, or when they investigate the reduced valency of CaMKII in the 2D system.

      We thank the reviewer for the insightful and constructive comments. Regarding the difference in phase behavior between 2D and 3D systems, we appreciate the reviewer’s clarification that differences in coordination number and entropy in higher dimensions can account for the observed morphology of the phases. While it may be clear that entropy decreases due to the decrease of coordination number, our objective was to uncover how such an isotropic entropy reduction regulates the behavior of each phase driven by different interactions, which remains largely unknown. To emphasize this, we modified the introduction and have now included a discussion of the entropic contributions to phase behavior in both 2D and 3D systems, and we have made this clearer in the revised manuscript by referencing relevant theoretical frameworks. In the Discussion, we added the sentence below:

      “Generally, phase separation can be explained by the Flory-Huggins theory and its extensions: phase separation can be favored by the difference in the effective pairwise interactions in the same phase compared to those across different phases, and is disfavored by mixing entropy. The effective interactions contain various molecular interactions, including direct van der Waals and electrostatic interactions, hydrophobic interactions, and purely entropic macromolecular excluded volume interactions. For the latter, Asakura-Oosawa depletion force can drive the phase separation. Furthermore, the demixing effect was explicitly demonstrated in previous simulations and field theory (61). Importantly, we note that the effective pairwise interactions scale with the coordination number z. The coordination number is a clear and major difference between 3D and 2D systems. In 3D systems, large z allows both relatively strong few specific interactions and many weak non-specific interactions. While a single specific interaction is, by definition, stronger than a single non-specific interaction, contribution of the latter can have strong impact due to its large number. On the other hand, a smaller z in the membrane-bound 2D system limits the number of interactions. In case of limited competitive binding, specific interactions tend to be prioritized compared to non-specific ones. In fact, Fig. 3A clearly shows that number of specific interactions in 2D is similar to that in 3D, while that of non-specific interactions is dramatically reduced in 2D. In the current PSD system, CaMKII is characterized by large valency and large volume. In the 3D solution system, non-specific excluded volume interactions drive CaMKII to the outer phase, while this effect is largely reduced in 2D, resulting in the reversed multiphase.”
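
      As a hedged illustration of the statement that the effective pairwise interactions scale with the coordination number z, the simplest lattice (regular-solution) form of the Flory-Huggins free energy of mixing can be written as below. This is a textbook mean-field sketch for two equal-sized components, not the model used in the manuscript; the symbols φ (volume fraction of one component) and ε_AA, ε_BB, ε_AB (contact energies) are introduced here for illustration only.

```latex
\frac{f_{\mathrm{mix}}}{k_B T}
  = \phi \ln \phi + (1-\phi)\ln(1-\phi) + \chi\,\phi\,(1-\phi),
\qquad
\chi = \frac{z}{k_B T}\left[\varepsilon_{AB}
  - \tfrac{1}{2}\left(\varepsilon_{AA} + \varepsilon_{BB}\right)\right]
```

      In this mean-field picture, demixing requires χ to exceed the critical value χ_c = 2. Because χ is proportional to z, the same contact energies give a smaller χ on a lattice with fewer neighbors: going from an assumed simple cubic lattice (z = 6) to an assumed square lattice (z = 4) lowers χ by one third, which is one way to see why identical interactions can phase separate in 3D and fail to do so in 2D.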

      Also, I sense some haste in comparing the findings with experimental observations. For example, the authors mention that "For the current four component PSD system, the product of concentrations of each molecule in the dilute phase is in good agreement with that of the experimental concentrations (Table S2)." But the data used here is the dilute phase, which is the remnant of a system prepared at very high concentrations and allowed to phase separate. The errors reported in Table S2 already cast doubt on this comparison. 

      We thank the reviewer for the insightful comment. In the validation process, we adjusted the parameters so that the number of molecules in dilute phase is consistent with the experimental lower limit of phase separation, based on the assumption that phase-separated dilute phase is the same concentration as the critical concentration. That is why we focus on comparing dilute phase concentration in Table S2. However, in our simulations, the number of protein molecules is relatively small since it is based on the average number per synapse spine. For example, there are only about 60 CaMKII molecules at most, and its presence in the dilute phase is highly sensitive to concentration, as the reviewer pointed out. This is one of the limitations, so we have added a description to the Limitations section. We added:

      “Second, parameter calibration contains some uncertainty. Previous in vitro study results used for parameter validation are at relatively high concentrations for phase separation, which may shift critical thresholds compared to that in in vivo environments. Also, since the number of molecules included in the model is small, the difference of a single molecule could result in a large error during this validation process.”

      Or while the 2D system is prepared via confining the particles to the vicinity of the membrane, the different diffusive behavior in the membrane, in contrast to the bulk (i.e., the Saffman-Delbrück model), is not considered. This would thus make it difficult to interpret the results of a coupled 2D/3D system and compare them to the actual system.

      We appreciate the reviewer’s helpful comment. We agree that there is a concern that the Einstein-Stokes equation does not adequately reproduce the diffusion of membrane-embedded particles. We recalculated the diffusion coefficients for every membrane particle used in this model using the Saffman-Delbrück model and found that diffusion coefficients for receptor cores (AMPAR and NMDAR) were approximately three times larger. These values are still about 10 times smaller than that of molecules diffusing in the cytoplasm. Additionally, since this study focuses on the morphology of the phase/cluster at the thermodynamic equilibrium, we think that the magnitude of the diffusion coefficient has little influence on the final structure of the cluster. However, we will incorporate the membrane-embedded diffusion as a future improvement item for better modelling and implementation. We added:

      “Third, we estimated all the diffusion coefficients from the Einstein-Stokes equation, which may oversimplify membrane-associated dynamics. Applying the Saffman-Delbrück model to membrane-embedded particles would be desired although the resulting diffusion coefficients remain of the same order of magnitude. These limitations highlight the need for further research, yet they do not undermine the core significance of the present findings in advancing our understanding of multiphase morphologies.”
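
      For readers who want to see how the two diffusion estimates differ in form, the following is a minimal sketch of the Stokes-Einstein and Saffman-Delbrück expressions. All numerical inputs (particle radius, solvent viscosity, membrane viscosity, temperature) are hypothetical placeholders, not the authors' parameters; only the 5 nm membrane thickness is taken from the response above. The resulting numbers therefore do not reproduce the factor of about three quoted by the authors, which depends on their own parameterization.

```python
import math

K_B = 1.380649e-23         # Boltzmann constant in J/K
EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def stokes_einstein(radius, viscosity, temperature=300.0):
    """Diffusion coefficient (m^2/s) of a sphere in a bulk fluid."""
    return K_B * temperature / (6.0 * math.pi * viscosity * radius)


def saffman_delbruck(radius, membrane_viscosity, thickness,
                     solvent_viscosity, temperature=300.0):
    """Lateral diffusion coefficient (m^2/s) of a cylinder embedded in a membrane.

    Valid only when membrane_viscosity * thickness is much larger than
    solvent_viscosity * radius, so that the logarithmic term dominates.
    """
    ratio = (membrane_viscosity * thickness) / (solvent_viscosity * radius)
    log_term = math.log(ratio) - EULER_GAMMA
    if log_term <= 0.0:
        raise ValueError("parameters are outside the Saffman-Delbruck regime")
    prefactor = K_B * temperature / (4.0 * math.pi * membrane_viscosity * thickness)
    return prefactor * log_term


if __name__ == "__main__":
    # Hypothetical inputs chosen only to exercise the formulas.
    radius = 5e-9              # m, assumed receptor-core radius
    water_viscosity = 1e-3     # Pa*s, assumed solvent viscosity
    membrane_viscosity = 0.1   # Pa*s, assumed membrane viscosity
    thickness = 5e-9           # m, membrane thickness quoted in the response

    d_bulk = stokes_einstein(radius, water_viscosity)
    d_membrane = saffman_delbruck(radius, membrane_viscosity, thickness,
                                  water_viscosity)
    print(f"Stokes-Einstein (bulk):      {d_bulk:.3e} m^2/s")
    print(f"Saffman-Delbruck (membrane): {d_membrane:.3e} m^2/s")
```

      The key qualitative difference is that the Saffman-Delbrück coefficient depends only logarithmically on particle radius, whereas the Stokes-Einstein coefficient scales as the inverse of the radius.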

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      There has been intense controversy over the generality of Hamilton's inclusive fitness rule for how evolution works on social behaviors. All generally agree that relatedness can be a game changer, for example allowing for otherwise unselectable altruistic behaviors when 𝑐 < 𝑟𝑏, where 𝑐 is the fitness cost to the altruism, 𝑏 is the fitness benefit to another, and 𝑟 their relatedness. Many complications have been successfully incorporated into the theory, including different reproductive values and viscous population structures.

      I agree, especially if by incorporating viscous population structures, the reviewer means the discovery of the cancellation effect (Wilson, Pollock, and Dugatkin, 1992, Taylor, 1992).

      The controversy has centered on another dimension; Hamilton's original model was for additive fitness, but how does his result hold when fitnesses are non-additive? One approach has been not to worry about a general result but just find results for particular cases. A consistent finding is that the results depend on the frequency of the social allele - nonadditivity causes frequency dependence that was absent in Hamilton's approach.

      Just to be extra precise: Hamilton’s (1964) original model did not use the Price equation nor the regression approach to define costs and benefits, and it did indeed simply presuppose fixed, additive fitness effects.

      Also for extra precision on terminology: many researchers will describe all fitnesses in social evolution as frequency dependent. The reason they do is that with or without additivity, both the fitness of cooperators (with the social allele) and the fitness of defectors (without the social allele) typically increase in the frequency of cooperators in the population; the more cooperators there are, the more individuals run into them, which increases average fitness. The result depending on the frequency I take to mean that which of those two fitnesses is larger flips at a certain frequency, which automatically implies that the difference between them is depending on the frequency of the social allele. This is indeed the result of non-additivity. We will return to this in more detail in the response to Reviewer #3. Also at the end of Appendix B I have added a bit to be extra precise regarding frequency dependence.
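
      The distinction drawn here can be made concrete with a small sketch. It assumes a simple pairwise interaction model that is not taken from the paper: a cooperator pays cost c, anyone paired with a cooperator receives b, two cooperators each receive an extra synergy d, and a hypothetical assortment rule gives a cooperator a cooperating partner with probability r + (1 - r)p and a defector one with probability (1 - r)p. The function names and parameter values are illustrative only.

```python
def partner_coop_prob(p, r):
    """P(partner cooperates | own type) under the assumed assortment rule."""
    given_cooperator = r + (1.0 - r) * p
    given_defector = (1.0 - r) * p
    return given_cooperator, given_defector


def fitnesses(p, r, b, c, d=0.0):
    """Return (cooperator fitness, defector fitness) at cooperator frequency p.

    d is a hypothetical synergy paid only when both partners cooperate;
    d = 0 recovers additive fitness effects.
    """
    q_coop, q_defect = partner_coop_prob(p, r)
    w_coop = 1.0 - c + (b + d) * q_coop
    w_defect = 1.0 + b * q_defect
    return w_coop, w_defect


def flip_frequency(r, b, c, d):
    """Frequency at which both fitnesses are equal (needs d != 0 and r != 1)."""
    return (c - r * (b + d)) / (d * (1.0 - r))


if __name__ == "__main__":
    r, b, c = 0.25, 2.0, 1.0
    for d in (0.0, 1.0):
        print(f"synergy d = {d}")
        for p in (0.1, 0.5, 0.9):
            w_c, w_d = fitnesses(p, r, b, c, d)
            print(f"  p = {p}: w_C = {w_c:.3f}, w_D = {w_d:.3f}, "
                  f"difference = {w_c - w_d:+.3f}")
```

      With d = 0 both fitnesses rise with p but their difference stays at rb - c, so the sign never changes. With d different from zero the difference becomes rb - c + d(r + (1 - r)p), which changes sign at the frequency returned by `flip_frequency`.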

      Two other approaches derive from Queller via the Price equation. Queller 1 is to find forms like Hamilton's rule, but with additional terms that deal with non-additive interaction, each with an r-like population structure variable multiplied by a b-like fitness effect (Queller, 1985). Queller 2 redefines the fitness effects c and b as partial regressions of the actor's and recipient's genes on fitness. This leaves Hamilton's rule intact, just with new definitions of c and b that depend on frequency (Queller, 1992a).

      Queller 2 is the version that has been most adopted by the inclusive fitness community along with assertions that Hamilton's rule is completely general. In this paper, van Veelen argues that Queller 1 is the correct approach. He derives a general form that Queller only hinted at. He does so within a more rigorous framework that puts both Price's equation and Hamilton's rule on firmer statistical ground. Within that framework, the Queller 2 approach is seen to be a statistical misspecification - it employs a model without interaction in cases that actually do have interaction. If we accept that this is a fatal flaw, the original version of Hamilton's rule is limited to linear fitness models, which might not be common.

      I totally agree.

      Strengths:

      While the approach is not entirely new, this paper provides a more rigorous approach and a more general result. It shows that both Queller 1 and Queller 2 are identities and give accurate results, because both are derived from the Price equation, which is an identity. So why prefer Queller 1? It identifies the misspecification issue with the Queller 2 approach and points out its consequences. For example, it will not give the minimum squared differences between the model and data. It does not separate the behavioral effects of the individuals from the population state (𝑏 and 𝑐 become dependent on 𝑟 and the population frequency).

      Just to be precise on a detail: in the data domain, as long as the number of parameters in a statistical model is lower than the number of data points, adding parameters typically (generically) lowers the sum of squared errors. That is to say, for an underspecified statistical model, the sum of squared errors goes down if a parameter is added, but for an already overspecified statistical model, the same is still true (although, typically, by how much the sum of squared errors is reduced will differ). The model specification task for a statistician includes knowing when to keep adding parameters, because the data suggest that the model is still underspecified, and when to stop adding parameters, because the model is well-specified, even if adding parameters still reduces the sum of squared errors.

      In a modeling context, on the other hand, one can say that sum of squared differences will stop decreasing at the point where the statistical model is well-specified, that is: when it matches the model we are considering.
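
      A minimal sketch of this point in a modeling context follows. It assumes a hypothetical noise-free fitness function with an interaction term, w = 1 - c·x + b·y + d·x·y for own type x and partner type y, and hypothetical counts of the four pair types; none of these values come from the paper. An additive regression (no interaction) is compared with a regression that includes the interaction term, using a small hand-written weighted least squares routine.

```python
from itertools import product


def solve(matrix, rhs):
    """Solve a square linear system by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(n):
            if i != col:
                factor = aug[i][col] / aug[col][col]
                aug[i] = [a - factor * p for a, p in zip(aug[i], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def weighted_fit(rows, targets, weights):
    """Weighted least squares; returns (coefficients, weighted sum of squares)."""
    k = len(rows[0])
    xtx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
            for j in range(k)] for i in range(k)]
    xty = [sum(w * r[i] * t for r, t, w in zip(rows, targets, weights))
           for i in range(k)]
    beta = solve(xtx, xty)
    sse = sum(w * (t - sum(bi * ri for bi, ri in zip(beta, r))) ** 2
              for r, t, w in zip(rows, targets, weights))
    return beta, sse


def compare_models(c=1.0, b=2.0, d=3.0, counts=(40, 10, 10, 40)):
    """Fit additive and interaction models to a hypothetical non-additive fitness."""
    pairs = list(product((0, 1), repeat=2))  # (own type, partner type)
    fitness = [1 - c * x + b * y + d * x * y for x, y in pairs]
    additive = [[1, x, y] for x, y in pairs]
    interaction = [[1, x, y, x * y] for x, y in pairs]
    _, sse_additive = weighted_fit(additive, fitness, counts)
    beta, sse_interaction = weighted_fit(interaction, fitness, counts)
    return sse_additive, sse_interaction, beta


if __name__ == "__main__":
    sse_add, sse_int, beta = compare_models()
    print("additive model SSE:   ", sse_add)
    print("interaction model SSE:", sse_int)
    print("interaction model coefficients:", beta)
```

      Under these assumptions the additive fit leaves a strictly positive sum of squared differences, while the fit that matches the assumed model recovers its coefficients and leaves none, so no further term could reduce it.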

      The paper also shows how the same problems can apply to non-social traits. Epistasis is the non-additivity of effects of two genes within the individual. (So one wonders why have we not had a similarly fierce controversy over how we should treat epistasis?)

      The paper is clearly written. Though somewhat repetitive, particularly in the long supplement, most of that repetition has the purpose of underscoring how the same points apply equally to a variety of different models.

      Finally, this may be a big step towards reconciliation in the inclusive fitness wars. Van Veelen has been one of the harshest critics of inclusive fitness, and now he is proposing a version of it.

      I am very happy to hear this, because I am indeed hopeful for reconciliation. I would like to add a comment, though. The debate on Hamilton’s rule/inclusive fitness is regularly thought of as a battle between two partisan camps, where both sides care at least as much about winning as they do about getting things right. This is totally understandable, because to some degree that is true. Also, I agree that it is fair to position me in the camp that is critical of the inclusive fitness literature. However, I would like to think that I have not been taking random shots at Hamilton’s rule. I have pointed to problems with the typical use of the Price equation and Hamilton’s rule, and I think I did so for very good reasons. I am obviously very happy that finding the Generalized Price equation, and the general version of Hamilton’s rule, allowed me to go beyond this, and (finally) offer a correct alternative, and I totally appreciate that this opens the door for reconciliation, as this reviewer points out. But I would not describe this as a road-to-Damascus moment. In order to illustrate the continuity in my work, I would like to point to three papers.

      In van Veelen (2007), I pointed to the missing link between the central result in Hamilton’s (1964) famous paper (which states that selection dynamics take the population to a state where mean inclusive fitness is maximized), and Hamilton’s actual rule (which states that selection will lead to individuals maximizing their individual inclusive fitness). My repair stated the additional assumptions that were necessary to make the latter follow from the former. I would say that this can hardly be characterized as an attack on Hamilton’s rule. Reading Hamilton (1964) with enough care to notice something is missing, and then repairing it, I think is a sign of respect, and not an attack.

      Van Veelen (2011) is about the replicator dynamics for n-player games, with the possibility of assortment. This puts the paper in a domain that does not assume weak selection, and that is typically not much oriented towards inclusive fitness. I included a theorem that implies that, under the condition of linearity, inclusive fitness not only gets the direction of selection right, but 𝑟𝑏 − 𝑐 becomes a parameter that also determines the speed of selection. This I think is representative, in the sense that in many of my papers, I carefully stake out when the classic version of Hamilton’s rule does work.

      In Akdeniz and van Veelen (2020), we moreover take a totally standard inclusive fitness approach in a model of the cancellation effect at the group level.

      I would say that this does not line up with the image of a harsh critic that takes random shots at Hamilton’s rule or inclusive fitness.

      Weaknesses:

      van Veelen argues that the field essentially abandoned the Queller 1 approach after its publication. I think this is putting it too strongly - there have been a number of theoretical studies that incorporate extra terms with higher-order relatednesses. It is probably accurate to say that there has been relative neglect. But perhaps this is partly due to a perception that this approach is difficult to apply.

      I can imagine that the perceived difficulty in application may have played a role in the neglect of the Queller 1 approach. What for sure has played a role, and I would think a much bigger one, is that the literature has been pretty outspoken that the Queller 1 approach is the wrong way to go. The main text cites a number of papers that hold this position very emphatically (The first one of those was a News and Views by Alan Grafen (1985) that accompanied the paper in which Queller presented his Queller 1 approach. I am very happy that Appendix B shows on how many levels this News and Views was wrong.). There is only a handful of papers that follow the Queller 1 example.

      The model in this paper is quite elegant and helps clarify conceptual issues, but I wonder how practical it will turn out to be. In terms of modeling complicated cases, I suspect most practitioners will continue doing what they have been doing, for example using population genetics or adaptive dynamics, without worrying about neatly separating out a series of terms multiplying fitness coefficients and population structure coefficients.

      I am not sure if I see what the reviewer envisions practitioners that use population genetics will keep on doing. I would think that the Generalized Price equation in regression form is a description of population genetic dynamics, and therefore, if practitioners will not make an effort to “neatly separate out a series of terms multiplying fitness coefficients and population structure coefficients”, then all I can say is that they should. I cannot do more than explain why, if they do not, they are at risk of mischaracterizing what gets selected and why.

      Regarding those that use adaptive dynamics, I would say that this is a whole different approach. Within this approach, one can also apply inclusive fitness; see Section 6 and Appendix D of van Veelen et al. (2017). Appendix D is full of deep technical results and was done by Benjamin Allen.

      For empirical studies, it is going to be hard to even try to estimate all those additional parameters. In reality, even the standard Hamilton's rule is rarely tested by trying to estimate all its parameters. Instead, it is commonly tested more indirectly, for example by comparative tests of the importance of relatedness. That of course would not distinguish between additive and non-additive models that both depend on relatedness, but it does test the core idea of kin selection. It will be interesting to see if van Veelen's approach stimulates new ways of exploring the real world.

      Regarding the impact on empirical studies, there are a few things that I would like to say. The first is that I would just like to repeat, maybe a bit more elaborately, what I wrote at the end of the main text. Given that the generalized version of Hamilton’s rule produces a host of Hamilton-like rules, and given the fact that all of them by construction indicate the direction of selection accurately, the question whether or not Hamilton’s rule holds turns out to be ill-posed. That means that we can stop doing empirical tests of Hamilton’s rule, which are predicated on the idea that Hamilton’s rule, with benefits and costs being determined by the regression method, could be violated – which it cannot (Side note: it is possible to violate Hamilton’s rule, if costs and benefits are defined according to the counterfactual method; see van Veelen et al. (2017) and van Veelen (2018). This way of defining costs and benefits is less common, although there are authors that find this definition natural enough to assume that this is the way in which everybody defines costs and benefits (Karlin and Matessi, 1983, Matessi and Karlin, 1984).). Instead, we should do empirical studies to find out which version of Hamilton’s rule applies to which behaviour in which species.

      I would like to not understate what a step forward this is. The size of the step forward is of course also due to the dismal point of departure. As theorists, we have failed our empiricists, because all 12 studies included in the review by Bourke (2014) of papers that explicitly test Hamilton’s rule are based on the misguided idea that the traditional Hamilton’s rule, with costs and benefits defined according to the regression method, can be violated. While the field does sometimes have disdain for mathematical nit-picking, this is a point where a little more attention to detail would have really helped. If the hypothesis is that Hamilton’s rule holds, and the null is that it does not, then trying to specify how the empirical quantity that reflects inclusive fitness would be distributed under the null hypothesis (in order to do the right statistical tests) would have forced researchers to do something with the information that this quantity is not distributed at all, because Hamilton’s rule is general (in the sense that it holds for any way in which the world works). If one would prefer to reverse the null and the alternative hypothesis, one would run into similar problems. Understanding that the question is ill-posed therefore is a big step forward from the terrible state of statistics and the waste of research time, attention and money on the empirical side of this field (see also Section 8 of van Veelen et al., 2017).

      I would agree that doing comparative statics may not be much affected by this. Section 5 of van Veelen et al. (2017) indicates that there can be a large set of circumstances under which the general idea “relatedness up → cooperation up” still applies. But that may be a bit unambitious, and Section 8 of van Veelen et al. (2017) and the final section of van Veelen (2018) contain some reflections on empirical testing that may allow us to go beyond that. As long as there is change happening in the Generalized Price equation, the population is not in equilibrium. For empirical tests, one can either aim to capture selection as it happens, or assume that what we observe reflects properties of an equilibrium. This leads to interesting reflections on how to do empirics, which may differ between traits that are continuous and traits that are discrete (again: see van Veelen et al. (2017) and van Veelen (2018)).

      Reviewer #2 (Public review):

      Summary:

      This manuscript reconsiders the "general form" of Hamilton's rule, in which "benefit" and "cost" are defined as regression coefficients. It points out that there is no reason to insist on Hamilton's rule of the form −𝑐 + 𝑏𝑟 > 0, and that, in fact, arbitrarily many terms (i.e. higher-order regression coefficients) can be added to Hamilton's rule to reflect nonlinear interactions. Furthermore, it argues that insisting on a rule of the form −𝑐 + 𝑏𝑟 > 0 can result in conditions that are true but meaningless and that statistical considerations should be employed to determine which form of Hamilton's rule is meaningful for a given dataset or model.

      Totally right. I cannot help wanting to be extra precise, though, by distinguishing between the data domain and the modelling domain. In the data domain, statistical considerations apply in order to avoid misspecification. In this domain, avoiding misspecification can be complicated, because we do not know the underlying data generating process, and we depend on noisy data to make a best guess. In the modelling domain, however, there is no excuse for misspecification, as the model is postulated by the modeller. I therefore would think that in this domain, it does not really require “statistical considerations” to minimize the probability of misspecification; we can get the probability of misspecification all the way down to 0 by just choosing not to do it.

      Strengths:

      The point is an important one. While it is not entirely novel (the idea of adding extra terms to Hamilton's rule has arisen sporadically; Queller, 1985, 2011; Fletcher et al., 2006; van Veelen et al., 2017), it is very useful to have a systematic treatment of this point. I think the manuscript can make an important contribution by helping to clarify a number of debates in the literature. I particularly appreciate the heterozygote advantage example in the SI.

      Me too, and I really hope the readers make it this far! I have thought of putting it in the main text, but did not know where that would fit.

      Weaknesses:

      Although the mathematical analysis is rigorously done and I largely agree with the conclusions, I feel there are some issues regarding terminology, some regarding the state of the field, and the practice of statistics that need to be clarified if the manuscript is truly to resolve the outstanding issues of the field. Otherwise, I worry that it will in some ways add to the confusion.

      (1) The "generalized" Price equation: I agree that the equations labeled (PE.C) and (GPE.C) are different in a subtle yet meaningful way. But I do not see any way in which (GPE.C) is more general than (PE.C). That is, I cannot envision any circumstance in which (GPE.C) applies but (PE.C) does not. A term other than "generalized" should be used.

      This is a great point! Just to make sure that those who read the reports online understand this point, let me add some detail. The equation labeled (PE.C), which is short for Price equation in covariance form, is

      $\bar{w}\Delta\bar{p} = Cov(w, p) + E(w\Delta p)$

      The derivation in Appendix A then assumes that we have a statistical model that includes a constant and a linear term for the p-score. It then defines the model-estimated fitness of individual $i$ as $\hat{w}_i = w_i - \varepsilon_i$, where $w_i$ is the realized number of offspring of individual $i$, and $\varepsilon_i$ is the error term; it is the sum over all individuals of the square of this error term that is minimized. The vector of model-estimated fitnesses will typically be different for different choices of the statistical model. Appendix A then goes on to show that, whatever statistical model is used, for all of them $Cov(\hat{w}, p) = Cov(w, p)$, as long as the statistical model includes a constant and a linear term for the p-score. That means that we can rewrite (PE.C) as

      $\bar{w}\Delta\bar{p} = Cov(\hat{w}, p) + E(w\Delta p)$

      The point that the reviewer is making is that this is not really a generalization. For a given dataset (or, more generally, for a given population transition, whether empirical or in a model), $Cov(w, p)$ is just a number, and it happens to be the case that $Cov(\hat{w}, p)$ returns the same number, whatever statistical model we use for determining what the model-estimated fitnesses $\hat{w}_i$ are (as long as the statistical model includes a constant and a linear term for the p-score). In other words, (PE.C) is not really nested in (GPE.C), so (GPE.C) is not a proper generalization of (PE.C).

      This is a totally correct point, and I had actually struggled a bit with the question of what terminology to use here. Equation (GPE.C) is definitely general, in the sense that we can change the statistical model, and thereby change the vector of model-estimated fitnesses $\hat{w}$, but as long as we keep the constant and the linear term in the statistical model, the equation still applies. But it is not a generalization of (PE.C).
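The invariance described above can be checked numerically. The following is a minimal sketch, not taken from the manuscript: the six p-scores, the partner scores `q`, the offspring numbers `w`, and the three least-squares models are all hypothetical and chosen only for illustration. Exact fractions are used so that the equality can be tested without rounding error.

```python
from fractions import Fraction

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Covariance with denominator N (no Bessel's correction)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def solve(a, b):
    """Solve the square linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * u for v, u in zip(m[r], m[col])]
    return [row[n] for row in m]

def fitted(columns, w):
    """Least-squares fitted values of w on the given regressor columns."""
    k = len(columns)
    xtx = [[sum(a * b for a, b in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(a * y for a, y in zip(columns[i], w)) for i in range(k)]
    coef = solve(xtx, xty)
    return [sum(c * col[i] for c, col in zip(coef, columns)) for i in range(len(w))]

# Hypothetical data: own p-score, partner's p-score, realized number of offspring.
half = Fraction(1, 2)
p = [Fraction(0), Fraction(0), half, half, Fraction(1), Fraction(1)]
q = [Fraction(0), Fraction(1), Fraction(0), Fraction(1), half, Fraction(1)]
w = [Fraction(v) for v in (1, 3, 0, 2, 1, 4)]
ones = [Fraction(1)] * len(p)
pq = [a * b for a, b in zip(p, q)]

# Three hypothetical statistical models; each includes a constant and a linear term in p.
models = {
    "constant + p": [ones, p],
    "constant + p + q": [ones, p, q],
    "constant + p + q + p*q": [ones, p, q, pq],
}

for name, columns in models.items():
    w_hat = fitted(columns, w)
    print(name, "| Cov(w_hat, p) =", cov(w_hat, p), "| Cov(w, p) =", cov(w, p))
```

The fitted values differ between the models, but the least-squares residuals are orthogonal to the constant and to `p` in every one of them, which is why the covariance with `p` is unchanged.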

      I do however have a hard time coming up with a better label. The General Price equation may be a bit better, but it still suggests generalization. The Statistical Model-based Price equation does not suggest or imply generalization, but it does not convey how general it is, and it suggests that it could be an alternative to the normal Price equation that one may or may not choose to use – while this version really is the one we should use. It may moreover create the impression that this is only for doing statistics, and one might use the traditional Price equation for anything that is not statistics. I cannot really think of other good alternatives, but I am of course open to suggestions.

      So, for lack of a better label, I called this the Generalized Price equation in covariance form. Though clearly imperfect, there are still a few good things about this label. The first is that, as mentioned above, this equation is general, in the sense that it holds, regardless of the statistical model. The second reason is that this is Step 1 in a sequence of three steps, the other two of which do produce proper generalizations. Step 2 goes from this equation in covariance form to the Generalized Price Equation in regression form, which is a proper generalization of the traditional Price equation in regression form. Step 3 goes from the Generalized Price Equation in regression form to the general version of Hamilton’s rule, which is also a proper generalization of the classical Hamilton’s rule. Since I would suggest that Step 1 on its own is kind of useless, and therefore Step 1 and Step 2 will typically come as a package, I would be tempted to think that this justifies the abuse of terminology for the Price Equation in covariance form. I did however add the observation made by the reviewer at the point where the Generalized Price equation (in both forms) is derived, so I hope this at least partly addresses this concern.

      (2) Regression vs covariance forms of the Price equation: I think the author uses "generalized" in reference to what Price called the "regression form" of his equation. But to almost everyone in the field, the "Price Equation" refers to the covariance form. For this reason, it is very confusing when the manuscript refers to the regression form as simply "the Price Equation".

      As an example, in the box on p. 15, the manuscript states "The Price equation can be generalized, in the sense that one can write a variety of Price-like equations for a variety of possible true models, that may have generated the data." But it is not the Price equation (covariance form) that is being generalized here. It is only the regression that Price used that is being generalized.

      To be consistent with the field, I suggest the term "Price Equation" be used only to refer to the covariance form unless it is otherwise specified as in "regression form of the Price equation".

      I am not sure about the level of confusion induced here, but I totally see that it can be helpful to avoid all ambiguity. I therefore went over everything, and whenever I wrote “Price equation”, I tried to make sure it comes either with “in covariance form” or with “in regression form”. At some places, it is a bit over the top to keep repeating “in regression form”, when it is abundantly clear which form is being discussed. Also, I added no qualifiers if a statement is true for both forms of the Price equation, or if the claim refers to the whole package of going through Step 1 and Step 2 mentioned above.

      (3) Sample covariance: The author refers to the covariance in the Price equation as “sample covariance”. This is not correct, since sample covariance has a denominator of N-1 rather than N (Bessel’s correction). The correct term, when summing over an entire population, is “population covariance”. Price (1972) was clear about this: “In this paper we will be concerned with population functions and make no use of sample functions”. This point is elaborated on by Frank (2012), in the subsection “Interpretation of Covariance”.

      I totally agree. On page 418 of van Veelen (2005), I wrote:

      “Another possibility is that we think of 𝑧<sub>i</sub> and 𝑞<sub>i</sub>, 𝑖 = 1,…,𝑁 as realizations of a jointly distributed random variable. […] In that case the expression between square brackets is a good approximation for what statisticians […] call a sample covariance. A sample covariance is defined as $\frac{1}{N-1}\sum_{i=1}^{N}(z_i-\bar{z})(q_i-\bar{q})$, but in large samples it is OK to replace 𝑁 − 1 by 𝑁, and then this formula reduces to Price’s 𝐶𝑜𝑣(𝑧, 𝑞).”

      In van Veelen et al. (2012), I slid a little, because in Box 1 on page 66, I wrote that is the sample covariance, and only in footnote 1 on the same page did I include Bessel’s correction, when I wrote:

      “To be perfectly precise, the sample covariance is defined as

      In this manuscript, I slid a little further, and left Bessel’s correction out altogether. I am happy that the reviewer pointed this out, so I can make this maximally precise again.

      The reviewer also quotes Price (1972), page 485:

      “In this paper we will be concerned with population functions and make no use of sample functions”.

      Below, the reviewer will return to the issue of distinguishing between the sample covariance with Bessel’s correction, and the sample covariance without Bessel’s correction, where the latter is regularly also referred to as the population covariance. A natural interpretation of the quote from Price (1972), if we read a bit around this quote in the paper, is that the difference between his “population functions” and his “sample functions” is indeed Bessel’s correction.

      The reviewer also states that Frank (2012) elaborates on this in the subsection “Interpretation of Covariance”. What is interesting, though, is that, when Frank (2012) writes, on page 1017,

      “It is important to distinguish between population measures and sample measures”,

      the difference between those is not that one does, and the other does not include Bessel’s correction. The difference between “population measures” and “sample measures” in Frank (2012), page 1017, is that

      “In many statistical applications, one only has data on a subset of the full population, that subset forming a sample.”

      The distinction between a population covariance and a sample covariance in Frank (2012) therefore is that they are “covariances” of different things (where the word covariances is in quotation marks, because, again, they are not really covariances). Besides just making sure that Price (1972) and Frank (2012) are not using these terms in the same way, this also perfectly illustrates the mix-up between statistical populations (or data generating processes) and biological populations that I discuss on pages 8 and 9 of Appendix A. I will return to this below, when I explain why I want to avoid using the word “population covariance” for the sample covariance without Bessel’s correction.

      Of course, the difference is negligible when the population is large. However, the author applies the covariance formula to populations as small as 𝑁 = 2, for which the correction factor is significant.

      Absolutely right.
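To make the size of the correction factor concrete, here is a small sketch for a population of 𝑁 = 2. The p-scores and offspring numbers are hypothetical, and `cov` is an illustrative helper in which `ddof=0` divides by 𝑁 and `ddof=1` divides by 𝑁 − 1.

```python
from fractions import Fraction

def cov(xs, ys, ddof=0):
    """Covariance with denominator N - ddof (ddof=1 applies Bessel's correction)."""
    n = len(xs)
    mx = Fraction(sum(xs)) / n
    my = Fraction(sum(ys)) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - ddof)

# Hypothetical two-individual population: p-scores and numbers of offspring.
p = [1, 0]
w = [2, 1]

without_bessel = cov(w, p, ddof=0)   # denominator N
with_bessel = cov(w, p, ddof=1)      # denominator N - 1
print(without_bessel, with_bessel, without_bessel / with_bessel)
```

For 𝑁 = 2 the two versions differ by a factor (𝑁 − 1)/𝑁 = 1/2, so the choice of denominator is far from negligible here.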

      The author objects to using the term "population covariance" (SI, pp. 8-9) on the grounds that it might be misleading if the covariance, regression coefficients, etc. are used for inference because in this case, what is being inferred is not a population statistic but an underlying relationship. However, I am not convinced that statistical inference is or should be the primary use of the Price equation (see next point). At any rate, avoiding potential confusion is not a sufficient reason to use incorrect terminology.

      There are a few related, but separate issues. One is what to call the 𝐶𝑜𝑣(𝑤, 𝑝)-term. Another, somewhat broader, is to avoid mixing up statistical populations and biological populations. A third is what the primary use of the Price equation is. The third issue I will respond to below, where it reappears. Here I will focus on the first two, which can be discussed without addressing the third.

      In a data context, I now call the 𝐶𝑜𝑣(𝑤, 𝑝)-term “(𝑁 − 1)/𝑁 times the sample covariance, or, in other words, the sample covariance without Bessel’s correction”. This should be unambiguous. In a modeling context I refer to the 𝐶𝑜𝑣(𝑤, 𝑝)-term as “the 𝐶𝑜𝑣(𝑤, 𝑝)-term” and describe it as a summary statistic or a notational convention. There are two reasons for this choice.

      The first is that neither of these use the word “population”. I like this, because there is a persistent scope for confusion between statistical populations and biological populations (as exemplified by Frank, 2012). This leads to an incorrect, but widespread intuition that if we “know the entire (biological) population” in a data context, there is nothing that can be estimated. This is what pages 8 and 9 of Appendix A are all about.

      The second reason is that by using two labels, I also differentiate between the data context and the modeling context. This is important for reasons I will return to later.

      Relatedly, I suggest avoiding using 𝐸 for the second term in the Price equation, since (as the ms points out), it is not the expectation of any random variable. It is a population mean. There is no reason not to use something like Avg or bar notation to indicate population mean. Price (1972) uses "ave" for average.

      I totally agree that the second term in the Price equation is not an expectation. I made this point in van Veelen (2005), and I repeated this in the manuscript. This remark by the reviewer prompted me to spell this out a bit more emphatically in Appendix A. That still leaves me with the choice what notation to use.

      I therefore looked up all contributions to the Theme issue “Fifty years of the Price equation” in the Philosophical Transactions of the Royal Society B, and found that almost all contributions use 𝐸, sometimes saying that this refers to an expectation or an average. Of course, this is wrong. However (and this is another argument), it is equally wrong as using 𝐶𝑜𝑣 or 𝑉𝑎𝑟. The terms abbreviated as 𝐶𝑜𝑣 and 𝑉𝑎𝑟 are equally much not a covariance and a variance as the term abbreviated as 𝐸 is not an expectation. So I would think that there are a few reasons for sticking with 𝐸 here; 1) consistency with the literature; 2) consistency with the treatment of other terms; and 3) the fact that this term is not really of any importance in this manuscript. I do however totally understand the reviewer’s reasons, which I suppose include that for using 𝐸, there are relatively unproblematic alternatives (ave or upper bar) that are not available for the other terms. I hope therefore that being a bit more emphatic in the manuscript about 𝐸 not being an expectation at least partly addresses this concern.

      I should add, however, that the distinction between population statistics vs sample statistics goes away for regression coefficients (e.g. b, c, and r in Hamilton's rule) since in this case, Bessel's correction cancels out.

      Totally correct.
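The cancellation can be seen in a short sketch: a regression slope is a ratio of a covariance to a variance, so whichever denominator is used appears in both and drops out. The data below are hypothetical, and `cov` and `slope` are illustrative helpers rather than anything from the manuscript.

```python
from fractions import Fraction

def cov(xs, ys, ddof):
    """Covariance with denominator N - ddof."""
    n = len(xs)
    mx = Fraction(sum(xs)) / n
    my = Fraction(sum(ys)) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - ddof)

def slope(w, p, ddof):
    """Regression coefficient of w on p, computed as Cov(w, p) / Var(p)."""
    return cov(w, p, ddof) / cov(p, p, ddof)

# Hypothetical p-scores and offspring numbers for four individuals.
p = [0, 0, 1, 1]
w = [1, 2, 2, 4]

print(slope(w, p, ddof=0), slope(w, p, ddof=1))
```

Both calls return the same coefficient, even though the covariance and the variance each change when Bessel's correction is applied.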

      (4) Descriptive vs. inferential statistics: When discussing the statistical quantities in the Price Equation, the author appears to treat them all as inferential statistics. That is, he takes the position that the population data are all generated by some probabilistic model and that the goal of computing the statistical quantities in the Price Equation is to correctly infer this model.

      Before I respond to this, I would like to point out that this literature has started going off the rails right from the very beginning. One of the initial construction errors was to use the ungeneralized Price equation in regression form. The other one is that the paper in which Price (1970) presented his equation is inconsistent, and suggests that the equation can be used for constructing hypotheses and for testing them at the same time (see van Veelen (2005), page 416). That, of course, is not possible; the first happens in the theory/modeling domain, and the second in the empirical testing/statistics domain, and they are separate exercises.

      These construction errors have warped the literature based on it, and have resulted in a lot of mental gymnastics and esoteric statements, which are needed if we are not willing to consider the possibility that there could be anything amiss with the original paper by Price (1970).

      In this paper, I undo both of these construction errors. Undoing the second one means exploring both domains separately. In Sections 2-4 of Appendix A I explore the possibility that the Price equation is applied to data. In Section 5 of Appendix A I explore the possibility that it is used in a modelling context. The primary effort here is just to do it right, and I have not read anything to suggest that I did not succeed in doing this. Secondarily, of course, I also want to contrast this to what happens in the existing literature. That is what this point by the reviewer is about. It is therefore important to be aware that seeing the contrast accurately is complicated by the apologetic warp in the existing literature.

      As a first effort to unwarp, I would like to point to the fact that I am not taking any position on what the Price equation should be used for. All I do here is explore (and find) possibilities, both in the statistical inference domain and in the modeling domain. I also find that there is scope for misspecification in both, and that, in both domains, we should want to avoid misspecification. The thing that I criticize in the existing literature therefore is not the choice of domain. The thing that I criticize is the insistence on, and celebrating of what is most accurately described as misspecification. This typically happens in the modeling domain.

      It is worth pointing out that those who argue in favor of the Price Equation do not see it this way: "it is a mistake to assume that it must be the evolutionary theorist, writing out covariances, who is performing the equivalent of a statistical analysis." (Gardner, West, and Wild, 2011); "Neither data nor inferences are considered here" (Rousset, 2015). From what I can tell, to the supporters of the Price equation and the regression form of Hamilton's rule, the statistical quantities involved are either population-level *descriptive* statistics (in an empirical context), or else are statistics of random variables (in a stochastic modeling context).

      Again, this description of the friction between my paper and the existing literature is predicated on the suggestion that I have only one domain in mind where the Price equation can be applied. That is not the case; I consider both.

      In the previous paragraph, the reviewer states that I “treat statistical quantities as inferential statistics”, and in this paragraph the reviewer contrasts that with the supporters of the (ungeneralized) Price equation that supposedly treat the same quantities as “descriptive statistics”. This is also beside the point, but it will take some effort to sort out the spaghetti of entangled arguments (where the spaghetti is the result of the history in this field, as indicated earlier).

      First of all, it is not unimportant to point out that the way most people use the terms “inferential statistics” and “descriptive statistics” is that the first refers to an activity, and the second to a function of a bunch of numbers, typically data. Inferential statistics is a combination of parameter estimation and model specification (those are activities). Descriptive statistics are for instance the average values of variables of interest (which makes them a function of a set of numbers). When doing inferential statistics (or statistical inference), looking at the descriptive statistics of the dataset is just a routine before the real work begins. It is important to remember that.

      Now I suppose that this reviewer uses these words a little differently. When he or she writes that I “treat statistical quantities as inferential statistics”, I assume that the reviewer means that I want to use a term like for doing statistical inference, or that, when I want to interpret such a term, I include considerations typical of statistical inference. Within the data domain, that is totally correct. In the paper I argue that there are very good reasons for this. We would like to know what the data can tell us about the actual fitness function, and if we do our statistical inference right, and choose our Price-like equation accordingly, then that means that we would be able to give a meaningful interpretation to a term like . It also means that we then have an equation that describes the genetic population dynamics accurately.

      When the reviewer states that other papers treat them as “population level descriptive statistics” in an empirical context, I have a hard time coming up with papers for which that is the case. Most papers apply the Price equation in the modeling domain (That is to say: this is true in evolution. In ecology the Price equation is often applied to data; see Pillai and Gouhier (2019) and Bourrat et al. (2023)). But even if there are researchers that apply the Price equation to data, then considering these statistical quantities as “descriptive statistics” would not make sense. Looking at the descriptive statistics alone is not an empirical exercise; it is just a routine that happens before the actual statistical inference starts. In a data context, saying that considerations that are standard in statistical inference do not apply, because one is just not doing statistical inference, is the equivalent of an admission of guilt. If you do not consider statistical significance, and never mention that sample size could matter, because you are using these terms as “descriptive statistics, not inferential statistics”, then you’re basically admitting to not doing a serious empirical study.

      Besides treating statistical quantities as descriptive statistics in a data context, the reviewer also states that, in a stochastic modeling context, other researchers treat the same statistical quantities as “statistics of random variables”. This is first of all very generous to the existing literature. I imagine that the reviewer is imagining a modeling exercise where for instance the covariance between two variables is postulated. A theory exercise would then take that as a starting point for the derivation of some theoretical result. This, however, is not what happens in most of the literature.

      There are two things that I would like to point out. First of all, postulating covariances and deriving results from assumptions regarding those covariances is not an activity that requires using the Price equation. There are many stochastic models that function perfectly fine without the Price equation. This is maybe a detail, but it is important to realize that what the reviewer probably thinks of as a legitimate theoretical exercise may be something that can very well be done without the Price equation.

      Secondly, I would like to repeat something that I have pointed out before, which is that the Price equation can be written for any transition, whether this transition is likely or unlikely, given a model, and even for transitions that are impossible. For all of those transitions, one can write the (ungeneralized) Price equation, and for all of those, the Price equation will be an identity, and it will contain the things that the reviewer refers to as “statistical quantities”. It is important to realize that these “statistical quantities”, therefore, are properties of a transition, and that every transition comes with its own “statistical quantity”. That implies that they are not properties of random variables; they reflect something regarding one transition. What one could imagine, though, is the following. To fix ideas, let’s take the Price equation in regression form, and focus on 𝛽. A meaningful modeling exercise starts with assumptions about the likelihood of all different transitions, and therefore the likelihood of different values of 𝛽 materializing – or it starts with assumptions that imply those probabilities. In a theoretical exercise, one could then derive statements about the expectation and variance of those “statistical quantities”. For instance, one can calculate the expected value 𝐸[𝛽] and the variance 𝑉𝑎𝑟[𝛽], where this expectation is a proper expectation (taken over the probabilities with which these transitions materialize) and this variance is a proper variance, for the same reason.
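A toy sketch may help separate the two levels. The model below is entirely hypothetical and is not taken from the manuscript or from van Veelen (2005): two parents with p-scores 0 and 1 draw their offspring numbers independently from the stated distributions. Each possible transition has its own 𝛽, and the expectation and variance of 𝛽 are then taken over the probabilities with which the transitions occur.

```python
from fractions import Fraction
from itertools import product

def mean(xs):
    return sum(xs) / Fraction(len(xs))

def beta(w, p):
    """Regression coefficient of w on p for one transition: Cov(w, p) / Var(p)."""
    mw, mp = mean(w), mean(p)
    cov = mean([(wi - mw) * (pi - mp) for wi, pi in zip(w, p)])
    var = mean([(pi - mp) ** 2 for pi in p])
    return cov / var

# Hypothetical model: p-scores of the two parents and, for each parent,
# a distribution over the number of offspring (value -> probability).
p = [0, 1]
offspring_dist = [
    {1: Fraction(1, 2), 2: Fraction(1, 2)},   # parent with p-score 0
    {1: Fraction(1, 2), 3: Fraction(1, 2)},   # parent with p-score 1
]

# Enumerate every transition with its probability and its own beta.
transitions = []
for (w0, pr0), (w1, pr1) in product(offspring_dist[0].items(), offspring_dist[1].items()):
    transitions.append(((w0, w1), pr0 * pr1, beta([w0, w1], p)))

# Proper expectation and variance, taken over the transition probabilities.
expected_beta = sum(prob * b for _, prob, b in transitions)
variance_beta = sum(prob * (b - expected_beta) ** 2 for _, prob, b in transitions)

for w_vec, prob, b in transitions:
    print("w =", w_vec, "| probability =", prob, "| beta =", b)
print("E[beta] =", expected_beta, "| Var[beta] =", variance_beta)
```

The per-transition values of 𝛽 are fixed numbers once the transition is given; only `expected_beta` and `variance_beta` are properties of the assumed probabilistic model.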

      This is what I do on page 416 of van Veelen (2005) and in Section 5 of Appendix A. I think something like this is what the reviewer may have in mind, but it is worth pointing out that this still does not mean that the 𝛽 from the Price equation for any given transition is now a property of a random variable. Much of the literature, however, is not at the level of sophistication that I imagine the reviewer has in mind – although there are papers that are; see the discussion below of Rousset and Billiard (2000) and Van Cleve (2015).

      In the appendix to this reply, I will address the quotes from Gardner, West, and Wild (2011) and Rousset (2015). This takes up some space, so that is why it is at the end of this reply.

      In short, the manuscript seems to argue that Price equation users are performing statistical inference incorrectly, whereas the users insist that they are not doing statistical inference at all.

      That is not what the manuscript argues, but I am happy to clarify. The manuscript explores both the use of the Price equation when applied to data (and therefore for statistical inference) and when applied to transitions in a model. The criticism of the existing literature is not that it performs statistical inference incorrectly. The criticism is that the literature insists on misspecification, which typically happens in a modelling context.

      The problem (and here I think the author would agree with me) arises when users of the Price equation go on to make predictive or causal claims that would require the kind of statistical analysis they claim not to be doing. Claims of the form "Hamilton's rule predicts.." or use of terms like "benefit" and "cost" suggest that one has inferred a predictive or causal relationship in the given data, while somehow bypassing the entire theory of statistical inference.

      I do not really know how to interpret this paragraph. The use of the word “data” suggests that this pertains to a data context, but I do not know what would qualify as a “predictive claim” in that domain, or how any study would go from data to a claim of the form “Hamilton’s rule predicts …”. Again, I do not really know papers that apply the Price equation to data. None of the empirical papers reviewed in Bourke (2014) for instance do. I would however agree that it is close to obvious that an approach that does indeed bypass the entire theory of statistical inference cannot identify causal relations in datasets. I think the examples in Section 2 of Appendix A also clearly illustrate that a literature in which the word “sample size” is absent, cannot be doing statistical inference.

      There is also a third way to use the Price equation which is entirely unobjectionable: as a way to express the relationship between individual-level fitness and population-level gene frequency change in a form that is convenient for further algebraic manipulation. I suspect that this is actually the most common use of the Price equation in practice.

      I am not sure if I understand what it means for the Price equation to “express the relationship between individual-level fitness and population-level gene frequency change”. That is a bit reminiscent of how John Maynard Smith saw the Price equation (Okasha, 2005), but he also emphasized that he was unable to follow George Price and his equation. For sure, it cannot be that one side of the Price equation reflects something at the individual level and the other something at the population level, because both sides of the Price equation are equally aggregated over the population. Just to be safe, and to avoid unwarranted associative thinking, I would therefore choose to be minimalistic, and say that the Price equation is an identity for a transition between a parent population and an offspring population.

      Regardless of the words we choose, however, the question of how harmless or objectionable the use of the Price equation is in the literature is absolutely relevant. In earlier papers I have tried to cover a spectrum of examples of different ways to use (or misuse) the Price equation. In van Veelen (2005) I cover Grafen (1985a), Taylor (1989), Price (1972), and Sober and Wilson (2007). The main paper that is discussed in van Veelen et al. (2012) is Queller (1992b), but Section 7 of that paper also discusses the way the Price equation is used in Rousset and Billiard (2000), Taylor (1989), Queller (1985), and Page and Nowak (2002). These discussions also come with a description of how much it takes to repair them, and this varies all the way from nothing, or a bit of minor rewording, to being beyond repair.

      What is good to observe, is that the papers in which the use of the Price equation is the least problematic, are also the papers in which, if the reference to the Price equation would be taken out, nothing really changes. These are papers that start with a model, or a collection of models, and that, at some point in the derivation of their results, point to a step that can, but does not have to be described as using the Price equation. An example of this is Rousset and Billiard (2000); see the detailed description in Section 7 of van Veelen et al. (2012).

      I am happy to point to a few more papers on the no harm, no foul end of the spectrum here.

      Allen and Tarnita (2012) discuss properties of the dynamics in a well-defined set of models. Towards the end of the paper, a version of the Price equation more or less naturally appears. This is more of an interesting aside, though, and does not really play a role in the derivation of the core results of the paper. Van Cleve (2015) is similar to Rousset and Billiard (2000), in that the “application of the Price equation” there is a minor ingredient of the derivation of the results. (A detail that this reviewer may find worth mentioning, given earlier comments, is that Van Cleve (2015) writes the left-hand side of the Price equation as 𝐸(𝑤Δ𝑝|𝐩), instead of w̄Δp̄. First, two very unimportant things. Van Cleve (2015) uses 𝑤 for mean fitness, for which w̄ is a more common symbol. Another detail of lesser importance is that it includes the vector of parent p-scores in the notation, which in their notation is 𝐩. More important, however, is that Van Cleve (2015) writes 𝐸(Δ𝑝) for Δp̄, which extends the (mis)use of the symbol 𝐸 for what really is just an average. This is consistent within the Price equation, in the sense that it now denotes the average with 𝐸, both on the right-hand side and on the left-hand side of the Price equation. It can however be a little bit confusing, because when Rousset and Billiard (2000) write an expectation, then this is a proper expectation. In their case, this summarizes all possible transitions out of a given state, and weighs them by their probabilities of happening, given a state summarized by 𝑝.) I am also happy to extend the spectrum a bit here. Some papers on inclusive fitness do not use the Price equation at all, even though one could imagine places where it could be inserted. A nice example of such a paper is Taylor et al. (2007).

      In this paper, I hope I can be excused from taking a complete inventory of this literature, and I hope that I do not have to count how many papers fall into the different categories. This would help assess the veracity of the suspicion the reviewer has, which is that the most common use of the Price equation is entirely unobjectionable, but I just do not have the time. I would however not want to underestimate the aggregate damage done in this field. The spectrum spanned in my earlier papers does include a fair amount of nonsense results. This typically happens in papers that do not study a specific model or set of models, but that take the Price equation as their point of departure for their theorizing. Also there seems to be a positive correlation between how exalted and venerating the language is that is used when describing the wonders and depths of the Price equation, and how little sense the claims make that are “derived” with it.

      We also should not set the bar too low. This is a literature that, at the starting point, has a few construction errors in it, as described in the paper. That is reason for concern. Moreover, one of the main end products of this literature is what we send our empiricists to the field with. As Section 8 of van Veelen et al. (2017) indicates, what we have supplied to our empiricists to work with is nothing short of terrible. I would therefore want to maintain that the damage done is enormous, and if there are also a few papers around that may use the ungeneralized Price equation in an innocuous way, then that is not enough redemption for my taste. We are still facing a literature in which, at every instance where the Price equation is used, we still need to check in which category it falls.

      For a paper that aims to clarify these thorny concepts in the literature, I think it is worth pointing out these different interpretations of statistical quantities in the Price equation (descriptive statistics vs inferential statistics vs algebraic manipulation). One can then critique the conclusions that are inappropriately drawn from the Price equation, which would require rigorous statistical inference to draw. Without these clarifications, supporters of the Price equation will again argue that this manuscript has misunderstood the purpose of the equation and that they never claimed to do inference in the first place.

      I would like to return to the point that I made at the beginning of my response to point (4), which is that the “thorniness” of these concepts is the result of the warp in the literature, resulting from the construction errors in Price (1970). If people want to understand how to apply the Price equation right, I think that reading Appendix A and B would work just fine. Again, I have not read anything that suggests that there is anything incorrect in there, so if the literature contains “thorny” concepts, it might just be that this is the result of the mental gymnastics necessitated by the unwillingness to accept that there might be something not completely right with Price (1970). Moreover, given my experiences in the field, I am not sure that there is anything that I could say that would convince the supporters of the ungeneralized Price equation.

      (5) "True" models: Even if one accepts that the statistical quantities in the Price equation are inferential in nature, the author appears to go a step further by asserting that, even in empirical populations, there is a specific "true" model which it is our goal to infer. This assumption manifests at many points in the SI when the author refers to the "true model" or "true, underlying population structure" in the context of an empirical population.

      Again, in Appendix A I explore both a data context and a modeling context. In the modeling context none of this applies, because in such a context, there is only the model that we postulate. In the part in which I explore what the Price equation can do in a data context, I do indeed use words like “true model” or "true underlying population structure".  

      I do not think it is necessary or appropriate, in empirical contexts, to posit the existence of a Platonic "true" model that is generating the data. Real populations are not governed by mathematical models. Moreover, the goal of statistical inference is not to determine the "true model" for given data but to say whether a given statistical model is justified based on this data. Fitting a linear model, for example, does not rule out the possibility that there may be higher-order interactions - it just means we do not have a statistical basis to infer these higher-order interactions from the data (say, because their p-values are insignificant), and so we leave them out.

      This remark suggests that the statistical approach in Sections 2-4 of Appendix A is more naïve than it should be, and that I would overlook the possibility of, for instance, interaction effects that are really nonzero, but that are statistically not significant. Now first of all, at a superficial level, I would like to say that this strikes me as somewhat inconsistent. In the remarks further back, the reviewer seems to excuse those that use the Price equation on data without any statistical considerations whatsoever. The reason why the reviewer is giving them a pass, is that they are “just not doing statistical inference”. Instead, they are doing this whole other thing with, you know, descriptive statistics. As I indicated above, that is just a fancy way of saying that they are not doing serious statistics – or serious empirics, for that matter.

      In this comment, on the other hand, the reviewer also suggests that the statistics that I use to replace the total absence of any statistical considerations with, is not quite up to snuff. Below, I will indicate why that is not the case at all, but I think it is also worth registering a touch of irony there.

      In order to address this issue, it is worth first observing that the whole of classical statistics is based on probability theory in the following sense. We are always asking ourselves the question: if the data generating process works like this, what would the likelihood be of certain outcomes (datasets); and if the data generating process works some other way (sometimes: the complement of whatever “this” is), what would the likelihood then be of the same outcomes. By comparing those, we draw inferences about the underlying data generating process (which is a word suggestive of a “Platonic” world view that the reviewer seems to reject). Therefore, if one would impose a ban on using Platonic words like “true data generating process”; “actual fitness function”; or “the population structure that is out there”, it would be impossible to teach any course in statistics, basic or advanced. Also it would be impossible to practice, and talk about, applied statistics.

      Now the reviewer claims that “Real populations are not governed by mathematical models”. I do not really know if I agree or disagree with that statement, but the example that the reviewer gives does not fit that claim. The reviewer suggests that if we find a higher order term not to be statistically significant (and therefore we reject the hypothesis that it is nonzero), then that would not necessarily mean that it is not there. That is totally true, and statisticians tend to be fully aware of that. But that does not imply that there is no true data-generating process; the whole premise of this example is that there is, but that the sample size is not large enough to determine it in a detailed enough way so as to include this interaction effect, that apparently is small relative to the sample size.

      The third thing to reflect on here, is that the reviewer seems to suggest that the Generalized Price equation in regression form, as presented in my paper, comes with a specific statistical approach, that he or she classifies as philosophically naïve or unsophisticated. That, however, is not the case, and I am very grateful that this remark by this reviewer allows me to make a point that I think shines a light on how the Generalized Price equation puts the train that started going off the rails in 1970 back on track, and reconnects it with the statistics it borrows its terminology from. To see that, it is good to be aware that statistics never gives certainty. The whole discipline is built around the awareness that it is possible to draw the wrong inference, and the aim is to determine, minimize, and balance, the likelihoods of making different wrong inferences. So, statistics produces statements about the confidence with which one can say that something works one way or the other. In some instances, the data are not enough to say anything with any confidence. In other cases, the data are rich enough so that it is really unlikely that we incorrectly infer that for instance a certain gene matters for fitness.

      The nice thing about the setup with the Generalized Price equation, is that those statistical considerations translate one-to-one to considerations regarding which Price-like equation to choose. If the data do not allow us to pick any model with confidence, then we should be equally agnostic about which Price-like equation describes the population genetic dynamics accurately. If the statistics gives us high confidence that a certain model matches the data, then we should pick the matching Price-like equation with the same confidence. This also carries over to higher level statistical considerations.

      If we think about terms that, if we would gather a gargantuan amount of data, might be statistically significant, but very small, then economists call those statistically significant, but economically insignificant. When rejecting the statistical significance on the basis of a not gargantuan dataset, statisticians are aware that terms that really have a zero effect, as well as terms, the effect of which is really small, are rejected with the same statistical test – and that we should be fine with that. All such considerations carry over to what we think of regarding the choice of a Price-like equation to describe the population genetic dynamics. Even if people disagree about whether or not to include a term that is statistically significant, but relatively small, such a disagreement can still happen within this setup, and just translates to a disagreement on which Price-like equation to choose.

      Similarly, people could also disagree about whether it is justified to use polynomials to characterize a fitness function. If we decide that we can, because of Taylor expansions, then the core result of the paper implies that the population genetic dynamics can be summarized by a generalized Hamilton’s rule (as long as the fitness function includes a constant and a linear term regarding the p-score). On the other hand, if we do not believe this is justified, and prefer to use an altogether different family of fitness functions, then we can no longer do this. All of this leaves space for all kinds of statistical considerations and disagreements, that just carry over to the choice for one or the other Price-like equation as an accurate description of the population genetic dynamics. Or, if one does not believe polynomials should be used, then this leads to not picking any Price-like equation at all.

      So, this is a long way of saying that the Generalized Price equation creates space for all statistical considerations to regain their place, and does not hinge on one approach to statistics or another.

      What we can say is that if we apply the statistical model to data generated by a probabilistic model, and if these models match, then as the number of observations grows to infinity, the estimators in the statistical model converge to the parameters of the data-generating one.

      But this is a mathematical statement, not a statement about real-world populations.

      Again, I do not know if I agree or disagree with the last sentence. However, that does not really matter, because either option only has implications for how we are to think of the relation between a Price-like equation describing a population genetic dynamics and real-world populations. It is not relevant for the question which Price-like equation to pick, or whether to pick one at all.
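      The convergence statement quoted above (a statistical model that matches the data-generating one, with estimators approaching the true parameters as the number of observations grows) can be illustrated with a small simulation. This is a hedged sketch, not an analysis from the paper: the linear data-generating model, its parameter values, the uniform distribution of p-scores, the noise level, and the names `simulate` and `ols_slope` are all assumptions chosen for illustration.

```python
import random

def ols_slope(p, w):
    """Least-squares slope of w on p (with an intercept)."""
    n = len(p)
    p_bar = sum(p) / n
    w_bar = sum(w) / n
    cov = sum((pi - p_bar) * (wi - w_bar) for pi, wi in zip(p, w))
    var = sum((pi - p_bar) ** 2 for pi in p)
    return cov / var

def simulate(n, alpha, beta, sigma, rng):
    """Draw n observations from the assumed model w = alpha + beta * p + noise."""
    p = [rng.random() for _ in range(n)]
    w = [alpha + beta * pi + rng.gauss(0.0, sigma) for pi in p]
    return p, w

rng = random.Random(0)   # fixed seed so that the run is reproducible
TRUE_BETA = 2.0          # hypothetical "true" slope of the data-generating model
estimates = {}
for n in (50, 5_000, 50_000):
    p, w = simulate(n, alpha=1.0, beta=TRUE_BETA, sigma=0.5, rng=rng)
    estimates[n] = ols_slope(p, w)
    print(n, estimates[n])
```

      With small samples the estimate can be far from the postulated slope, and with large samples it settles close to it. That is a statement about the estimator under a matching model, and it says nothing yet about which model matches a real population.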

      A resolution I suggest to points 3, 4, and 5 above is:

      *A priori, the statistical quantities in the Price Equation are descriptive statistics, pertaining only to the specific population data given.

      *If one wishes to impute any predictive power, generalizability, or causal meaning to these statistics, all the standard considerations of inferential statistics apply. In particular, one must choose a statistical model that is justified based on the given data. In this case, one is not guaranteed to obtain the standard (linear) Hamilton's rule and may obtain any of an infinite family of rules.

      *If one uses a model that is not justified based on the given data, the results will still be correct for the given population data but will lack any meaning or generalizability beyond that.

      *In particular, if one considers data generated by a probabilistic model, and applies a statistical model that does not match the data-generating one, the results will be misleading, and will not generalize beyond the randomly generated realization one uses.

      Of course, the author may propose a different resolution to points 3-5, but they should be resolved somehow. Otherwise, the terminology in the manuscript will be incorrect and the ms will not resolve confusion in the field.

      I have outlined my solutions extensively above. I really appreciate that Reviewers #1 and #2 have spent time and attention on the manuscript and on the long appendices.  

      Appendix to the response to reviewer #2: Some remarks on Gardner, West & Wild (2011), Frank (2012), and Rousset (2015)

      An accurate response to the quote from Gardner, West, and Wild (2011) in the review report takes up space. I therefore wanted to put that in an appendix to the response to reviewer #2. I also include a few paragraphs regarding Frank (2012) and Rousset (2015), both of which are also mentioned by reviewer #2. All of this might also be of interest to people that are curious about how what I find in my paper relates to the existing literature.

      Gardner, West & Wild (2011)

      The quote I am responding to is “it is a mistake to assume that it must be the evolutionary theorist, writing out covariances, who is performing the equivalent of a statistical analysis”. I want to put that into context, so I will go over the whole paragraph that surrounds the quote. The paragraph is called Statistics and Evolutionary Theory and can be found on page 1038 of the paper. I think that it is worth pointing out that it is not easy to respond to their somewhat impressionistic collages of words and formulas. I will therefore cut the paragraph up in a few smaller bits and try to make sense of it bit by bit. The paragraph begins with:

      “Our account of the general theory of kin selection has been framed in statistical terms.” Based on what they write two sentences down, the best match between those words and what they do in the paper would be: “our account uses words like “covariance”, “variance” and “expectation” for things that are not what “covariance”, “variance” and “expectation” mean in probability theory and statistics.” I would be totally open to an argument why that is nonetheless OK to do, but the way Gardner, West, and Wild (2011) phrase it obscures the fact that this needs any justification or reflection at all. “Framing something in statistical terms” is unspecific enough to sound completely harmless.

      “The use of statistical methods in the mathematical development of Darwinian theory has itself been subjected to recent criticism (van Veelen, 2005; Nowak et al., 2010b), so we address this criticism here.”

      Also here, specifics would be helpful. The “use of statistical methods” sounds like it is more than just using terms from statistics, so this might refer to the minimizing of the sum of squared differences, which is also mentioned a sentence down in Gardner, West, and Wild (2011). If it does, then it is worth observing that in statistics, the minimizing of the sum of squared differences (or residuals, or errors) comes with theorems that point very clearly to what is being achieved by doing this. The Gauss–Markov theorem states that the ordinary least squares (OLS) estimator has the lowest variance within the class of linear unbiased estimators. This implies that minimizing the sum of squared errors helps answer a well-defined question in statistics; under certain conditions, an OLS estimator is our best shot at uncovering an unknown relation between variables. To also minimize a sum of squared differences, but now in the modeling domain, qualifies as “use of statistical methods” only in a very shallow way. It means that a similar minimization is performed. Without an equivalent of the Gauss–Markov theorem that would shine a light on what it is that is being achieved by doing so, that does not carry the same weight as it does in the statistics domain – in that it does not carry any weight at all.

      “The concern is that statistical terms – such as covariances and least-squares regressions – should properly be reserved for conventional statistical analyses, where hypotheses are tested against explicit data, and that they are out of place in the foundations of evolutionary theory (van Veelen, 2005; Nowak et al., 2010b).”

      Again, a few things are a bit vague. What are “explicit data”? Are there data that are not explicit? Why the generic “foundations of evolutionary theory”, instead of a more specific description of what these statistical terms are used for? But either way, this is a misrepresentation of what I wrote in van Veelen (2005). I did not suggest to “reserve statistical terms for conventional statistical analysis” just because. As I do here in the current paper, what I did there was explore the possibilities for the Price equation to help with what I then called Type I and Type II questions. Type I questions find themselves in the modeling domain and Type II questions find themselves in the statistical domain. I was not arguing for a ban on applying statistical concepts outside of the domain of statistical inference. All that I said is that in its current practice, it does not really help answering questions of either type.  

      “However, this concern is misplaced. First, natural selection is a statistical process, and it is therefore natural that this should be defined in terms of aggregate statistics, even if only strictly by analogy (Frank, 1997a, 1998).”

      This is a vague non-argument. Almost nothing is well-defined here. What does it mean for natural selection to be a statistical process? Is that just an unusual term for a random process? If so, then I suppose I agree, but that has nothing to do with what I state or claim. And what does it mean to be defined in terms of aggregate statistics? What is the alternative? I have no idea how any of this relates to anything that I claim or state in my papers.

      “Second, Fisher (1930, p198) coined the term ‘covariance’ in the context of his exposition of the genetical theory of natural selection, so the evolutionary usage of this term has precedent over the way the term is used in other fields.”

      This is what I would call a “historic fallacy”. The fact that Fisher coined the term “covariance” in a book on genetics and natural selection does not mean that any “evolutionary usage” of the term “covariance”, however nonsensical, now has precedent over the way the term is used in other fields. Irrespective of the path that the history of science, genetics, or statistics took, right now we are in a place where about every student at every university anywhere in the world that takes a course in probability theory and/or statistics, learns that covariance is a property of a random variable (see also Wikipedia). And they do for a very good reason; it is essential in recognizing the relation between probability theory on the one hand and statistics on the other. Being curious how this “evolutionary usage” of the term covariance works, if covariance turns out not to be a property of a random variable, is therefore perfectly justified, and “Fisher coined the term” is not a safe word that exempts it from scrutiny. 

      “Third, it is a mistake to assume that it must be the evolutionary theorist, writing out covariances, who is performing the equivalent of a statistical analysis.”

      Again, that is just not what anyone is saying. Nobody is suggesting that an evolutionary theorist should perform the equivalent of statistical analysis. All I did was point to how little is being achieved by transferring formulas from statistics to a modeling context.

      “A better analogy is to regard Mother Nature in the role of statistician, analysing fitness effects of genes by the method of least-squares, and driving genetic change according to the results of her analyses (cf. Crow, 2008).”

      I have no idea what any of this means. Mother Nature is a personification of something that is not a person, and that does not have cognition. Without sentience, “Mother Nature” cannot assume the role of statistician, and cannot analyse fitness effects.

      “More generally, analogy is the basis of all understanding, so when isomorphisms arise unexpectedly between different branches of mathematics (in this case, theoretical population genetics and statistical least-squares analysis) this represents an opportunity for advancing scientific progress and not an anomaly that is to be avoided.”

      This is a strawman argument, puffed up with platitudes. Nobody is arguing against analogies. But what is the analogy supposed to be here? Just taking least squares from statistical inference and performing it in a modeling context does not make it an analogy. The Gauss–Markov theorem, which is the basis for why least squares helps answer questions in statistics, just does not mean anything in a modeling context. OLS in modeling is just willful misspecification, and nothing that it does in statistics translates to anything meaningful in modeling. Again, declaring it an analogy, or an isomorphism, does not make it one.

      Frank (2012)

      Because the reviewer also mentions Frank (2012), I would like to include a small remark on this paper too. “Natural Selection. IV. The Price equation” by Frank (2012) is partly a response to my earlier criticism of the use of the Price equation. Much like Gardner, West, and Wild (2011), I would describe this paper as what is called a “flight forwards” in Dutch. While the questions I ask are relatively prosaic (such as: how does the Price equation help derive a prediction from model assumptions?), Frank (2012) pivots to suggesting that there is a profound philosophy-of-science disagreement that I am on the wrong side of. It is close to impossible to respond to Frank (2012), because it is a labyrinth of arguments that sound deep and impressive, but that are just not specific enough to know how they relate to points that I made – or even just what they mean in general. Just to pick a random paragraph:

      “Is there some reorientation for the expression of natural selection that may provide subtle perspective, from which we can understand our subject more deeply and analyse our problems with greater ease and greater insight? My answer is, as I have mentioned, that the Price equation provides that sort of reorientation. To argue the point, I will have to keep at the distinction between the concrete and the abstract, and the relative roles of those two endpoints in mature theoretical understanding.”

      For many of those terms, I have no real idea what they mean, and also reading the rest of the paper does not help understanding what this has to do with the more prosaic questions that are waiting for an answer. What is “reorientation”? What does “concrete” versus “abstract” have to do with the question what is being achieved by doing least squares regressions in modeling? What would be an example of a mature and an immature theoretical understanding?

      Rousset (2015) is also mentioned by the reviewer. This paper is not esoteric. It states, as reviewer #2 points out, that "neither data nor inferences are considered". This paper therefore finds itself in the modeling domain, and not in the data domain. It does however still dodge the question what the benefits are of misspecification in the modeling domain. As a matter of fact, it denies that there is misspecification at all.

      “In the presence of synergies, the residuals have zero mean and are uncorrelated to the predictors. No further assumption is made about the distribution of the residuals. Thus, there is no sense in which the regression is misspecified.”

      This is a remarkable quote, and testament to the lasting impact of the construction errors in Price (1970). Misspecification is literally defined as getting the model wrong. In statistics, avoiding misspecification can be complicated, because of the noise in the data. The real data-generating process is unknown, and because of the noise, there is always the possibility that data that are generated by one model look like they could also have been generated by another. The challenge is to reduce the odds of getting the model wrong to acceptable proportions, which is what statistical tests are for. But in modeling, we know what the model is; it is postulated by the modeler. Therefore, misspecification can be avoided by just not replacing it with a different model.

      What is being discussed in this part of Rousset (2015) is replacing what in this manuscript is called Model 3 (𝑤<sub>𝑖</sub> = 𝛼 + 𝛽<sub>1,0</sub>𝑝<sub>𝑖</sub> + 𝛽<sub>0,1</sub>𝑞<sub>𝑖</sub> + 𝛽<sub>1,1</sub>𝑝<sub>𝑖</sub>𝑞<sub>𝑖</sub> + 𝜀<sub>𝑖</sub>) with Model 2 (𝑤<sub>𝑖</sub> = 𝛼 + 𝛽<sub>1,0</sub>𝑝<sub>𝑖</sub> + 𝛽<sub>0,1</sub>𝑞<sub>𝑖</sub> + 𝜀<sub>𝑖</sub>), and choosing the parameters in Model 2 so that it is as close as it can be to Model 3. This is just the definition of misspecification. That is to say: the misspecification part is the choosing of Model 2 as a reference model. The minimizing of the sum of squared residuals one could consider as minimizing the damage.

      While Rousset (2015) finds itself in the modeling domain, it does nonetheless point to the field of statistics here, by stating that “the residuals have zero mean and are uncorrelated to the predictors”. From this, the paper concludes that “there is no sense in which the regression is misspecified”. That is just plain wrong. Minimizing the sum of the squared residuals guarantees that the residuals are uncorrelated with the variables that are included in the reference model, with respect to which the squared sum of residuals is minimized. The criterion that Rousset (2015) uses is that the model is well-specified if there is no correlation between the residuals (here: 𝜀<sub>𝑖</sub>) and the variables included in the reference model (here: 𝑝<sub>𝑖</sub> and 𝑞<sub>𝑖</sub>). But according to this criterion, all models would always be well-specified, and no model could ever be misspecified. The correct criterion, however, also requires that the residuals are not correlated with variables not included in the reference model. And here, the residuals are in fact correlated with 𝑝<sub>𝑖</sub>𝑞<sub>𝑖</sub>, which is the variable that is included in Model 3, but not in Model 2. Therefore, according to the correct version of this criterion, this model is in fact misspecified – as it should be, because getting the model wrong is the definition of misspecification.

      In order to make sure that there can be no misunderstanding, I have added subsections at the end of Section 2 and Section 4 of Appendix A, and at the end of Section 2 of Appendix B. These subsections show that the algebra of minimizing the sum of squared errors implies that there is no correlation between the errors, or the residuals, and the variables that are included in the model. This is by no means something new; it is the reason why we do OLS to begin with. For additional details about misspecification, I would refer to Section 1b (viii) in van Veelen (2020).
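      The point about residuals can also be checked with a few lines of arithmetic. The sketch below is an illustration under stated assumptions, not the derivation from the appendices: fitness values are generated without noise from a model with an interaction term (playing the role of Model 3), a model without the interaction term (playing the role of Model 2) is fitted by least squares, and the covariances of the residuals with included and omitted variables are compared. The parameter values, the four-individual population, and the helper names `solve`, `ols`, and `cov` are hypothetical.

```python
from fractions import Fraction as F

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination (exact with Fractions)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        piv = next(r for r in range(c, n) if m[r][c] != 0)
        m[c], m[piv] = m[piv], m[c]
        m[c] = [v / m[c][c] for v in m[c]]
        for r in range(n):
            if r != c:
                factor = m[r][c]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[c])]
    return [row[n] for row in m]

def ols(columns, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    return solve(xtx, xty)

def cov(x, y):
    """Population covariance (a plain average, not an estimator)."""
    n = len(x)
    return sum(a * b for a, b in zip(x, y)) / n - (sum(x) / n) * (sum(y) / n)

# Hypothetical population: own p-score p, partner p-score q, all four combinations.
p = [F(0), F(0), F(1), F(1)]
q = [F(0), F(1), F(0), F(1)]
pq = [pi * qi for pi, qi in zip(p, q)]
one = [F(1)] * len(p)

# Fitness from a model WITH interaction term (illustrative parameter values).
w = [1 - pi + 2 * qi + pqi for pi, qi, pqi in zip(p, q, pq)]

# Reference model WITHOUT the interaction term, fitted by least squares.
a, b_p, b_q = ols([one, p, q], w)
resid = [wi - (a + b_p * pi + b_q * qi) for wi, pi, qi in zip(w, p, q)]

print(cov(resid, p), cov(resid, q), cov(resid, pq))
```

      The residuals are uncorrelated with 𝑝 and 𝑞 by construction, whichever model generated the fitnesses, so that property cannot serve as a test of whether the reference model is well-specified. Their covariance with the omitted product 𝑝𝑞 is nonzero in this example.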

      Finally, there is a detail worth noticing. In the main text, as well as in Appendix B, I use an analogy (and, unlike what Gardner, West, and Wild, 2011, refer to as an analogy, this actually is one). This is an analogy between two choices. On the one hand, there is the choice between Price-like equation 1 (based on Model 1 as a reference model) and Price-like equation 2 (based on Model 2 as a reference model) both applied to Model 2. On the other hand, there is the choice between Price-like equation 2 (based on Model 2 as a reference model) and Price-like equation 3 (based on Model 3 as a reference model) both applied to Model 3. Model 1 is the non-social model, Model 2 is the social model without interaction term, and Model 3 is the social model with interaction term. That makes the first choice a choice between treating a social model as a social model, or as a non-social model. The second choice is between treating a social model with interaction term as a social model with interaction term, or as a social model without interaction term. The power of this analogy is that every argument against treating the social model as if it is a non-social model is also an argument against treating the social model with interaction term as if it is a social model without interaction term.

      This ties in with the incorrect criterion for when a model is well-specified from Rousset (2015) as follows. His criterion (that there should be no correlation between the residuals and the variables in the model) declares the social model without interaction term well-specified as a reference model, when we are considering a social model with interaction term. According to the same criterion, however, the non-social model would also have to be declared to be well-specified as a reference model, when the model we are considering is a social model. The reason is that also here, there is no correlation between the residuals and the variables that are included in this model. This is clearly not what anyone is advocating for, and for good reasons. The residuals here would, after all, be correlated with the p-score of the partner, which is a variable that is not included in the non-social model. This is a good indication that we should not use the non-social model for a social trait.

      Reviewer #3 (Public review):

      Before responding to this review, I would like to express that I appreciate the fact that the reviews and the responses are public at eLife. Besides just being useful in general, this also allows readers to get a behind the scenes glimpse into the state of the field, and the level of the reviewing. While the reports by Reviewers #1 and #2 show openness and an interest in getting things right, the report by Reviewer #3 is representative of the many review reports that I have received from the inclusive fitness community in the past. These reports tend to be rhetorically strong, and to those who do not have the time to dig deeper in the details, these reports are probably also convincing. I will therefore go through this review line by line to show how little there is behind the confident off-hand dismissal.

      There is an interesting mathematical connection - an "isomorphism" - between Price's equation and least-squares linear regression.

      This is esoteric and needlessly vague. Why is the word “isomorphism” used? In mathematics, an isomorphism is a structure-preserving mapping. The Price equation is an equation, or an identity, which makes it a bit difficult to imagine what the set of objects is on one end of the mapping. Least-squares linear regression can perhaps be seen as a function of a dataset, which would make it a single object (one function). This complicates things at the other end of the mapping too, if that set is a singleton set. The only isomorphism that I can think of is a trivial isomorphism where one equation is mapped onto one function and vice versa. It seems unlikely that this is what the reviewer means. The word isomorphism moreover is in quotes, so maybe this is supposed to be figurative. But what would it be that is being suggested here by this figure of speech? Just saying that there is, as the reviewer puts it, an “interesting mathematical connection”, does not make it so. It would already be a start to just specify what the mathematical connection is, because I have a hard time seeing what that would be. Is it just that, if you divide the Cov(𝑤, 𝑝)-term by the Var(𝑝)-term, then you get a regression coefficient? If that is what the reviewer has in mind, that would be a rather shallow observation.

      Some people have misinterpreted this connection as meaning that there is a generality-limiting assumption of linearity within Price's equation, and hence that Hamilton's rule - which is derived from Price's equation - provides only an approximation of the action of natural selection.

      Here, the reviewer pulls a switcheroo. The use of the word “general”, or “generality”, here refers to the fact that the classical Price equation is an identity for all possible transitions between a parent and an offspring population. This is the sense in which the inclusive fitness literature uses the word general, and so do I in the relevant places in the manuscript. When I do, I make sure to add phrases like “in the sense that whatever the true model is, it always gets the direction of selection right”. As a consequence, the classical Hamilton’s rule is also totally general, in the same sense.

      One of the core points of the paper is that this is not unique to the classical Price equation. As a matter of fact, there is a large set of Price-like equations and Hamilton-like rules that are equally much identities, and equally much general (in the sense that they get the direction of selection right for all possible transitions). The being an identity and being completely general (in this sense) therefore cannot be a decisive criterion in favour of the classical Price equation and the classical Hamilton’s rule.

      On the other hand, the way in which my Generalized Price equation and my generalized version of Hamilton’s rule are general, is that they do not restrict the statistical model with respect to which errors are squared, summed and minimized to one linear statistical model. This generalization generates the variety of Price-like equations and Hamilton-like rules mentioned above (all of which are general in the sense of always getting the direction of selection right) and it gives us the flexibility to pick one that separates terms that reflect the fitness function from terms that reflect the population state.

      In response to my generalizing the Price equation and Hamilton’s rule in this second sense, the criticism of the reviewer comes down to saying that the Price equation and Hamilton’s rule do not need generalizing, because they already are general – the switcheroo being that this refers to generality in the first sense. That makes it sound like this could be an honest mistake, confusing one way in which these can be described as general with another. However, I really hammered this point home in the manuscript. Even a cursory reading of the manuscript reveals that I am fully aware that the classical Price equation and the classical Hamilton’s rule are general in the first sense.

      It is also not helpful that, as a description of what I supposedly claim, this is impressionistic, and lacks specificity. The Price equation is an equation, or an identity. What does it mean for there to be an “assumption of linearity” within it? For the classical Price equation in covariance form (which Reviewer #2 argues is what most people think of as “the Price equation”) there is no way in which one can transform this into a meaningful statement. There is just nothing in there to which the adjective “linear” can be applied. Linearity only becomes a thing when we ask ourselves how we can interpret the regression coefficient in the classical Price equation in regression form. That would be the linearity of the statistical model the differences with which are squared, summed and minimized in the regression.

      This is in contrast to the majority view that Hamilton's rule is a fully general and exact result.

      Again, in this manuscript, I write, time and again, that the classical Hamilton’s rule is fully general (in the sense that it applies to any transition), and exact (if that means that it always gets the direction of selection right). So, this is clearly not where the contrast with the majority view lies. The contrast with the majority view is that the majority insist on misspecification, and I suggest not to do that.

      To briefly give some mathematical details: Price's equation defines the action of natural selection in relation to a trait of interest as the covariance between fitness 𝑤 and the genetic breeding value 𝑔 for the trait, i.e. Cov(𝑤, 𝑔);

      The Price equation is an identity, not a definition. When deciding on a definition, there is some freedom. We can choose to define ⊂ so that 𝐴 ⊂ 𝐵 means that 𝐴 is a strict subset of 𝐵; or we can choose to define ⊂ so that 𝐴 ⊂ 𝐵 means that 𝐴 is a (not necessarily strict) subset of 𝐵. The Price equation does not “define the action of natural selection”, because it is an identity. There is no freedom to “define” any other way.

      The more serious reason why this is conceptually also a little dangerous, is the following. Imagine a locus with two alleles. Both of them are non-coding bits of DNA. Selection therefore does not act on either of them. Now imagine a parent population with an average p-score of 0.5, or, in other words, the frequency of these alleles in the parent population is 50-50. That makes the expected value of the p-score in the offspring population 0.5 too. In finite populations, however, randomness can make the p-score grow a bit larger or a bit smaller than 0.5. If the parent population is small, the variance (the expected squared deviation from 0.5) can actually be sizeable. If the p-score in the offspring population lands above 0.5, then the Price equation has a positive change in the average p-score and a Cov(𝑤, 𝑝) > 0. Describing the Price equation as “defining the action of natural selection” now suggests that higher p-scores have been selected for (or, in other words, that “the action of natural selection in relation to a trait of interest” is positive). With equal probability, however, the change in the average p-score is negative and therefore also Cov(𝑤, 𝑝) < 0, and this would then make us draw the opposite conclusion, that natural selection has acted to lower the p-scores in the population. Both of those would be wrong, because in this situation, it would have been randomness that changed the average p-score.
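This thought experiment is easy to simulate. The sketch below assumes a hypothetical neutral population of 20 parents in which every offspring picks its parent uniformly at random; the population size, the number of replicates and the function name are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)


def neutral_generation(n_parents=20, rng=rng):
    """One generation without selection.

    Returns (cov_wp, delta_p): the covariance between offspring number and
    p-score, and the change in the average p-score.
    """
    p = np.array([0.0, 1.0] * (n_parents // 2))  # alleles at 50-50
    # Every offspring picks a parent uniformly at random: pure drift.
    pvals = np.full(n_parents, 1.0 / n_parents)
    w = rng.multinomial(n_parents, pvals).astype(float)
    cov_wp = np.mean(w * p) - np.mean(w) * np.mean(p)
    # Offspring inherit the parental p-score exactly (no mutation).
    delta_p = np.sum(w * p) / np.sum(w) - np.mean(p)
    return cov_wp, delta_p


results = np.array([neutral_generation() for _ in range(2000)])
cov_wp, delta_p = results[:, 0], results[:, 1]

share_positive = float(np.mean(cov_wp > 0))
share_negative = float(np.mean(cov_wp < 0))
mean_cov = float(np.mean(cov_wp))
print(share_positive, share_negative, mean_cov)
```

Because the population size is constant, average fitness is 1 and the covariance equals the change in the average p-score in every replicate. Its sign is positive in a sizeable share of replicates and negative in a similar share, although no selection is present in the simulation.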

      this is a fully general result that applies exactly to any arbitrary set of (𝑔, 𝑤) data; without any loss of generality this covariance can be expressed as the product of genetic variance Var(𝑝) and a coefficient 𝑏(𝑔, 𝑤), the coefficient simply being defined as 𝑏(𝑔, 𝑤) = Cov(𝑤, 𝑔)/Var(𝑝) for all Var(𝑝) > 0; it happens that if one fits a straight line to the same (𝑔, 𝑤) data by means of least-squares regression then the slope of that line is equal to 𝑏(𝑔, 𝑤).

      Why this needs to be explained is a bit of a mystery. These “mathematical details” are in almost all Price equation papers, and they are the point of departure of my Appendix A (it is on page 7 of a more than 90 page long set of appendices). Seeing the need to explain this suggests that the reviewer thinks that there is a chance that I or anyone reading this paper would have missed this. I have not, and, more importantly, none of this invalidates the point I make in the paper.   
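For completeness, the identity the reviewer describes can be verified in a few lines. This is a sketch on hypothetical data that are deliberately non-linear in 𝑔; the data-generating function and sample size are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical (g, w) data; fitness is deliberately NOT linear in g.
g = rng.normal(size=500)
w = 1.0 + 0.3 * g**2 + rng.normal(scale=0.5, size=500)

# Ratio of covariance to variance.
cov_wg = np.mean((w - w.mean()) * (g - g.mean()))
var_g = np.mean((g - g.mean()) ** 2)
b_ratio = cov_wg / var_g

# Slope of the least-squares straight line through the same data.
slope, intercept = np.polyfit(g, w, deg=1)
print(b_ratio, slope)
```

The two numbers agree whatever the data look like, which is why the identity by itself says nothing about whether a straight line is the right reference model.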

      All of this has already been discussed, repeatedly, in the literature.

      All of this has already been discussed, repeatedly, in the literature indeed. It is just that it does not engage with anything I write in the manuscript, or that I wrote in my other papers.

      Now turn to the present paper: the first sentence of the Abstract says "The generality of Hamilton's rule is much debated", and then the next sentence says "In this paper, I show that this debate can be resolved by constructing a general version of Hamilton's rule".

      This is correct.

      But immediately it's clear that this isn't really resolving the debate, what this paper is actually doing is asserting the correctness of the minority view (i.e. that Hamilton's rule as it currently stands is not a general result)

      It seems to me that the reason why this is “immediately clear” to this reviewer is that the reviewer has not processed the contents of the paper. I am not sure if I have to repeat this, but I am not saying that “Hamilton’s rule as it currently stands” is not general (in the sense that it always gets the direction of selection right). It is, and I say that it is a bunch of times. But so are other rules.

      and then attempting to build a more general form of Hamilton's rule upon that shaky foundation.

      I am not just “attempting to build a more general form of Hamilton's rule”. I did in fact build a more general form of Hamilton’s rule (where the generality refers to the richer set of reference statistical models).

      Predictably, the paper erroneously interprets the standard formulation of Hamilton's rule as a linear approximation and develops non-linear extensions to improve the goodness of fit for a result that is already exactly correct.

      Nowhere in the paper or the appendices do I describe the standard formulation of Hamilton’s rule (or, for that matter, any formulation of Hamilton’s rule) as an “approximation”. It is just not a word that has anything to do with this. If we are doing statistical inference, and the sum of squared errors that is minimized decreases by adding a variable in the statistical model with regard to which the sum of squared errors is minimized, then that will typically improve the goodness of fit. In statistics, this is not described as an improvement in how well the statistical model “approximates” the data, or whatever it is that the reviewer would suggest is being approximated here.

      This is not a convincing contribution. It will not change minds or improve understanding of the topic.

      There is indeed plenty of scope for this not to change minds or improve understanding of the topic. It will not change the minds or improve the understanding of those that are not really interested in getting this right. Obviously, it will also not convince those that do not read it.

      Nor is it particularly novel. Smith et al (2010, "A generalisation of Hamilton's rule for the evolution of microbial cooperation" Science 328, 1700-1703) similarly interpreted Hamilton's rule as a linear model and provided a corresponding polynomial expansion - usefully fitting the model to microbial data so as to learn something about the costs and benefits of cooperation in an empirical setting. It's odd that this paper isn't cited here.

      Let me begin by pointing to what I agree with. Given that smith et al. (2010) and my manuscript are both in the business of generalizing Hamilton’s rule, it would be helpful to the reader if my paper includes more information about how the two efforts relate. I will discuss the relation below, and I will also include that in Appendix B, and point to it in the main text. Before I do, however, I would like to point to two details in the review report that fit a pattern.

      The first is that the reviewer describes what smith et al. (2010) do as “useful”, and seems to think of fitting polynomial expansions as a legitimate way to “learn something about the costs and benefits of cooperation in an empirical setting”. That sounds quite positive. My paper, in which I supposedly repeat this, however, is characterized as misguided. This fits a pattern; all of the reviews I received from the inclusive fitness community include a “done before”, and regularly the done before is described approvingly, while my paper is described as fundamentally flawed.

      Also customary is the lack of detail. What would be really useful here, is something like “equation A.14 in this manuscript is the same as equation 6 in smith et al. (2010) if we choose […]”. This kind of statement would pin down the way in which what I do has been done before. That, however, would require going into detail, at the risk of finding out that what is done in my manuscript is actually quite different from what happens in smith et al. (2010). That is also a recurrent thing. When I look up the done before, I typically find something that is not quite the same.

      Now on to the paper. What smith et al. (2010) try to do is something that I wholeheartedly support. It is an empirical study that tries to capture non-linearity. A first point of order is that it is worth asking ourselves: linear or non-linear in what? For that, I would like to go back to the setup of my manuscript. Model 2 from the Main Text is

      In this fitness function, 𝑝<sub>𝑖</sub> is the p-score of individual 𝑖 and 𝑞<sub>𝑖</sub> is the p-score of the partner that individual 𝑖 is matched with. This is a standard model of social behaviour if 𝛽<sub>1,0</sub> < 0 and 𝛽<sub>0,1</sub> > 0. Such choices for 𝛽<sub>1,0</sub> and 𝛽<sub>0,1</sub> indicate that having a higher p-score decreases the fitness of individual 𝑖 and increases the fitness of its partner. Here we assume that 𝛼 = 1, 𝛽<sub>1,0</sub> = −1, and 𝛽<sub>0,1</sub> = 2. We assume that p-scores can only be 0 or 1, or, in other words, we assume that there are only cooperators and defectors in the population (or, in terms of smith et al., 2010: cooperators and cheaters).
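For reference, the description above is consistent with a linear fitness function of the following form. This is a reconstruction from the parameters named in the text, not a quotation of the main text, and it leaves out any noise term the original may include.

```latex
w_i = \alpha + \beta_{1,0}\, p_i + \beta_{0,1}\, q_i ,
\qquad\text{so that, with } \alpha = 1,\ \beta_{1,0} = -1,\ \beta_{0,1} = 2:\quad
w_i = 1 - p_i + 2\, q_i .
```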

      For a well-mixed population, where the likelihood of being matched with a cooperator is the same for cooperators and defectors (it is equal to the frequency of cooperators for both), we can now plot the fitnesses of cooperators (red) and defectors (blue) as a function of the frequency of cooperators (Appendix 1-figure 6 left).

      We can do the same for a population with relatedness where the probability of being matched with a cooperator is 𝑟 + (1 − 𝑟)𝑓<sub>c</sub> for cooperators, and (1 − 𝑟)𝑓<sub>c</sub> for defectors, where 𝑓<sub>c</sub> is the frequency of cooperators (Appendix 1-figure 6 right). For relatedness 𝑟 = 0 and 𝑟 = "7, cooperation is selected against at every frequency.

      Increasing relatedness further, we would find that for 𝑟 = 1/2 the lines coincide, which implies that at every frequency, cooperation is neither selected for nor against. For 𝑟 > 1/2, cooperation will be selected for at every frequency. This pattern implies that, as we have seen in the manuscript, the classical Hamilton’s rule works perfectly fine for Model 2; with 𝑐 = −𝛽<sub>1,0</sub> = 1 and 𝑏 = 𝛽<sub>0,1</sub> = 2, cooperation is selected for if and only if 𝑟𝑏 > 𝑐. The fitnesses of cooperators and defectors as functions of the frequency of cooperators, moreover, are always parallel lines, regardless of relatedness.
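The parallel lines can be reproduced with a short calculation. The sketch below assumes that Model 2 is the linear fitness function with the parameter values given above, and that the probability of being matched with a cooperator is 𝑟 + (1 − 𝑟)𝑓<sub>c</sub> for cooperators and (1 − 𝑟)𝑓<sub>c</sub> for defectors. The function name and the relatedness values in the loop are illustrative assumptions.

```python
import numpy as np


def model2_fitnesses(f_c, r, alpha=1.0, beta_10=-1.0, beta_01=2.0):
    """Expected fitness of cooperators and defectors under Model 2."""
    # Assumed matching probabilities under relatedness r.
    prob_coop_partner_for_c = r + (1.0 - r) * f_c
    prob_coop_partner_for_d = (1.0 - r) * f_c
    w_coop = alpha + beta_10 + beta_01 * prob_coop_partner_for_c
    w_def = alpha + beta_01 * prob_coop_partner_for_d
    return w_coop, w_def


f_c = np.linspace(0.0, 1.0, 11)
gaps = {}
for r in (0.0, 0.25, 0.5, 0.75):
    w_coop, w_def = model2_fitnesses(f_c, r)
    # The gap is the same at every frequency: the two lines are parallel.
    gaps[r] = w_coop - w_def
    print(r, gaps[r][0])
```

Under these assumptions the gap equals 2𝑟 − 1 at every frequency, so its sign flips exactly where 𝑟𝑏 = 𝑐.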

      Model 3 in the main text extends Model 2 by adding an interaction term:

      Now we choose 𝛼 = 1, 𝛽<sub>1,0</sub> = −1, 𝛽<sub>0,1</sub> = 1, and 𝛽<sub>1,1</sub> = 1. We again draw the fitnesses of cooperators and defectors, both at relatedness 𝑟 = 0 (Appendix 1-figure 7 left) and at relatedness 𝑟 = (Appendix 1-figure 7 right). In the manuscript, I argue that the appropriate version of Hamilton’s rule here is Queller’s rule: 𝑟<sub>0,1</sub>𝑏<sub>0,1</sub> + 𝑟<sub>1,1</sub>𝑏<sub>1,1</sub> > 𝑐 with 𝑐 = −𝛽<sub>1,0</sub> = 1, 𝑏<sub>0,1</sub> = 𝛽<sub>0,1</sub> = 1, and 𝑏<sub>1,1</sub> = 𝛽<sub>1,1</sub> = 1. The fitnesses of cooperators and defectors as functions of the frequency of cooperators are still straight lines, but they are no longer parallel.

      The first thing to observe, therefore, is that a model with synergy, in which the classic version of Hamilton’s rule would be misspecified, and Queller’s rule would be well-specified, does not require the fitnesses as functions of the frequencies of cooperators to be non-linear. All that changes with the addition of the interaction term, is that they stop being parallel.
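A sketch of the same calculation for the model with synergy is given here. It assumes that Model 3 adds a term 𝛽<sub>1,1</sub>𝑝<sub>𝑖</sub>𝑞<sub>𝑖</sub> to the linear fitness function, with 𝛼 = 1, 𝛽<sub>1,0</sub> = −1, 𝛽<sub>0,1</sub> = 1 and 𝛽<sub>1,1</sub> = 1, and the same assumed matching probabilities as for Model 2. The function name is hypothetical.

```python
import numpy as np


def model3_fitnesses(f_c, r, alpha=1.0, beta_10=-1.0, beta_01=1.0, beta_11=1.0):
    """Expected fitness of cooperators and defectors with an interaction term."""
    # Assumed matching probabilities under relatedness r.
    prob_coop_partner_for_c = r + (1.0 - r) * f_c
    prob_coop_partner_for_d = (1.0 - r) * f_c
    # For cooperators (p = 1) the interaction term p*q is active.
    w_coop = alpha + beta_10 + (beta_01 + beta_11) * prob_coop_partner_for_c
    # For defectors (p = 0) only the partner's direct effect remains.
    w_def = alpha + beta_01 * prob_coop_partner_for_d
    return w_coop, w_def


f_c = np.linspace(0.0, 1.0, 11)
w_coop, w_def = model3_fitnesses(f_c, r=0.0)

# Slopes between neighbouring frequencies: constant along each line,
# but different between the two lines.
slope_coop = np.diff(w_coop) / np.diff(f_c)
slope_def = np.diff(w_def) / np.diff(f_c)
print(slope_coop[0], slope_def[0])
```

Both fitness functions remain straight lines in the frequency of cooperators; what the interaction term changes is only that the slopes differ.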

      The paper by smith et al. (2010) is an effort to capture non-linearities in the way fitnesses depend on the frequency of cooperators. That, therefore, goes beyond the step from Model 2 to Model 3. Whether it uses the right method to capture those non-linearities, we will come back to in a second, but it is important to realize that also without these non-linearities, the classic version of Hamilton’s rule can be too limiting to accurately describe selection. (Here, I should add that this implies that we were wrong in Wu et al. (2013), when we suggested that “for this experiment, it seems unnecessary to use the generalized Hamilton’s rule, if instead the Malthusian fitness is adopted. In other words, the Wrightian fitness approach calls for a generalization of Hamilton’s rule, whereas the Malthusian fitness approach does not (or at least not in a drastic way, as Malthusian fitnesses are almost linear in the frequency of cooperators).” Using Malthusian fitnesses, the functions were close to linear, but not close to parallel, and therefore also here, Hamilton’s rule needs generalizing, albeit in a different way than smith et al. (2010) did.)

      The cooperation that is observed in the Myxococcus xanthus studied by smith et al. (2010) is not a good match with a model where individuals are matched in pairs for an interaction that determines their fitnesses. These microbes cooperate in large groups, and a better match would therefore be the n-player public goods games studied in van Veelen (2018). There, we see that simple, straightforward ways to describe synergies (or anti-synergies) can easily lead to fitnesses not being linear in the frequency of cooperators.

      The way smith et al. (2010) try to capture those non-linearities, however, is not free of complications. We addressed those in Wu et al. (2013), and I summarized them briefly in van Veelen (2018). One of the issues is that most of the non-linearity smith et al. (2010) pick up is the result of considering Wrightian fitness rather than Malthusian fitness. In a continuous time model with a constant growth rate, the population size at time 𝑡 is 𝑁(𝑡) = 𝑒<sup>mt</sup>𝑁(0), where 𝑚 is the Malthusian fitness. In a discrete time model with a constant average number of offspring per individual, the population at time 𝑡 is 𝑁(𝑡) = 𝑤<sup>t</sup>𝑁(0), where 𝑤 is the Wrightian fitness. If we take 𝑚 = ln 𝑤, these are the same, and if 𝑤 is close to 1, then 𝑚 can be approximated by 𝑤 − 1. That also implies that if 𝑤 is close to 1 (or, equivalently, if 𝑚 is close to 0) one is locally linear if the other is too. However, in the experiment by smith et al. (2010) the aggregate fitness effects are not small, and what is highly nonlinear in terms of Wrightian fitness is close to linear in Malthusian fitness.
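The relation between the two fitness measures can be illustrated numerically. The fitness values below are hypothetical and chosen only to contrast small and large fitness effects; they are not taken from the experiment.

```python
import numpy as np

# Hypothetical Wrightian fitnesses, from close to 1 to very large.
w = np.array([0.9, 1.0, 1.1, 2.0, 10.0, 100.0])

m = np.log(w)             # Malthusian fitness, m = ln w
linear_approx = w - 1.0   # first-order approximation of m around w = 1
abs_error = np.abs(m - linear_approx)

# With m = ln w, both growth laws give the same population size.
t = 5
growth_continuous = np.exp(m * t)   # N(t) / N(0) in continuous time
growth_discrete = w**t              # N(t) / N(0) in discrete time

for w_value, m_value, err in zip(w, m, abs_error):
    print(f"w = {w_value:6.1f}   m = {m_value:7.4f}   |m - (w - 1)| = {err:8.4f}")
```

The approximation error is negligible near 𝑤 = 1 and becomes very large once fitness effects are large, which is the regime in which the two measures can give very different impressions of linearity.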

      Another complication is that the Taylor coefficients that smith et al. (2010) find result from a combination of the data and the functional form that they first fit to their data. That means that a different functional form would have given different Taylor coefficients, while the in-between transformation can also be skipped. Also, the number of Taylor coefficients is larger than the dimensionality of the data, which are based on averages for 6 frequencies. For more details on these complications, I would like to refer to Wu et al. (2013) and van Veelen (2018). A nice detail is that if we consider the way the fitnesses of cooperators and defectors compare when using Malthusian fitnesses, then a comparison of the slopes actually suggests anti-synergies, which leads to a stable mix of cooperators and cheaters, already in the absence of population structure. This matches what is suggested by Archetti and Scheuring (2011, 2012) and Archetti (2018).

      Besides these technical complications, smith et al. (2010) is also different, in the sense that it is an empirical paper. It does not contain the Generalized Price equation, it contains no insights regarding how to derive population genetic dynamics from the Generalized Price equation, or how to derive the appropriate rules from those, and it has a very different approach to separating fitness effects and population structure.

      To end on a positive note, I would like to quote a bit out of Wu et al. (2013):

      “While we criticise these mathematical issues, we are convinced that smith et al. (2010) aim into the right direction: to incorporate the nonlinearities characteristic of biology into social evolution, we may have to extend and generalize the approach of inclusive fitness. It would be beautiful if such a generalization would ultimately include Hamilton’s original rule as a special case […].”

      I like to think that this is exactly what I have done in this paper.

      References

      Akdeniz, A., & van Veelen, M. (2020). The cancellation effect at the group level. Evolution, 74(7), 1246–1254. doi: 10.1111/evo.13995

      Allen, B., & Tarnita, C. E. (2012). Measures of success in a class of evolutionary models with fixed population size and structure. Journal of Mathematical Biology, 68, 109–143. doi: 10.1007/s00285-012-0622-x

      Archetti, M. (2018). How to Analyze Models of Nonlinear Public Goods. Games, 9(2), 17. doi: 10.3390/g9020017

      Archetti, M., & Scheuring, I. (2011). Coexistence of cooperation and defection in public goods games. Evolution, 65(4), 1140–1148. doi: 10.1111/j.1558-5646.2010.01185.x

      Archetti, M., & Scheuring, I. (2012). Review: Game theory of public goods in one-shot social dilemmas without assortment. Journal of Theoretical Biology, 299, 9–20. doi: 10.1016/j.jtbi.2011.06.018

      Bourke, A. F. G. (2014). Hamilton’s rule and the causes of social evolution. Philosophical Transactions of the Royal Society B: Biological Sciences, 369(1642), 20130362. doi: 10.1098/rstb.2013.0362

      Bourrat, P., Godsoe, W., Pillai, P., Gouhier, T. C., Ulrich, W., Gotelli, N. J., & van Veelen, M. (2023). What is the price of using the Price equation in ecology? Oikos, 2023(8). doi: 10.1111/oik.10024

      Crow, J. F. (2008). Commentary: Haldane and beanbag genetics. International Journal of Epidemiology, 37(3), 442–445. doi: 10.1093/ije/dyn048

      Fisher, R. (1930). The genetical theory of natural selection. Retrieved from https://www.cabidigitallibrary.org/doi/full/10.5555/19601600934

      Fletcher, J. A., & Zwick, M. (2006). Unifying the theories of inclusive fitness and reciprocal altruism. American Naturalist, 168(2), 252–262. doi: 10.1086/506529

      Frank, S. A. (1997). The Price equation, Fisher’s fundamental theorem, kin selection, and causal analysis. Evolution, 51(6), 1712–1729. doi: 10.1111/j.1558-5646.1997.tb05096.x

      Frank, S. A. (1998). Foundations of social evolution. Princeton: Princeton University Press.

      Frank, S. A. (2012). Natural selection. IV. The Price equation. Journal of Evolutionary Biology, 25(6), 1002–1019. doi: 10.1111/j.1420-9101.2012.02498.x

      Gardner, A., West, S. A., & Wild, G. (2011). The genetical theory of kin selection. Journal of Evolutionary Biology, 24(5), 1020–1043. doi: 10.1111/j.1420-9101.2011.02236.x

      Grafen, A. (1985a). A geometric view of relatedness. Oxford Surveys in Evolutionary Biology, 2(2), 28-89.

      Grafen, A. (1985b). News and Views. Evolutionary theory: Hamilton’s rule OK. Nature, 318(6044), 310–311. doi: 10.1038/318310a0

      Hamilton, W. D. (1964). The genetical evolution of social behaviour. I. Journal of Theoretical Biology, 7(1), 1–16. doi: 10.1016/0022-5193(64)90038-4

      Karlin, S., & Matessi, C. (1983). The eleventh R. A. Fisher Memorial Lecture - Kin selection and altruism. Proceedings of the Royal Society of London. Series B. Biological Sciences, 219(1216), 327–353. doi: 10.1098/rspb.1983.0077

      Matessi, C., & Karlin, S. (1984). On the evolution of altruism by kin selection. Proceedings of the National Academy of Sciences, 81(6), 1754–1758. doi: 10.1073/pnas.81.6.1754

      Nowak, M. A., Tarnita, C. E., & Wilson, E. O. (2010). The evolution of eusociality. Nature, 466(7310), 1057–1062. doi: 10.1038/nature09205

      Okasha, S. (2005). Maynard Smith on the levels of selection question. Biology and Philosophy, 20(5), 989–1010. doi: 10.1007/s10539-005-9019-1

      Page, K. M., & Nowak, M. A. (2002). Unifying evolutionary dynamics. Journal of Theoretical Biology, 219(1). doi: 10.1016/S0022-5193(02)93112-7

      Pillai, P., & Gouhier, T. C. (2019). Not even wrong: the spurious measurement of biodiversity’s effects on ecosystem functioning. Ecology, 100(7), e02645. doi: 10.1002/ecy.2645

      Price, G. R. (1970). Selection and Covariance. Nature, 227(5257), 520–521. doi: 10.1038/227520a0

      Price, G. R. (1972). Extension of covariance selection mathematics. Annals of Human Genetics, 35(4), 485-490.

      Queller, D. C. (1985). Kinship, reciprocity and synergism in the evolution of social behaviour. Nature, 318(6044), 366–367. doi: 10.1038/318366a0

      Queller, D. C. (1992a). A general model for kin selection. Evolution, 46(2), 376–380. doi: 10.1111/j.1558-5646.1992.tb02045.x

      Queller, D. C. (1992b). Quantitative Genetics, Inclusive Fitness, and Group Selection. The American Naturalist, 139(3), 540–558. doi: 10.1086/285343

      Queller, D. C. (2011). Expanded social fitness and Hamilton’s rule for kin, kith, and kind. Proceedings of the National Academy of Sciences, 108(supplement_2), 10792–10799. doi: 10.1073/pnas.1100298108

      Rousset, F., & Billiard, S. (2000). A theoretical basis for measures of kin selection in subdivided populations: Finite populations and localized dispersal. Journal of Evolutionary Biology, 13(5). doi: 10.1046/j.1420-9101.2000.00219.x

      Rousset, F. (2015). Regression, least squares, and the general version of inclusive fitness. Evolution, 69(11), 2963–2970. doi: 10.1111/evo.12791

      Smith, J., Van Dyken, J. D., & Zee, P. C. (2010). A generalization of Hamilton’s rule for the evolution of microbial cooperation. Science, 328(5986), 1700–1703. doi: 10.1126/science.1189675

      Sober, E., & Wilson, D. S. (2007). Unto others: The evolution and psychology of unselfish behavior. Retrieved from https://www.hup.harvard.edu/books/9780674930476

      Taylor, P. D. (1992). Altruism in viscous populations - an inclusive fitness model. Evolutionary Ecology, 6(4), 352–356. doi: 10.1007/bf02270971

      Taylor, P. D. (1989). Evolutionary stability in one-parameter models under weak selection. Theoretical Population Biology, 36(2), 125–143. doi: 10.1016/0040-5809(89)90025-7

      Taylor, P. D., Day, T., & Wild, G. (2007). Evolution of cooperation in a finite homogeneous graph. Nature, 447(7143), 469–472. doi: 10.1038/nature05784

      Van Cleve, J. (2015). Social evolution and genetic interactions in the short and long term. Theoretical Population Biology, 103. doi: 10.1016/j.tpb.2015.05.002

      van Veelen, M. (2005). On the use of the Price equation. Journal of Theoretical Biology, 237(4). doi: 10.1016/j.jtbi.2005.04.026

      van Veelen, M. (2007). Hamilton’s missing link. Journal of Theoretical Biology, 246(3). doi: 10.1016/j.jtbi.2007.01.001

      van Veelen, M. (2011). The replicator dynamics with n players and population structure. Journal of Theoretical Biology, 276(1). doi: 10.1016/j.jtbi.2011.01.044

      van Veelen, M. (2018). Can Hamilton’s rule be violated? eLife, 7. doi: 10.7554/eLife.41901

      van Veelen, M. (2020). The problem with the Price equation. Philosophical Transactions of the Royal Society B: Biological Sciences, 375(1797), 20190355. doi: 10.1098/rstb.2019.0355

      van Veelen, M., Allen, B., Hoffman, M., Simon, B., & Veller, C. (2017). Hamilton’s rule. Journal of Theoretical Biology, 414. doi: 10.1016/j.jtbi.2016.08.019

      van Veelen, M., García, J., Sabelis, M. W., & Egas, M. (2012). Group selection and inclusive fitness are not equivalent; the Price equation vs. models and statistics. Journal of Theoretical Biology, 299. doi: 10.1016/j.jtbi.2011.07.025

      Wilson, D. S., Pollock, G. B., & Dugatkin, L. A. (1992). Can altruism evolve in purely viscous populations? Evolutionary Ecology, 6(4), 331–341. doi: 10.1007/bf02270969

      Wu, B., Gokhale, C. S., van Veelen, M., Wang, L., & Traulsen, A. (2013). Interpretations arising from Wrightian and Malthusian fitness under strong frequency dependent selection. Ecology and Evolution, 3(5). doi: 10.1002/ece3.500

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      (1) Figure 2 and related text: it would be useful to explain more explicitly what is meant by "neurogenic" and "non-neurogenic" models. I presume that the total number of neurons in non-neurogenic models is lower than in neurogenic models because no new neurons are added. It would be useful to plot the number of GCs as a function of timesteps.

      We have clarified the distinction between neurogenic and non-neurogenic models in the text (Lines 142-145), explicitly noting that in non-neurogenic models, no new GCs are added, resulting in a lower total neuron count over time. In response to the reviewer’s suggestion, we generated a plot showing the number of GCs over time (see below). Because the neurogenic model exhibits a simple linear increase, we found this plot not especially informative for inclusion in the manuscript. However, we agree with the reviewer’s later comments that similar plots are useful for interpreting specific results, and we have included those where appropriate.

      Author response image 1.

      Number of GCs over time for neurogenic (solid line) and non-neurogenic (dotted line) networks

      (2) Figure 2F, G: memory declines dramatically when the number of GCs at enrichment onset increases beyond an optimum. Why?

      We have explained the reasoning more thoroughly in the text (Lines 174-177) and added a new supplemental figure to support this reasoning (Figure S2). As the number of GCs increases, the network becomes overly inhibited and the response of abGCs to the stimuli decreases (Fig S2A). This leads to a smaller population of GCs being able to integrate with the stimulus (Fig S2B), which is expected given the activity-dependent plasticity rule. Moreover, Fig S2C shows that as network size increases, the GCs that do learn connect only to the MCs that are most strongly driven by the stimuli, until they struggle to connect to any MCs at all.

      In principle, a homeostatic mechanism like synaptic scaling could reduce activity to restore balance, but such a mechanism would also likely disrupt existing memories. Alternatively, we suggest activity-dependent apoptosis as a superior homeostatic mechanism because it leads to a stable level of activity without substantially erasing existing memories.

      (3) The paragraph describing synaptic connectivity of abGCs (related to Figure 2H) is confusing. What is the directionality of synapses considered here: mitral-to-granule, or granule-to-mitral? The text is opaque here. Connectivity matrix in Figure 2H: who is presynaptic, who is postsynaptic? If I understand correctly, these questions are actually irrelevant because all mitral-granule synapses in the network are reciprocal. This should be pointed out explicitly in the figure legend. Generally: the fact that the network is fully reciprocal (if I understand correctly) is very important but not stated with sufficient emphasis. It should be stated very explicitly in the text that connectivity matrices are fully reciprocal, and an equation clarifying this point should be included in Methods.

      (6) Connectivity matrix: to what degree was connectivity between mitral and granule cells reciprocal (fraction of connections in either direction that were paired with a connection in the opposite direction between the same cell pair)? Was connectivity shaped by experience (enrichment) reciprocal?

      (7) Directly related to the above: it would be useful to show the disynaptic connectivity matrix between mitral cells and analyze its symmetry. For the symmetric component, it should then be analyzed what fraction of this can be attributed to the reciprocal synapses, and what fraction is contributed by connectivity via different granule cells. This should then be compared to models with biologically realistic fractions of reciprocal connections. Is the model proposed here consistent with a biologically realistic fraction of reciprocal synapses between mitral-granule cell pairs?

      We appreciate these insightful and detailed comments. We agree that the assumption that MC-GC synapses were fully reciprocal was not clearly stated. We now explicitly state this in the main text (lines 90-94, 369-370, Figure 2 caption) and methods (line 561), and emphasize its importance. As the reviewer points out, this is a simplifying assumption and does not fully reflect the biology because not all synapses are reciprocal in the true system. We also note that our synaptic plasticity model does not break the reciprocity assumption: all connections added or pruned during learning remain reciprocal. As a result, the disynaptic connectivity matrix (Bottom panel below, MCs sorted by stimulus as shown in the top panel) is always symmetric.

      We have now made these statements explicit in the main text and in the methods. Regarding functional consequences of this assumption, earlier work by our group has examined the impact of the degree of reciprocity of MC-GC synapses in a similar OB model (Chow, Wick & Riecke, Plos Comp Bio 2012). The study examined three different changes in reciprocity by (1) redirecting a fraction of the inhibitory connections of each GC to randomly chosen MCs instead of the MCs that drive that GC, (2) allowing heterogeneity in reciprocal weights so that there is no relationship between the strength of the MC -> GC synapse and the GC -> MC synapse, (3) reducing the level of self-inhibition a MC receives from the GCs that it excites. The model was found to be quite robust to each of these manipulations, suggesting that our present model likely remains functionally relevant even if biological reciprocity is partial. We reference this work now in the discussion, lines 490-492.

      Author response image 2.

      Disynaptic connectivity. Top: MC activity in response to the two stimuli, sorted by MC selectivity. Bottom: Disynaptic connectivity matrix (diagonal subtracted).
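
      The symmetry argument above can be illustrated with a minimal sketch. It is not the authors' model code: the network size, connection probability, and the function names `disynaptic` and `is_symmetric` are assumptions made for illustration. The sketch treats the MC-to-MC disynaptic matrix as the sum over GCs of (GC-to-MC weight) times (MC-to-GC weight), so that full reciprocity (identical matrices in both directions) yields a symmetric result, whereas rewiring an inhibitory connection generally does not.

```python
import random

def disynaptic(exc, inh):
    """MC-by-MC disynaptic matrix.

    exc[j][k]: weight of the MC j -> GC k synapse (excitatory).
    inh[i][k]: weight of the GC k -> MC i synapse (inhibitory).
    Entry [i][j] sums, over all GCs k, the path MC j -> GC k -> MC i.
    """
    n_mc, n_gc = len(exc), len(exc[0])
    return [[sum(inh[i][k] * exc[j][k] for k in range(n_gc))
             for j in range(n_mc)]
            for i in range(n_mc)]

def is_symmetric(m):
    n = len(m)
    return all(m[i][j] == m[j][i] for i in range(n) for j in range(n))

# Hypothetical network: 6 MCs, 10 GCs, binary connections with probability 0.3.
rng = random.Random(0)
w = [[1 if rng.random() < 0.3 else 0 for _ in range(10)] for _ in range(6)]

# Fully reciprocal case: the same matrix is used in both directions.
d_recip = disynaptic(w, w)

# Small non-reciprocal counterexample: one extra GC -> MC connection.
exc_small = [[1, 0], [0, 1]]
inh_small = [[1, 1], [0, 1]]
d_rewired = disynaptic(exc_small, inh_small)

print(is_symmetric(d_recip), is_symmetric(d_rewired))
```

      Because the plasticity rule described above adds and prunes connections only in reciprocal pairs, the reciprocal case is the one relevant to the present model; the counterexample only shows why the symmetry would not be guaranteed otherwise.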

      (4) How were mitral cells sorted in Figure 2H? This needs to be explained.

      (5) Directly related to the point above: the text mentions that synaptic connectivity between GCs of the "learning cluster" and mitral cells (which direction?) is increased for mitral cells responding to enrichment odors, but this is not shown in the figure. This statement suggests that mitral cells sorted to the bottom of the y-axis respond more strongly to enrichment odors, but the information is not given directly. Please provide more information to back up your statements.

      Indeed as the reviewer inferred, MCs in Figure 2H were sorted so that those that receive the strongest stimulation from the odor were at the bottom of the y-axis. We have clarified this in the Figure 2 caption and added a subplot to Figure 2H showing the average MC input to make this more explicit.

      (8) Apoptosis (Figure 4 and related text): paragraph 231ff is somewhat difficult to comprehend because the "number" of enrichments should really be the "frequency" of enrichments. In Figure 4, it is not mentioned explicitly that each enrichment is with different random new odors.

      We agree that the term “number” of enrichments was imprecise and have revised the text to refer instead to the frequency of enrichment events (Lines 255-267). We also clarified that in Figure 4, each enrichment corresponds to a different set of randomly sampled odors, and we now state this explicitly in both the Figure 4 legend and main text (Lines 260-261).

      (9) Apoptosis: apoptosis improves memory but the underlying reason remains opaque. A simple prediction of the data in Figure 4D and 4E is that the number of GCs in 4D is lower than in 4E. It would be helpful to show this. Furthermore, an obvious question that arises is whether a higher frequency of enrichments improves memories because the total number of granule cells is kept low, or because granule cells are removed specifically based on their activity (or both). This could be addressed easily by artificially removing a random subset of granule cells in a simulation such as 4E to match granule cell numbers to the case in 4D.

      Apoptosis improves learning because it reduces the total inhibition in the network by removing GCs and thus prevents the deficits in learning that occur in Fig. 2G as GCs accumulate in the network. As the reviewer inferred, the number of GCs in Figure 4D is lower than in 4E and this is now clarified in the text. This difference was shown implicitly in Supplementary Figure S4D (previously S3D), but we now explicitly reference this plot to support this point as well (Line 266).

      As the reviewer notes, there is a question of whether increased enrichment frequency improves memory because it limits the total number of GCs, or because apoptosis selectively removes GCs based on their activity, or both. Our model supports both mechanisms. Importantly, simply reducing GC numbers through random deletion will degrade existing memories: random removal erodes memory representations encoded by those GCs. In contrast, our age- and activity-dependent apoptosis rule targets a specific cohort of adult-born GCs. This selective removal minimizes damage to existing memories encoded by GCs outside of this cohort while keeping GC numbers within a regime that supports robust learning (as shown in Figure 2G).

      However, we note that if enrichment frequency becomes too high, even recent memories can be lost due to premature pruning of GCs that have not yet stabilized their synaptic connections. This tradeoff has been shown experimentally (Forest et al., Nat Comm 2019), and we reproduce it in our model (Figure S4).

      (10) Text related to Figure 5: "Learning flexibility...approached a steady state when the growth of the network started to saturate". Please show the growth (better: size) of the network (total number of GCs) for these simulations (and other panels in Figure 5). It would also be useful to show the total number of GCs in other figures (e.g. Figure 4; see above).

      We have now added a supplementary figure (Figure S6) that shows the total number of GCs over time for the simulations presented. This confirms that the network size approaches a steady state around the same time that learning flexibility begins to plateau, as noted in the original text (now line 275), and highlights the large number of GCs without apoptosis as well as the slightly reduced number of GCs in the permanent encoding model (line 312).

      (11) As much as I appreciate the comprehensive discussion of the results in a broader context, I feel that the discussion can be somewhat shortened. The section on lateral inhibition is not fully valid given that synaptic connectivity is reciprocal. I also feel that much of the final section (Model assumptions and outlook) can be dropped (except for the last paragraph), not because anything is irrelevant, but because these points have been made, often repeatedly, in the text above.

      We agree that the discussion could be streamlined and have revised the manuscript accordingly. Specifically, we have shortened the section on lateral inhibition and clarified that the OB relies predominantly on reciprocal connectivity (Line 370). We also agree that parts of the final section were repetitive and have removed these. However, to address comments by Reviewer 3, we also expanded on some of the model assumptions. We thank the reviewer for helping us improve the clarity and focus of the manuscript.

      (12) Figure 5: bolding every 5th curve is confusing.

      We have adjusted our figure accordingly.

      (13) "...we biased the dendritic field...": it would be helpful to explain the idea of a "dendritic field" in a bit more detail prior to this sentence.

      We have now noted, where we initially describe the model (Line 97), that a GC’s "dendritic field" refers to the subset of MCs with which it is capable of forming synaptic connections.

      Reviewer #3:

      (1) The authors find that a network with age-dependent synaptic plasticity outperforms one with constant age-independent plasticity and that having more GC per se is not sufficient to explain this effect. In addition, having an initial higher excitability of GCs leads to increased performance. To what degree the increased excitability of abGCs is conceptually necessarily independent of them having higher synaptic plasticity rates / fast synapses?

      We thank the reviewer for this question, as the difference between excitability and plasticity rate in memory formation is something we intended to highlight in this study. We have updated the text (Lines 157-198) to clarify this.

      At the cellular level, a neuron's excitability and its rate of synaptic plasticity are mechanistically distinct: excitability is governed by factors such as ion channel expression or membrane resistance, whereas plasticity rates are influenced by molecular pathways involved in synapse and dendritic spine formation and remodeling. While these are independent properties, they are functionally coupled: most synaptic plasticity rules are activity-dependent, so greater excitability can increase the likelihood of plasticity being induced but does not itself guarantee learning.

      Our model reflects this distinction. Increased excitability biases which neurons become activated and thus eligible to undergo plasticity, but actual learning still depends on the plasticity rate itself. This can be seen by comparing the model with constant plasticity and excitability (solid blue and green curves in Figure 2C) to the model with only transient excitability (solid blue and green lines in Figure 2E). In both cases, the strength and duration of the memory remain limited by the plasticity rate.

      We note additionally that, in this network, neurons compete to learn new stimuli: as GCs start to learn, they suppress MC activity through recurrent inhibition, which suppresses learning in other GCs that otherwise would have been in a position to learn the odor. As a result, there is not a significant increase in the overall number of neurons recruited to learn (Figure 2J). In a different network architecture, such as a feedforward network, we would not expect this to be the case; greater excitability in a population of neurons would likely increase the memory by increasing the number of neurons recruited to learn. Transiently enhanced excitability biases which neurons join the memory engram (Figure 2J), but the extent and rate of learning still depend on the plasticity rates themselves. We did note in the original text (now lines 284-286) that this bias in recruitment subtly increases memory stability, but the extent is not great.

      In principle, a model can be engineered to rely on transiently increased excitability to encode memories in orthogonal subpopulations of neurons, and this could resolve the flexibility-stability dilemma. However, in that case, the number of memories that can be stored within a short time would be bounded by the size of this subpopulation: even if a large number of odors are presented, mature GCs cannot become part of the engram, and the network would likely fail to learn the stimuli. When this was tested experimentally (Forest et al. Cereb Cor. 2020), it was found that mature GCs participated in the engram when the number of odors was sufficiently high. Our results are consistent with these experiments: for complex odor environments, neonatal GCs, which are mature during odor exposure, and abGCs both participate in the engrams.

      Author response image 3.

      Simulating learning in more complex odor environments. Top: enrichment consisted of three odor pairs presented sequentially in a random order. Bottom: enrichment consisted of five odor pairs. Left: discriminability of the odor pairs over time. Middle: connectivity between MCs (sorted by odor selectivity) and GCs (sorted by age). In both cases abGCs develop a clear connectivity structure. In more complex environments neonatal GCs also start to develop a clear connectivity structure. Right: combined engram membership across all stimuli by GC age.

      In sum, transiently increased excitability alone will not make learning any faster, so a fast learning system must have a high plasticity rate. If this plasticity rate stays high, then memories stored in these neurons, even if no longer highly excitable, will be vulnerable as the neurons can still be driven above their plasticity threshold by moderately interfering stimuli and will thus be quickly forgotten. Conversely, if the reviewer is wondering if a greater increase in the plasticity rate of new neurons can compensate for a lack of excitability, this is not the case: if a newborn neuron is not sufficiently driven by the stimulus it will not learn regardless of how high its plasticity rate is.

      (2) The authors do not mention previous theoretical work on the specificity of mitral to granule cell interactions from several groups (Koulakov & Rinberg - Neuron, 2011; Gilra & Bhalla, PLoSOne, 2015; Grabska-Bawinska...Mainen, Pouget, Latham, Nat. Neurosci. 2017; Tootoonian, Schaefer, Latham, PLoS Comput. Biol., 2022), nor work on the relevance of top-down feedback from the olfactory cortex on the abGC during odor discrimination tasks (Wu & Komiyama, Sci. Adv. 2020), or of top-down regulation from the olfactory cortex on regulating the activity of the mitral/tufted cells in task engaged mice (Lindeman et al., PLoS Comput. Biol., 2024), or in naïve mice that encounter odorants (in the absence of specific context; Boyd, et al., Cell Rep, 2015; Otazu et al., Neuron 2015, Chae et al., Neuron, 2022). In particular, the presence of rich top-down control of granule cell activity (including of abGCs) puts into question the plausibility of one of the opening statements of the authors with respect to relying solely on local circuit mechanisms to solve the flexibility-stability dilemma. I think the discussion of this work is important in order to put into context the idea of specific interactions between the abGCs and the mitral cells.

      We thank the reviewer for these detailed and thorough comments, and whole-heartedly agree that it is important to discuss the listed studies in order to contextualize our work through the broader lens of how information is processed in the OB. We have expanded our discussion to further acknowledge and integrate insight from previous theoretical and experimental work cited by the reviewer. (Lines 361-366, 493-550)

      Regarding the importance of top-down feedback, we of course recognize that in practice cortical inputs play a critical role in abGC survival and synaptic integration. However, its nature is not quite clear and is likely variable across behavioral settings. In the paradigm that we study in the manuscript, there is likely no key reward value or contextual signal that is relayed to the OB. One plausible interpretation is that in this task, cortical feedback provides a random, variable baseline excitatory drive to GCs. This would likely be consistent with many of the listed studies, e.g.

      (1) Glomerular layer targeting of feedback would be explicitly unrelated to glomerular odor specificity, as in Boyd et al.

      (2) GC activity would decrease if these cortical inputs were silenced, resulting in stronger MC responses as in Otazu et al., Chae et al.

      (3) Silencing PCx during learning would prevent GCs from reaching activity-dependent plasticity thresholds, resulting in decreased spine density as in Wu & Komiyama.

      Likewise activating PCx would lead to increased spine density.

      In this interpretation, the effect of top-down input could be captured implicitly by adjusting model parameters such as activity or plasticity thresholds. For the purposes of our study, we opted to neglect these inputs in favor of model simplicity.

      Critically, even if top-down inputs play a substantially larger role, by perhaps even going as far as providing signals to abGCs to modulate their development, the core solution to the flexibility-stability dilemma that we describe stays local: we predict that the memory persists in the same network in which it was formed.

      (3) To what the degree of specific connectivity reflects a specific stimulus configuration, and is a good proxy for determining the stimulus discriminability and memory capacity in terms of temporal activity patterns (difference in latency/phase with respect to the respiration cycle, etc.) which may account to a substantial fraction of ability to discriminate between stimuli? The authors mention in the discussion that this is, indeed, an upper bound and specific connectivity is necessary for different temporal activity patterns, but a further expansion on this topic would help in understanding the limitations of the model.

      We thank the reviewer for raising this important point. Indeed, there have been several recent experimental studies indicating that much of the information needed for olfactory discrimination is encoded in the temporal activity patterns of mitral and tufted cells. Our model does not explicitly simulate these dynamics. It was for this reason that we defined memory in terms of the learned structure of the network rather than by firing rate activity. This is motivated by the idea that learned patterns of connectivity constrain the space of neural activity the network can support, and thus shape stimulus responses. We now make this limitation more explicit in the discussion and clarify that the specific MC–GC connectivity we analyze should be seen as a structural substrate that constrains the possible temporal transformations the network could support (Lines 492-506).

      (4) Reward or reward prediction error signals are not considered in the model. They however are ubiquitous in nature and likely to be encountered and shape the connectivity and activity patterns of the abGC-mitral cell network. Including a discussion of how the model may be adjusted to incorporate reward/error signals would strengthen the manuscript.

      We appreciate the reviewer’s suggestion and agree that reward and reward prediction error signals are critical components of many learning paradigms. We deliberately chose not to model associative learning, reward signals or top-down neuromodulation in this work. Our goal is to investigate the role of adult neurogenesis in a regime where its contribution has been shown to be experimentally necessary. Specifically, we focused on an unsupervised perceptual learning paradigm where adult neurogenesis is required for successful odor discrimination (Moreno et al. PNAS, 2009). In contrast, when the same odors are used in a rewarded learning paradigm, performance remains intact even when adult neurogenesis is ablated (Imayoshi et al., Nat. Neuro., 2008). This dissociation suggests that neurogenesis is dispensable in contexts where reward can guide learning. As such, we argue that isolating the contribution of local circuit dynamics in an unsupervised setting is critical to understanding what neurogenesis is uniquely enabling, especially given the evolutionary cost of maintaining it.

      We agree that extending this work to incorporate reward-driven plasticity or neuromodulatory influences would be a valuable direction for future research. In particular, it could help clarify how different learning paradigms engage distinct abGC cohorts (e.g., Mandairon et al., eLife 2018; Wu & Komiyama, Sci. Adv. 2020), and how task structure shapes memory allocation and engram composition. We have incorporated this into the discussion regarding extending our model to include top-down feedback (lines 539-553).

      Specific comments

      (1) Lines 84-86; 507-509; Eq(3): Sensory input is defined by a basal parameter of MCs spontaneous activity (S_spontaneous) and the odor stimuli input (S_i^odor), but it is not clear from the main text or methods how sensory inputs (glomerular patterns) were modeled.

      We now clarify in the Methods section "Stimulus model" how the sensory inputs were modeled. Specifically, odor-evoked inputs to mitral cells (S_i^odor) were generated either as Gaussian profiles across the mitral cell population (Figs. 2,3) or as sparser random patterns (Figs. 4,5). In Figures 2 and 3, the denser Gaussian stimuli require more GCs to learn the odors, aiding in visualization of the connectivity matrix (Figure 2H) and abGC recruitment plots (Figure 2I,J; Figure 3C,E). However, real olfactory stimuli activate a sparse set of MCs, so in Figures 4 and 5, where we address learning of many stimuli, we utilize sparser, binary stimuli delivered to only 10% of MCs, within the range of experimental data (Wachowiak and Cohen, Neuron, 2001). The fact that the stimuli are binary, however, is not realistic and leads to denser representations. This leads to a worst-case scenario for the model, as denser memory representations are easier to overwrite. These points have been added explicitly to the Methods section "Stimulus model" to improve clarity.
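
      The two stimulus families described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used in the manuscript: the population size, Gaussian center and width, amplitude, and the function names `gaussian_stimulus` and `sparse_binary_stimulus` are hypothetical. Only the 10% active fraction is taken from the text.

```python
import math
import random

def gaussian_stimulus(n_mc, center, width, amplitude=1.0):
    """Dense odor input: a Gaussian profile across the MC population.

    center, width, and amplitude are illustrative parameters only.
    """
    return [amplitude * math.exp(-0.5 * ((i - center) / width) ** 2)
            for i in range(n_mc)]

def sparse_binary_stimulus(n_mc, fraction=0.1, rng=None):
    """Sparse odor input: a binary pattern delivered to a random
    subset of MCs (10% by default, as stated in the text)."""
    rng = rng or random.Random()
    n_active = round(fraction * n_mc)
    active = set(rng.sample(range(n_mc), n_active))
    return [1.0 if i in active else 0.0 for i in range(n_mc)]

# A pair of similar dense odors, modeled here (as an assumption) by
# two Gaussians with slightly shifted centers.
odor_a = gaussian_stimulus(50, center=20, width=4.0)
odor_b = gaussian_stimulus(50, center=23, width=4.0)

# One sparse binary odor over 100 MCs.
odor_sparse = sparse_binary_stimulus(100, fraction=0.1, rng=random.Random(1))
```

      In this sketch the binary pattern has exactly `round(fraction * n_mc)` active MCs, which makes the sparsity level explicit and easy to vary.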

      (2) Lines 118-122: The used perceptual learning task explanation is done only in the context of the discriminability of similar artificial stimuli using the Fisher discriminant and "Memory" metric. A detailed description of the logic of the perceptual learning task methods and objective, taking into account Comment 1, would help to better understand the model.

      We thank the reviewer for pointing out that we had not adequately described the task. We have updated the main text (lines 125-132) and included a new methods section "Perceptual learning task" to describe it more explicitly. The experiments that inspired the simulation followed an ecological model of discrimination learning (Moreno et al. PNAS 2009): for one hour a day over a ten-day "enrichment period", two tea balls containing similar but distinct odors were suspended from the lid of each mouse's home cage. The mice engaged with the stimuli under self-directed conditions, therefore learning through natural experience. As a result the mice use olfactory information to discriminate between the similar stimuli, a skill potentially relevant for navigation or social behaviors.

      In our simulations, we model these experiments as follows. During the enrichment period, the model is stimulated with a randomly selected stimulus chosen from a set of two similar stimuli, corresponding to a mouse choosing to sniff one of the tea balls. During enrichment, in between these bouts of "sniffing", the model only receives spontaneous activity, reflecting the temporal sparsity of sensory input even over the enrichment period. Outside of enrichment, the model again receives only spontaneous input.
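
      The input protocol just described can be sketched as a simple schedule generator. This is a hedged illustration rather than the simulation code itself: the per-step sniffing probability `p_sniff`, the step counts, and the function name `input_schedule` are assumptions introduced here, since the text does not specify how sniffing bouts are timed.

```python
import random

def input_schedule(n_steps, enrich_start, enrich_end, p_sniff=0.2, seed=0):
    """Label the input at each timestep.

    Inside the enrichment window [enrich_start, enrich_end), each step is
    a "sniff" of odor A or B with probability p_sniff (hypothetical value),
    chosen at random between the two similar odors. All other steps,
    inside or outside enrichment, receive only spontaneous input.
    """
    rng = random.Random(seed)
    schedule = []
    for t in range(n_steps):
        in_enrichment = enrich_start <= t < enrich_end
        if in_enrichment and rng.random() < p_sniff:
            schedule.append(rng.choice(("A", "B")))
        else:
            schedule.append("spontaneous")
    return schedule

# Hypothetical run: 100 steps with enrichment from step 20 to step 59.
sched = input_schedule(100, enrich_start=20, enrich_end=60, p_sniff=0.3, seed=1)
```

      Each label would then be mapped to the corresponding input vector (spontaneous activity alone, or spontaneous activity plus the odor-evoked input) before being passed to the network.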

      (3) Rapid re-learning of forgotten odor pair is enabled by sensory-dependent dendritic elaboration of neurons that initially encoded the odors and the observed re-learning would occur even if neurogenesis was blocked following the first enrichment and even though the initial learning did require neurogenesis. When this would ever occur in nature? The re-learning of an odor period? Why is this highlighted in the study?

      We believe that this sort of learning is certainly relevant in nature. To clarify: by “learning,” we do not refer to the memory of an entire “odor period”, but simply an altered mapping of specific stimuli. Therefore, forgetting could occur if these specific stimuli are absent from the environment for a period of time, and re-learning would occur when these stimuli are re-encountered. Natural odor environments are highly dynamic, as environmental conditions and social contexts change over time. The odors an animal encounters also depend strongly on its own behavior; as it explores different environments, it may be exposed to particular odors intermittently: it could encounter them in one location, then not return to that location for some time before returning again.

      Such natural variability in odor exposure makes the ability to forget and re-learn especially valuable, allowing the animal to prioritize relevant information while maintaining flexibility. To this end, we show in Figure 5G that the synaptic forgetting of odors is beneficial to the performance of the model because it reduces interference in the network. Re-learning enabled by adult neurogenesis is therefore a highly efficient strategy for memory storage and retrieval, which is why we emphasize it in this study.

      (4) Figure 2A: I understand that the ages shown at the bottom of the colored boxes represent the GC age. If so, find a better way to express that to avoid confusing 'GC ages' from the days shown in the perceptual learning task description (Figure 2B).

      We have updated the text in the figure to disambiguate the two, and now refer to the “days” shown in the perceptual learning task description as “time relative to enrichment”.

      (5) Figure 2B: Clarify how the two-dimensional arrays are arranged to represent the patterns shown. Does each point of the array represent one neuron? If so, are these neurons re-arranged to help the readers visually differentiate patterns A and B? Are the patterns of activity of MCs in the model spatially and temporally sparse as observed in experimental work?

      In Figure 2B, each point in the two-dimensional array represents the activity of a single mitral cell. The layout is purely for visualization—neurons are re-arranged to make the differences between odor patterns A and B visually apparent. This ordering does not reflect anatomical position or model architecture. We revised the Figure 2 caption to say this explicitly.

      Regarding spatial sparseness, as we mentioned in the response to the reviewer’s comment (1), the activity of mitral cells in response to odors is spatially sparse in the model. Regarding temporal sparseness, the model is not spiking and does not include temporal dynamics within the timescale of a breath; however, odor input is delivered in discrete, odor-specific epochs interleaved with periods of no input, which leads to temporally structured activity patterns. This information has been made explicit in the new methods sections "Stimulus model" and "Perceptual learning task".

      (6) Figure 3C and Line 189: potential confusion between the color code mentioned in the legend for the enrichment and developing periods.

      This was an error in the text and has been corrected (Lines 212-213).

      (7) Figure 5F: For clarity, this would benefit from replacing the bold line with areas in the plot to depict the enrichment periods.

      We agree that replacing the bolded line segments with shaded areas is more clear and have updated the figure accordingly, and appreciate the reviewer's suggestion to clarify the figure.

      (8) Lines 380, 416: Potential role of cortical feedback and or neuromodulation depending on behavioral relevance or permanent exposure? Later mentioned in Lines 467 - 474.

      We have updated the text to acknowledge the role of potential cortical feedback and neuromodulation, now in lines 403-407.

    1. Reviewer #2 (Public review):

      In this study, Fontana et al. develop a paradigm for associative conditioning by pairing exposure to an alarm substance with a novel tank. Exposure to conspecific alarm substance (CAS) in the novel tank triggers freezing and what they characterize as evasive swimming behaviour, which is subsequently seen in a re-exposure to the novel tank without the CAS present. Importantly, these states are identified via automated processes, including postural tracking and a random forest classification process, which could be very useful tools for subsequent studies.

      In their experiments, they focus on the differences in behaviour among strains of zebrafish (both males and females), and among individual zebrafish. For males and females of different strains, they find some differences, though the clearest message seems to be that the most robust measure of the behaviour in response to both the CAS and in the memory trials is the freezing behaviour, while evasive behaviour is more variable and not always seen. This may relate to their observation of significant "evasiveness" in vehicle control experiments (discussed further below).

      Moving on to individual variation from within this multi-strain male/female dataset, they first examine transition matrices between states and find that this is not dramatically altered by stimulus exposure. They then use clustering to identify 4 different "classes" of zebrafish that differ in their expression (or not) of two types of behaviour: freezing and/or evasive behaviour. They show that over the three exposure epochs of the experiment, this classification is somewhat stable in an individual fish, though many fish change their behaviour - e.g., evading + freezing -> only freezing.

      In the final set of experiments, the authors move beyond behavioural analyses and perform whole-brain cFos mapping of these individual zebrafish. They perform analyses aimed at identifying correlations between individual behavioural expression and the number of cFos-positive cells in different brain regions. Using partial least squares analysis, they find areas associated with two types of behavioural contrasts, which differ in their weighting of different behavioural expression during the Memory trials. Covariation and network structure analysis within different classes of larvae also find some differences in covariation among brain areas, providing hypotheses as to underlying network effects that may govern the expression of freezing and/or evasive behavior in the memory trial phases.

      Overall, I find this to be an interesting study that employs state-of-the-art methods of behavioural analyses and whole-brain cFos analyses, but I am left a little bit confused as to what the take-home message is and what can be concluded from this complex study that mixes in analyses of strain, sex, and individuality within a quite complex assay with multiple behavioural parameters.

      My suggestions are as follows:

      (1) My first concern relates to the claim in the abstract that "We found that fear memory behavior fell into four distinct groups: non-reactive, evaders, evading freezers, and freezers".

      In my opinion, the "freezing" aspect is well supported as being both triggered by the CAS and for memory effect upon re-exposure to the tank, but I am less convinced about the "evasive" behaviour. In Figure 2, it appears that "evasiveness" is generally not increased in either the Exposure or Memory phases for many groups, and in Figure 5, it appears that "evasiveness" is expressed by nearly 50% of the fish in the pre-exposure condition before CAS addition and in all phases in the vehicle condition. Therefore, it appears that most of the expression of this behaviour is independent of any memory-based effect.

      (2) My second concern relates to the claim in the abstract that "background strain and sex influenced how fish respond to CAS, with males more likely to increase evasive behaviors than females and the TU strain more likely to be non-reactive."

      My understanding, based on the introduction and on the methods, is that it is likely important that the CAS be prepared from conspecifics of the same strain and sex, and for this reason, they prepared different CAS specific for each strain and each sex. Therefore, the "CAS" that is applied is necessarily different for each condition, and I am concerned that the differences observed could relate more to variation in the quality, purity, concentration, etc. of the specific CAS samples for different groups, rather than their reactivity to the substance or their ability to form memories based on such experiences.

      (3) My third concern relates to the interpretation of the cFos data.

      As I mentioned above, I feel as though the behavioural analysis is perhaps more complex than is warranted via the inclusion of evasiveness, and I wonder if the conclusions from the experiments would be simpler if analyzed only from the perspective of freezing.

      But considering the presented analyses: while I don't think there is anything wrong with the partial least squares approach and the network analyses, I am concerned that the simple messaging in the text does not reflect the complexity of this analysis combining different weightings of different behavioural characteristics in a behavioural contrast, or covariations among many regions and what such analyses mean at the level of brain function. For these reasons, I feel like statements along the lines of "Behavioral variation is driven by differences in the activity of brain regions outside the telencephalon, such as the cerebellum, preglomerular nuclei, preoptic area and hypothalamus" are not well supported.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      We thank the reviewer for his/her very positive comments.

      Reviewer #2 (Public review):

      We thank the reviewer for his/her positive evaluation. We plan to add RNAseq data of yeast wild-type and JDP mutant strains as more direct readout for the role of Apj1 in controlling Hsf1 activity. We agree with the reviewer that our study includes one major finding: the central role of Apj1 in controlling the attenuation phase of the heat shock response. In accordance with the reviewer we consider this finding highly relevant and interesting for a broad readership. We agree that additional studies are now necessary to mechanistically dissect how the diverse JDPs support Hsp70 in controlling Hsf1 activity. We believe that such analysis should be part of an independent study but we will indicate this aspect as part of an outlook in the discussion section of a revised manuscript.

      Reviewer #3 (Public review):

      We thank the reviewer for his/her suggestions. We agree that it is sometimes difficult to distinguish direct effects of JDP mutants on heat shock regulation from indirect ones, which can result from the accumulation of misfolded proteins that titrate Hsp70 capacity. We also agree that an in vitro reconstitution of Hsf1 displacement from DNA by Apj1/Hsp70 will be important, also to dissect Apj1 function mechanistically. We will add this point as outlook to the revised manuscript.

      Reviewer #1 (Recommendations for the authors): 

      (1) Can the authors submit the raw translatome data to a standard repository? Also, the data should be summarized in a supplemental Excel table. 

      We submitted the raw translatome data to the NCBI Gene Expression Omnibus and added the analyzed data sets (shown in Figures 1 and 5) as Supplementary Tables S4/S5 (Excel sheets). We additionally included RNAseq analysis of yeast WT and JDP mutants grown at 25°C, complementing and confirming our former translatome analysis (new Figure 5, Figure Supplement 2). The respective transcriptome raw data were also deposited at the NCBI Gene Expression Omnibus and the analyzed data are available as Supplementary Table S7.

      (2) MW indicators need to be added to the Western Blot figures. 

      We added molecular weight markers to the Western Blot figures.

      (3) Can the authors please include the sequences of the primers used in all the RT-qPCR experiments? They mention they are in the supplemental information, but I couldn't locate them. 

      We added the sequences of the RT-qPCR primers as Supplementary Table S4.

      (4) Given the clear mechanism proposed, it would be nice if the authors could provide a nice summary figure. 

      We followed the suggestion of the reviewer and illustrate our main finding as new Figure 7.

      Reviewer #2 (Recommendations for the authors): 

      (1) As mentioned above, a co-IP experiment between Hsf1 and Ssa1/2 in APJ1 and apj1∆ cells, utilizing Hsf1 alleles with and without the two known binding sites, would cement the assignment of Apj1 in the Hsf1 regulatory circuit. 

      We agree with the reviewer that Hsf1-Ssa1/2 pulldown experiments, as done by Pincus and colleagues (1), will further specify the role of Apj1 in targeting Hsp70 to Hsf1 during the attenuation phase of the heat shock response. We have extensively tried such pulldown experiments to document dissociation of Ssa1/2 from Hsf1 upon heat shock in yeast wild-type cells. While we could specifically detect Ssa1/2 upon Hsf-HA1 pulldown, our results after heat shock were highly variable and inconclusive and did not allow us to probe for a role of Apj1 or the two known Ssa1/2 binding sites in the phase-specific targeting. We now discuss the potential roles of the two distinct Ssa1/2 binding sites for phase-specific regulation of Hsf1 activity in the revised manuscript (page 12, lines 17-21).

      (2) Experiments in Figure 3 nicely localize ChIP reactions with known HSEs. A final confirmatory experiment utilizing a mutated HSE (another classic experiment in the field) would cement this finding and validate the motif and reporter-based analysis.

      We thank the reviewer for this meaningful suggestion. We have done a comparable experiment by using the non-Hsf1 regulated gene BUD3, which lacks HSEs, as reference. We engineered a counterpart, termed “BUD3 HS-UAS”, which bears inserted HSEs, derived from the native UAS of HSP82, within the BUD3 UAS. We show that BUD3⁺ lacking HSEs is not occupied by Hsf1 and Apj1 under either non-stress or heat shock conditions, while BUD3-HSE is clearly occupied under both, paralleling Hsf1 and Apj1 occupancy of HSP82 (Figure 3E). We have renamed the engineered allele to “BUD3-HSE” to clarify the experimental design and output.

      (3) Page 8 - the ydj1-4xcga allele is introduced without explaining why it's needed, since ydj1∆ cells are viable. The authors should acknowledge the latter fact, then justify why the RQC depletion approach is preferred. Especially since the ydj1∆ mutant appears in Figure 5B. 

      ydj1∆ cells are viable, yet they grow extremely slowly at 25°C and hardly at all at 30°C, making them difficult to handle. The RQC-mediated depletion of Ydj1 in ydj1-4xcga cells allows for solid growth at 30°C, facilitating strain handling and analysis of Ydj1 function. Importantly, ydj1-4xcga cells are still temperature-sensitive and exhibit the same deregulation of the heat shock response upon combination with apj1∆ as observed for ydj1∆ cells. Thus ydj1 knockout and knockdown cells do not differ in the relevant phenotypes reported here and we performed most of the analysis with ydj1-4xcga cells due to their growth advantage. We added a respective explanation to the text (page 8, lines 13-14).

      (4) The authors raise the possibility that Sis1, Apj1, and Ydj1 may all be competing for access to Ssa1/2 at different phases of the HSR, and that access may be dictated by conformational changes in Hsf1. Given that there are at least two known Hsp70 binding sites that have negative regulatory activity in Hsf1, the possibility that domain-specific association governs the different roles should be considered. It is also unclear how the JDPs are associating with Hsf1 differentially if all binding is through Ssa1/2. 

      We thank the reviewer for the comment and will add the possibility of specific roles of the identified Hsp70 binding sites in regulating Hsf1 activity at the different phases of the heat shock response to the discussion section. Binding of Ssa1/2 to substrates (including Hsf1) is dependent on J-domain proteins (JDPs), which differ in substrate specificity. It is tempting to speculate that the distinct JDPs recognize different sites in Hsf1 and are responsible for mediating the specific binding of Ssa1/2 to either N- or C-terminal sites in Hsf1. Thus, the specific binding of a JDP to Hsf1 might dictate the binding of Ssa1/2 to either binding site. We discuss this aspect in the revised manuscript (page 12, lines 17-21).

      (5) Figure 6 - temperature sensitivity of hsf1 and ydj1 mutants has been linked to defects in the cell wall integrity pathway rather than general proteostasis collapse. This is easily tested via plating on osmotically supportive media (i.e., 1M sorbitol) and should be done throughout Figure 6 to properly interpret the results.

      Our data indicate proteostasis breakdown in ydj1 cells by showing strongly altered localization of Sis1-GFP, pointing to massive protein aggregation (Figure 6 – Figure Supplement 1D).

      We followed the suggestion of the reviewer and performed spot tests in the presence of 1 M sorbitol (see figure below). The presence of sorbitol improves growth of ydj1-4xcga mutant cells at increased temperatures, in agreement with the remark of the reviewer. We do not think, however, that growth rescue by sorbitol points to specific defects of the ydj1 mutant in cell wall integrity. Sorbitol functions as a chemical chaperone and has been shown to have protective effects on cellular proteostasis and to rescue phenotypes of diverse point mutants in yeast cells by facilitating folding of the respective mutant proteins and suppressing their aggregation (2-4). Thus sorbitol can broadly restore proteostasis, which can also explain its effects on growth of ydj1 mutants at increased temperatures. The readout of the spot test with sorbitol is therefore ambiguous, and we prefer not to show it in the manuscript.

      Author response image 1.

      Serial dilutions of indicated yeast strains were spotted on YPD plates without and with 1 M sorbitol and incubated at indicated temperatures for 2 days.

      Reviewer #3 (Recommendations for the authors): 

      (1) Line 154: Can the authors, by analysis, offer an explanation for why HSR attenuation varies between genes for the sis1-4xcga strain? Is it, for example, a consequence of using a hypomorph rather than a knockout, an mRNA turnover issue, or that Hsf1 has different affinities for the HSEs in the promoters?

      We used the sis1-4xcga knock-down strain because Sis1 is essential for yeast viability. The point raised by the reviewer is highly valid and we thought extensively about the differing consequences of Sis1 depletion on levels of, e.g., translated BTN2 (minor impact) and HSP104 (strong impact) mRNA. We have since performed transcriptome analysis and confirmed the specific impact of Sis1 depletion on HSP104 mRNA levels, while BTN2 mRNA levels remained much less affected (new Figure 5 - Figure Supplement 2A/B). We compared numbers and spacings of HSEs in the respective target genes but could not identify obvious differences. Hsf1 occupancy within the UAS region of both BTN2 and HSP104 is very comparable at three different time points of a 39°C heat shock: 0, 5 and 120 min, arguing against different Hsf1 affinities to the respective HSEs (5). The molecular basis for the target-specific derepression upon Sis1 depletion thus remains to be explored. We added a respective comment to the revised version of the manuscript (page 12, lines 3-8).

      (2) Line 194: The analysis of ChIP-seq is not very elaborated in its presentation. How specific is this interaction? Can it be ruled out by analysis that it is simply the highly expressed genes after the HS that lead to Apj1 appearing there? More generally: Can the data in the main figure be presented to give a more unbiased genome-wide view of the results?

      Overall, we observed a low number of Apj1 binding events in the UAS of genes. The interaction of Apj1 with HSEs is specific as we do not observe Apj1 binding to the UAS of well-expressed non-heat shock genes. Similarly, Apj1 does not bind to ARS504 (Figure S3 – Figure Supplement 1). We extended the description of our ChIP-seq analysis procedures leading to the identification of HSEs as Apj1 target sites to make the data analysis easier to understand. We additionally re-analysed the two Apj1 binding peaks that did not reveal an HSE in our original analysis. Using a modified setting, we can identify a slightly degenerate HSE in the promoter region of the two genes (TMA10, RIE1) and changed Figure 3C accordingly. Notably, TMA10 is a known target gene of Hsf1. The expanded analysis further documents the specificity of the Apj1 binding peaks.

      (3) Line 215. Figure 3. The clear anticorrelation is puzzling. Presumably, Apj1 binds Hsf1 as a substrate, and then a straight correlation is expected: When Hsf1 substrate levels decrease at the promoters, also Apj1 signal is predicted to decrease. What explanations could there be for this? Is it, for example, that Hsf1 is not always available as a substrate on every promoter, or is Apj1 tied up elsewhere in the cell/nucleus early after HS? 

      We propose that Apj1 binds HSE-bound Hsf1 only after clearance of nuclear inclusions, which form upon heat stress. Apj1 thereby couples the restoration of nuclear proteostasis to the attenuation of the heat shock response. This explains the delayed binding of Apj1 to HSEs (via Hsf1), while Hsf1 shows highest binding upon activation of the heat shock response (early timepoints). Notably, the binding efficiencies of Hsf1 and Apj1 (% input) differ greatly: we detect strong binding of Hsf1 5 min post heat shock (30-40% of input), whereas at most 3-4% of the input is pulled down with Apj1 (60 min post heat shock) (Figure 3D). Even at this late timepoint 10-20% of the input is pulled down with Hsf1. The distinct kinetics and pulldown efficiencies suggest that Apj1 displaces Hsf1 from HSEs and, accordingly, Hsf1 stays bound to HSEs in apj1∆ cells (Figure 4). This activity of Apj1 explains the anti-correlation: increased targeting of Apj1 to HSE-bound Hsf1 will lower the absolute levels of HSE-bound Hsf1. What we observe in the ChIP experiment at the individual timepoints is a snapshot of this reaction. Accordingly, at the last timepoint analyzed (120 min after heat shock), we observe low binding of both Hsf1 and Apj1 as the heat shock response has been shut down.

      (4) Line 253: "Sis-depleted".  

      We have corrected the mistake.

      (5) Line 332: Fig. 6C SIS1 OE from pRS315. A YIP would have been better, 20% of the cells will typically not express a protein with a CEN/ARS of the pRS-series so the Sis1 overexpression phenotype may be underestimated and this may impact on the interpretation. 

      We agree with the reviewer that Yeast Integrated Plasmids (YIP) represent the gold standard for complementation assays. We are not aware of a study showing that 20% of cells harboring pRS-plasmids do not express the encoded protein. The results shown in Fig. 8C/D demonstrate that even strong overproduction of Sis1 cannot restore Hsf1 activity control. This interpretation would not be affected even if a certain percentage of these cells do not express Sis1. Nevertheless, we added a comment to the respective section pointing to the possibility that the Sis1 effect might be underestimated due to variations in Sis1 expression (page 11, lines 15-19).

      (6) Figure 1C. Since n=2, a more transparent way of showing the data is the individual data points. It is used elsewhere in the manuscript, and I recommend it. 

      We agree that showing individual data points can enhance transparency, particularly with small sample sizes. However, the log2 fold change (log2FC) values presented in Figure 1C and other figures derived from ribosome profiling and RNAseq experiments were generated using the DESeq2 package. The DESeq2 pipeline is widely used for differential gene expression analysis and is known for its statistical robustness. It performs differential expression analysis based on a model that incorporates normalization, dispersion estimation, and shrinkage of fold changes. The pipeline automatically accounts for biological and technical variability and for batch effects, thereby improving the reliability of results. These log2FC values are not directly calculated from log-transformed normalized counts of individual samples but are instead estimated from a fitted model comparing group means. Therefore, individual replicate values cannot be shown for DESeq2 log2FC estimates.

      (7) Figure 1D. Please add the number of minutes on the X-axis. Figure legend: "Cycloheximide" is capitalized.  

      We revised the figure and figure legend as recommended.

      (8) Several figure panels: Statistical tests and SD error bars for experiments performed in duplicates simply feel wrong for this reviewer. I do recognize that parts of the community are calculating, in essence, quasi-p-values using parametric methods for experiments with far too low sample numbers, but I recommend not doing so. In my opinion, better to show the two data points and interpret with caution.

      We followed the advice of the reviewer and removed statistical tests for experiments based on duplicates.

      References

      (1) Krakowiak, J., Zheng, X., Patel, N., Feder, Z. A., Anandhakumar, J., Valerius, K. et al. (2018) Hsf1 and Hsp70 constitute a two-component feedback loop that regulates the yeast heat shock response eLife 7,

      (2) Guiberson, N. G. L., Pineda, A., Abramov, D., Kharel, P., Carnazza, K. E., Wragg, R. T. et al. (2018) Mechanism-based rescue of Munc18-1 dysfunction in varied encephalopathies by chemical chaperones Nature communications 9, 3986

      (3) Singh, L. R., Chen, X., Kozich, V., and Kruger, W. D. (2007) Chemical chaperone rescue of mutant human cystathionine beta-synthase Mol Genet Metab 91, 335-342

      (4) Marathe, S., and Bose, T. (2024) Chemical chaperone - sorbitol corrects cohesion and translational defects in the Roberts mutant bioRxiv 10.1101/2024.09.04.610945

      (5) Pincus, D., Anandhakumar, J., Thiru, P., Guertin, M. J., Erkine, A. M., and Gross, D. S. (2018) Genetic and epigenetic determinants establish a continuum of Hsf1 occupancy and activity across the yeast genome Mol Biol Cell 29, 3168-3182

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      This manuscript from Jones and colleagues investigates a previously described phenomenon in which P. falciparum malaria parasites display increased trafficking of proteins displayed on the surface of infected RBCs, as well as increased cytoadherence in response to febrile temperatures. While this parasite response was previously described, it was not uniformly accepted, and conflicting reports can be found in the literature. This variability likely arises due to differences in the methods employed and the degree of temperature increase to which the parasites were exposed. Here, the authors are very careful to employ a temperature shift that likely reflects what is happening in infected humans and that they demonstrate is not detrimental to parasite viability or replication. In addition, they go on to investigate what steps in protein trafficking are affected by exposure to increased temperature and show that the effect is not specific to PfEMP1 but rather likely affects all transmembrane domain-containing proteins that are trafficked to the RBC. They also detect increased rates of phosphorylation of trafficked proteins, consistent with overall increased protein export.

      Strengths:

      The authors used a relatively mild increase in temperature (39 degrees), which they demonstrate is not detrimental to parasite viability or replication. This enabled them to avoid potential complications of a more severe heat shock that might have affected previously published studies. They employed a clever method of fractionation of RBCs infected with a var2csa-nanoluc fusion protein expressing parasite line to determine which step in the export pathway was likely accelerating in response to increased temperature. This enabled them to determine that export across the PVM is being affected. They also explored changes in phosphorylation of exported proteins and demonstrated that the effect is not limited to PfEMP1 but appears to affect numerous (or potentially all) exported transmembrane domain-containing proteins.

      Weaknesses:

      All the experiments investigating changes resulting from increased temperature were conducted after an increase in temperature from 16 to 24 hours, with sampling or assays conducted at the 24 hr mark. While this provided consistency throughout the study, this is a time point relatively early in the export of proteins to the RBC surface, as shown in Figure 1E. At 24 hrs, only approximately 50% of wildtype parasites are positive for PfEMP1, while at 32 hrs this approaches 80%. Since the authors only checked the effect of heat stress at 24 hrs, it is not possible to determine if the changes they observe reflect an overall increase in protein trafficking or instead a shift to earlier (or an accelerated) trafficking. In other words, if a second time point had been considered (for example, 32 hrs or later), would the parasites grown in the absence of heat stress catch up?

      We did not assess cytoadhesion at later stages, but in the supplementary figures we show that at 40 hours post infection both heat stress and control conditions have comparable proportions of VAR2CSA-positive iRBCs, whereas they differ at 24 hours. This is true for the DMSO-treated (control, wildtype-like) HA-tagged lines of HSP70x and PF3D7_072500 (Supplementary Figures 9 and 12, respectively). Given that protein levels appear unchanged, we conclude that trafficking is accelerated at these earlier timepoints but is comparable at later stages. This would still increase the overall bound parasite mass, as parasites start to adhere earlier during or after a heat stress.

      Reviewer #2 (Public review):

      This manuscript describes experiments characterising how malaria parasites respond to physiologically relevant heat-shock conditions. The authors show, quite convincingly, that moderate heat-shock appears to increase cytoadherence, likely by increasing trafficking of surface proteins involved in this process.

      While generally of a high quality and including a lot of data, I have a few small questions and comments, mainly regarding data interpretation.

      (1) The authors use sorbitol lysis as a proxy for trafficking of PSAC components. This is a very roundabout way of doing things and does not, I think, really show what they claim. There could be a myriad of other reasons for this increased activity (indeed, the authors note potential PSAC activation under these conditions). One further reason could be a difference in the membrane stability following heat shock, which may affect sorbitol uptake, or the fragility of the erythrocytes to hypotonic shock. I really suggest that the authors stick to what they show (increased PSAC) without trying to use this as evidence for increased trafficking of a number of non-specified proteins that they cannot follow directly.

      This is a valid point, however, uninfected RBCs do not lyse following heat stress, nor do much younger iRBCs, indicating that the observed effect is specific to infected RBCs at a defined stage. The sorbitol sensitivity assay is performed at 37°C under normal conditions after cells are returned to non–heat stress temperatures, so the effect is not due to transient changes in membrane permeability at elevated temperature. 

      Planned experiment: To strengthen our conclusions and further test our hypothesis, we will perform sorbitol sensitivity assays on iRBCs at >20 hours post infection following heat stress, in the presence and absence of furosemide, a PSAC inhibitor. If iRBC lysis is abolished in the presence of furosemide, this would confirm that the effect is PSAC-dependent. However, the effect could also be due to altered PSAC activity during heat stress that is maintained at lower temperatures, as outlined in the discussion.

      (2) Supplementary Figure 6C/D: The KAHRP signal does not look like it should. In fact, it doesn't look like anything specific. The HSP70-X signal is also blurry and overexposed. These pictures cannot be used to justify the authors' statements about a lack of colocalisation in any way.

      Planned experiment: We agree that the IFAs are not the best as presented and will include better quality supplementary images in a revised version.

      (3) Figure 6: This experiment confuses me. The authors purport to fractionate proteins using differential lysis, but the proteins they detect are supposed to be transmembrane proteins and thus should always be found associated with the pellet, whether lysis is done using equinatoxin or saponin. Have they discovered a currently unknown trafficking pathway to tell us about? Whilst there is a lot of discussion about the trafficking pathways for TM proteins through the host cell, a number of studies have shown that these proteins are generally found in a membrane-bound state. The authors should elaborate, or choose an experiment that is capable of showing compartment-specific localisation of membrane-bound proteins (protease protection, for example).

      We do not believe we identified a novel trafficking pathway, but rather that we capture trafficking intermediates of PfEMP1 between the PVM and the RBC periphery, in small vesicles and/or possibly Maurer’s clefts. These would still be membrane-embedded but, because of their small size, would not be pelleted at the centrifugation speeds used in our study (we did not use ultracentrifugation). This explanation, we believe, is in line with the current hypothesis of PfEMP1 and other exported TMD protein trafficking to the periphery or the Maurer’s clefts.

      (4) The red blood cell contains, in addition to HSP70-X, a number of human HSPs (HSP70 and HSP90 are significant in this current case). As the name suggests, these proteins non-specifically shield exposed hydrophobic domains revealed upon partial protein unfolding following thermal insult. I would thus have expected to find significantly more enrichment following heat shock, but this is not the case. Is it possible that the physiological heat shock conditions used in this current study are not high enough to cause a real heat shock?

      As noted by the reviewer, we do not see enrichment of red blood cell heat shock proteins following heat stress, either with FIKK10.2-TurboID or in the phosphoproteome. We used a physiologically relevant heat stress that significantly modifies the iRBC, as shown by our functional assays. While a higher temperature might induce an association of red blood cell heat shock proteins, such conditions may not accurately reflect the most commonly found context of malaria infection.

      Reviewer #3 (Public review):

      Summary:

      In this paper, it is established that high fever-like 39 C temperatures cause parasite-infected red blood cells to become stickier. It is thought that high temperatures might help the spleen to destroy parasite-infected cells, and they become stickier in order to remain trapped in blood vessels, so they stop passing through the spleen.

      Strengths:

      The strength of this research is that it shows that fever-like temperatures can cause parasite-infected red blood cells to stick to surfaces designed to mimic the walls of small blood vessels. In a natural infection, this would cause parasite-infected red blood cells to stop circulating through the spleen, where the parasites would be destroyed by the immune system. It is thought that fevers could lead to infected red blood cells becoming stiffer and therefore more easily destroyed in the spleen. Parasites respond to fevers by making their red blood cells stickier, so they stop flowing around the body and into the spleen. The experiments here prove that fever temperatures increase the export of Velcro-like sticky proteins onto the surface of the infected red blood cells and are very thorough and convincing.

      Weaknesses:

      A minor weakness of the paper is that the effects of fever on the stiffness of infected red blood cells were not measured. This can be easily done in the laboratory by measuring how the passage of infected red blood cells through a bed of tiny metal balls is delayed under fever-like temperatures.

      Previous work by Marinkovic et al. (cited in this manuscript) reported that all RBCs, both infected and uninfected, increase in stiffness at 41 °C compared with 37 °C, with trophozoites and schizonts exhibiting a particularly pronounced increase. We agree that it would be interesting to determine whether similar changes occur at physiological fever-like temperatures, and whether this increase in stiffness coincides with the period of elevated protein trafficking. However, since we have already demonstrated enhanced protein export using multiple complementary approaches, we have chosen to address these questions in a follow-up study.

    1. Different individuals, cultures and societies may place more value on one type of knowing than another, although most use a combination that includes science and religion.

      It's interesting to look at the differences in cultural values and morals, especially when it comes to knowing and understanding. For centuries, society has continued to attempt to answer the questions of knowing using both science and religion as a temporary band-aid for what we have yet to understand. In no way do I think using these concepts is a negative thing; on the contrary, it gives us something that keeps us content.

    1. But it has progressively realized that there is some kind of intelligibility in the world, that the world can, in part, be understood, and that we have experiences which, if properly interrogated, will yield answers to our questions.

      I don't think this is true. I don't think that we can always have an answer to our questions. Even when we do it can change. I don't think that there can ever truly be a universal truth because as humans we are not able to truly grasp all of the ideas and concepts that go on in the world. Even when we think we find a truth, that may change for us in the future.

    1. Author response:

      The following is the authors’ response to the original reviews

      We would like to express our sincere gratitude to the reviewers for their thorough analysis of the manuscript and their extremely helpful comments. We have taken all the suggestions into consideration and conducted a range of additional experiments to address the points raised. We have also extensively revised the manuscript to clarify descriptions, correct inaccuracies and remove inconsistencies. We have modified the figures for clarity and content.

      Overall, we expanded the description of the EBH structure to emphasise its dimeric nature and the impact of the two binding sites on interpreting the binding data, including cooperativity. Using ITC, we tested the effect of the pre-SxIP residues on the binding affinity with additional peptides. We found that these residues had a significant effect, albeit much smaller than that of the post-SxIP residues. We analysed the binding of the 11MACF-VLL mutant with EBH-ΔC and evaluated the exchange rates. In agreement with our model, we found that the EBH affinity for the SxIP peptide from CK5P2 (KKSRLPRILIKRSR), which has a C-terminal sequence similar to that of the 11MACF-VLLRK mutant, is 21 nM, which is similar to the affinity of the mutant itself. This demonstrates the significant variation in affinity observed among natural SxIP ligands, as predicted by our study. Our responses to the specific points raised by the reviewers are provided below.

      Reviewer #1 (Public Review):

      There is no direct experimental evidence for independent dock and lock steps. The model is certainly plausible given their structural data, but all titration and CEST measurements are fully consistent with a simple one-step binding mechanism. Indeed, it is acknowledged that the results for the VLL peptide are not consistent with the predictions of this model, as affinity and dissociation rates do not co-vary. The model may still be a helpful way to interpret and discuss their results, and may indeed be the correct mechanism, but this has not yet been proven.

      Unfortunately, it is not possible to obtain direct experimental evidence because the folding of the C-terminus is too fast to influence the NMR parameters. However, as the reviewer pointed out, our structural data support the two-step model, since folding of the C-terminus is only possible once the ligand containing the post-SxIP residues has bound. By adopting a mechanistically supported model, we can analyse the contributions to binding and relate them to the structural characteristics of the complex. This provides a clearer insight into the roles of the various regions in the interaction and allows them to be modified rationally to enhance ligand affinity.

      In the revised version, we restate the equations in terms of comparing the on-rates. This provides a clearer view of the effect of the additional stage, which cannot increase the overall on-rate since the two stages are sequential. If the forward rate of the second stage is comparable to or slower than the off-rate of the first stage, the overall on-rate decreases. Conversely, if the forward rate is much faster, the overall on-rate remains unchanged. For the wild-type 11MACF peptide, we observed that the presence of the EBH C-terminus does not affect the on-rate of binding, which is in perfect agreement with the two-step model and indicates that the C-terminus folds very quickly.
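
      As a rough illustration of the argument above, the sketch below evaluates the textbook steady-state expression for a sequential dock-and-lock scheme, kon(app) = k1 * k2 / (k-1 + k2). The function name and all rate constants are hypothetical placeholders rather than values from the manuscript, and the reverse of the locking step is ignored because it does not enter the apparent on-rate.

```python
def apparent_on_rate(k1, k_minus1, k2):
    """Steady-state apparent on-rate for E + L <-> EL* -> EL.

    k1       : docking on-rate (M^-1 s^-1)
    k_minus1 : undocking off-rate of the intermediate (s^-1)
    k2       : forward rate of the locking (folding) step (s^-1)
    """
    return k1 * k2 / (k_minus1 + k2)


k1 = 1.0e7        # M^-1 s^-1, hypothetical docking on-rate
k_minus1 = 100.0  # s^-1, hypothetical undocking rate

# Hypothetical locking rates: slower than, equal to, and much faster than k_minus1.
for k2 in (1.0e1, 1.0e2, 1.0e5):
    ratio = apparent_on_rate(k1, k_minus1, k2) / k1
    print(f"k2 = {k2:8.0f} s^-1 -> kon(app) / k1 = {ratio:.3f}")
```

      Under these assumptions the ratio approaches 1 when k2 greatly exceeds k-1 (on-rate unchanged), drops below 1 when k2 is comparable to or slower than k-1, and can never exceed 1.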

      Additionally, we evaluated the binding of the 11MACF-VLL mutant to EBH-ΔC and observed a twofold decrease in Kd compared to WT 11MACF, primarily due to an increase in the on-rate. Interestingly, this rate is approximately half the overall on-rate for EBH/11MACF-VLL binding, contradicting the sequential two-step model. This suggests a more complex binding process where binding is accelerated by additional hydrophobic interactions with the unfolded C-terminus. However, given the difficulty of quantifying very slow exchange rates, it is more likely that the discrepancy is due to the accuracy of the rate measurements. Therefore, the model allows the rational analysis of changes in binding parameters due to mutations.

      There is little discussion of the fact that binding occurs to EBH dimers - either in terms of the functional significance of this or in the acquisition and analysis of their data. There is no discussion of cooperation in binding (or its absence), either in the analysis of NMR titrations or in ITC measurements. Complete ITC fit results have not been reported so it is not possible to evaluate this for oneself.

      We added information about the dimer to the introduction, emphasising its role in enhancing interaction with microtubules (MTs) and its structural role in SxIP binding. The ITC data do not exhibit any biphasic behaviour and can be fitted to a single-site model with 1:1 stoichiometry relative to the EB1c monomer. This corresponds to two independent binding sites in the dimer. We have added the stoichiometry to Table 1 and the description. The NMR titration data for the 11MACF and 11MACF-VLL interactions were fitted to the TITAN dimer model, which includes cooperativity parameters. For WT 11MACF, both cooperativity parameters were zero, corresponding to independent binding sites in the ITC model. For 11MACF-VLL, the fitting suggests weak negative cooperativity, with a ~3-fold increase in Kd for binding to the second site and no change in the off-rate. This difference in Kd is likely to be too small to induce a biphasic shape to the ITC curve. As the cooperativity effect on the NMR spectra is small and absent in the ITC, we used the independent sites model for data analysis, as there is insufficient justification for introducing extra parameters into the model. Crucially, fitting to this model did not alter the off-rate value obtained by NMR or affect the conclusions. We added a description of cooperativity to the results and discussion.
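
      To put the ~3-fold Kd difference in energetic terms, the following minimal sketch converts a fold change in Kd into a free-energy difference using ΔΔG = RT ln(Kd2/Kd1). The temperature of 298.15 K and the function name are assumptions for illustration; the response does not state the experimental temperature here.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1


def ddg_from_kd_ratio(kd_ratio, temperature_k=298.15):
    """Free-energy difference (kcal/mol) for a given fold change in Kd.

    temperature_k defaults to 298.15 K, an assumed value.
    """
    return R_KCAL * temperature_k * math.log(kd_ratio)


# ~3-fold weaker binding to the second site, as suggested by the NMR fit.
print(f"ddG for 3-fold change in Kd: {ddg_from_kd_ratio(3.0):.2f} kcal/mol")
```

      At the assumed temperature this is on the order of RT, which is consistent with the statement that such a difference would be hard to resolve as a biphasic ITC curve.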

      Three peptides are used to examine the role of C-terminal residues in SxIP motifs: 4-MACF (SKIP), 6-MACF (SKIPTP), and 11-MACF (KPSKIPTPQRK). The 11-mer demonstrates the strongest binding, but this has added residues to the N-terminal as well. It has also introduced charges at both termini, further complicating the interpretation of changes in binding affinities. Given this, I do not believe the authors can reasonably attribute increased affinities solely to post-SxIP residues.

      We tested the 9MACF peptide SKIPTPQRK, which has the same N-terminus as the 4- and 6-MACF peptides, and found that its binding affinity is ~10-fold weaker than that of 11MACF. This demonstrates the contribution of both the pre- and post-SxIP residues. This is likely due to electrostatic interactions between the positively charged N-terminus and the negatively charged EBH surface, similar to those involving the positive charges at the peptide C-terminus. Although significant, the contribution of the N-terminal peptide region is approximately one order of magnitude lower than that of the post-SxIP residues, meaning the post-SxIP region is the main affinity modulator. We have added the binding data on 9MACF and a discussion of the contributions to the manuscript.

      Experimental uncertainties are, with exceptions, not reported.

      Uncertainties have been added to the numbers in Table 1 and the text. Information on how the uncertainties were calculated has been added to Table 1.

      Reviewer #1 (Recommendations For The Authors):

      (1) Have you tested the binding of the WT dimer in your cell model?

      We haven’t tested the WT dimer because it has already been reported in the 2009 Cell paper by Honnappa et al. In the cell experiments, our main focus was on recruiting the high-affinity mutant to MTs. The low level of recruitment, despite the mutant's high affinity, highlights the importance of dimerisation or additional contributions to binding.

      (2) Please deposit all NMR dynamics measurements (relaxation rates and derived model-free parameters) alongside structural data in the BMRB.

      The relaxation data have been submitted to BMRB, IDs 53187 and 53188.

      (3) Please report complete fitting results, e.g. for ITC, including stoichiometries. Clarify what this means for binding to a dimer, and if there is any evidence of cooperativity. Figure 3C, right hand panel, shows an unusual stoichiometry, can the authors comment on this?

      We have added more information on stoichiometry and cooperativity; please refer to our response to the above comment for details. We repeated the titration for the VLLRK mutant using fresh peptide stock. As expected, the stoichiometry was close to 1:1 relative to the EB1c monomer. The new data are now included in the table and figure.

      (4) Please report uncertainties for all measurements of Kd, koff, kon, ∆G, ∆H, ∆S, and explain whether these are determined from statistical analysis, technical or biological repeats (and where reported, clarify between standard deviation/standard error). Please also be aware of standard guidelines for reporting significant figures for data with uncertainties, as these have not been followed in Table 1.

      Uncertainties have been added to the numbers in Table 1 and the text. Information on how the uncertainties were calculated has been added to Table 1.

      (5) The construct design for the cell model is unclear - given the importance of flanking residues, please report and discuss how the sequences are attached to venus: which termini is attached, and what is the linker composition?

      We cloned the peptides at the C-terminus of mTFP, after the GS linker of the vector. The peptide itself contains a GS sequence at the N-terminus, creating a highly flexible GSGS linker that separates the SxIP region from mTFP and minimises the potential effect of mTFP on binding. We followed the design of Honnappa et al. to enable direct comparison with the published results. We have added this information to the 'Methods' section.

      (6) Which HSQC pulse sequence was used for 2D lineshape analysis? The authors mention non-linear chemical shift changes, presumably associated with the dimer interface - this would be useful to expand upon and clarify.

      For the lineshape analysis, we used the standard Bruker sequence hsqcfpf3gpphwg with soft-pulse watergate water suppression and flip-back. This sequence is included in the TITAN model. We added the description of the non-linear chemical shift changes and connection of these changes to the allosteric effect of the binding to the supplementary information describing details of the lineshape analysis.

      (7) Figure 1A could usefully highlight the dimer interface in the surface representation also.

      We believe that including the interface would make the figure too complicated. The dimer configuration is shown in different colours for the two subunits, clearly demonstrating their involvement in forming the binding site.

      (8) Figures 1C and 1D could usefully show a secondary structure schematic to assist the reader. The x-axis in these figures is not linear and this should be corrected. The calculation of combined chemical shift perturbations should be described.

      Thank you for the helpful suggestion. We changed the scale of the figures and added the diagram of the secondary structure.

      (9) Units are missing from many figure axes.

      We added missing units to the axes. Thank you for highlighting this.

      (10) What peptide concentrations are used in Figure 1C? Presumably, these should be reported at saturation for this to be a fair comparison, this should be clarified.

      The protein concentration was 50 µM. Peptides 4MACF and 6MACF were added at a 100-fold molar excess and peptide 11MACF was added at a 4-fold excess. Saturation was achieved for 11MACF. This was impossible for the short peptides due to their mM affinity. This information has been added to the figure legend. The figure's main aim is to illustrate the differences in the chemical shift perturbation profiles, which can be achieved even if full saturation is not attained. Although the absolute value of the chemical shifts is proportional to the degree of saturation, the distribution of the largest chemical shift changes is independent of this degree. Therefore, we can draw conclusions about the distribution of changes by comparing under non-saturation conditions.
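
      The degree of saturation implied by these conditions can be estimated with the exact 1:1 binding equation. The sketch below is only illustrative: the protein concentration and molar excesses come from the response, but the Kd values (1 mM for the short peptides, 1 µM for 11MACF) and the function name are hypothetical stand-ins, and each EB1c monomer is treated as one independent site.

```python
import math


def fraction_bound(p_total, l_total, kd):
    """Fraction of protein sites occupied for 1:1 binding.

    All three arguments must share the same concentration unit.
    Uses the exact quadratic solution, so ligand depletion is accounted for.
    """
    b = p_total + l_total + kd
    complex_conc = (b - math.sqrt(b * b - 4.0 * p_total * l_total)) / 2.0
    return complex_conc / p_total


protein = 50.0  # µM, as stated in the response

# Short peptides at 100-fold excess; Kd = 1000 µM is a hypothetical "mM affinity".
short_peptide = fraction_bound(protein, 100 * protein, kd=1000.0)

# 11MACF at 4-fold excess; Kd = 1 µM is a hypothetical placeholder.
long_peptide = fraction_bound(protein, 4 * protein, kd=1.0)

print(f"short peptide: {short_peptide:.2f}, 11MACF: {long_peptide:.2f}")
```

      With these assumed affinities the short peptides remain well short of full occupancy even at 100-fold excess, whereas a tight binder is essentially saturated at 4-fold excess.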

      (11) The presentation of raw peak intensities in Figure 1D shows primarily the flexibility of the C-terminal region associated with high intensities. Beyond this, when comparing the binding of peptides it would be much more informative to show relative peak intensities. Residues around 210-225 appear to show strong broadening in the presence of peptide, but this is masked by the low initial intensity. Can the authors clarify and discuss this? Also, what peptide concentrations were used for this comparison? For a fair comparison, it should be close to saturation - particularly to exclude exchange broadening contributions.

      The protein concentration was 50 µM. The 4MACF and 6MACF peptides were added at a 100-fold excess and 11MACF at a 4-fold excess. Saturation was achieved for 11MACF. This was impossible to achieve for the short peptides due to their mM affinity. This information has been added to the figure legend. Upon checking the data, we found a small systematic offset in the coiled-coil region of some of the complexes, as the integral intensity had been used in the initial plot. While this does not change the conclusion regarding the high dynamics of the C-terminus, it does create an inaccurate perception of the relative intensities of the folded regions in the different complexes, as noted by the reviewer. We have now plotted the amplitudes at the maximum of the peaks, which do not exhibit any systematic offset as they are much less susceptible to baseline distortions. We are grateful to the reviewer for highlighting this apparent discrepancy.

      (12) Figure 2 - the scale for S2 order parameters appears to be backwards, given the caption, but its range should be indicated. Similarly, the range of values for Rex should also be indicated. These data should also be tabulated/plotted in supporting information.

      We have corrected the figure legend and added S2 and Rex plots to the supplementary material. The figure aims to highlight regions of increased mobility, while the plots provide full quantitative information on the values. We thank the reviewer for pointing out the error in the figure legend and for the suggestions regarding the plots.

      (13) The scale in Figure 3B is illegible. Indeed, the whole structure is quite small and could usefully be expanded.

      We increased the size of the structure panels and added a scale.

      (14) Figure 4 does not show a decrease in exchange rates, as per the caption - no comparison of exchange rates is shown, only thermodynamic information in panel E. Panel C shows CEST measurements, but it is not clear what system this is for - please clarify, and consider showing the comparable data for the ∆C construct for comparison.

      We have amended the figure legend to clarify that the figure shows binding parameters. We added information about the CEST profiles for the EBH/11MACF interaction to the figure legend (Figure 4C). Exchange with the ∆C construct is too fast for CEST measurements. We used lineshape analysis to evaluate the exchange rates for this construct.

      (15) The schematics shown in Figure 4D, and elsewhere, are really quite difficult to understand. They may pose additional challenges to colourblind readers. Please consider ways that this could be clarified.

      We simplified the colour scheme in the model to make the colours easier to see and to highlight SxIP and non-SxIP regions. We believe that this improved the clarity of the figure.

      (16) Figures S1D/E - the x-axes are unclear and units are missing from the y-axes.

      We re-labelled the axes to clarify the scale and units. Thank you for pointing this out.

      Reviewer #2 (Public Review):

      The C-terminal tail of EB1, which is adjacent to EBH and is not analyzed in this study, is highly acidic and plays an important role in protein interactions. If the authors discuss the C-terminus of EB1, they should analyze the whole C-terminus of EB1, which would strengthen the conclusion they have made.

      Honnappa et al., Cell, 2009, reported chemical shift perturbations (CSPs) upon peptide binding for the full EB1c fragment, which includes the negatively charged C-terminus. Similar to our study, they observed significant CSPs in the FVIP region but negligible CSPs at the negatively charged EEY end. They concluded that the final eight EB1c residues did not contribute to binding and used a truncated EB1c construct for their structural analysis. Building on that study, we used the same EEY-truncated construct to analyse the contribution of the C-terminus in more detail. We believe that conducting additional experiments with the full C-terminus with respect to SxIP binding would be superfluous, as it would merely replicate the findings of Honnappa et al. We have added the rationale for selecting the truncated EB1c construct to the text, referencing Honnappa et al.

      Reviewer #2 (Recommendations For The Authors):

      (1) Figure 2C: The authors can analyze the 11MACF peptide as well, to provide more assurance to their argument. It would be easier to distinguish the sequences of "SKIP" and "FVIP" by changing their colors.

      Our relaxation analysis (Fig. 2C) focuses on the dynamics of the unstructured C-terminal region in both the free and complex forms. Further relaxation analysis of the peptide would not provide additional information on this, and would be complicated by the presence of free peptide in solution.

      (2) Figure 3B: Acidic residues in EBH should be labeled. Page 6, line 11: If the authors insist that the acidic patch will influence the interactions between EB1 and the peptide, the data of the analysis using the entire EB1 C-terminus should be included, given that the C-terminal tail of EB1 is highly acidic.

      To test the contribution of charge to binding, we conducted an ITC experiment at increasing salt concentrations. We observed a significant increase in Kd values when the concentration of NaCl increased from 50 to 150 mM, which supports our conclusion regarding the significant electrostatic contribution. This conclusion is independent of the presence or absence of the C-terminus.

      As we explained earlier, Honnappa et al., Cell 2009, conducted an NMR experiment on the full EB1c and observed no CSPs in the EEY region, indicating a negligible contribution from the EEY region to SxIP binding. Therefore, we think that additional experiments involving the entire C-terminus are unnecessary, as they would simply replicate the results of Honnappa et al. We have added the rationale for selecting the truncated EB1c to the text, referencing Honnappa et al.

      It would be very difficult to label the acidic residues without enlarging 3B considerably. However, we do not think this is necessary as we are not discussing any specific residues. The current figure shows the distribution of the surface charge, which is sufficient for our purposes.

      (3) Figure 2B (Page 4, line 27): The side chain of S5477 should be drawn. The authors should include a figure of the crystal structure of EBH and SxIP as a comparison (Honnappa et al., Cell, 2009). In their paper, Honnappa et al. performed chemical shift perturbation titrations by NMR. From their analysis, I imagine that the EB1 tail may not be critical for the EB1 C-terminus:SxIP interactions, since the signals in the tail are not significantly perturbed. The authors should cite this paper.

      We are grateful to the reviewer for highlighting this. The CSP analysis by Honnappa et al. revealed significant changes in the FVIP region, which we also observed. They also reported negligible CSPs at the EEY end, demonstrating that this part of the tail is non-critical and can be removed. We have added text to the manuscript to highlight the similarity between our CSPs and those observed by Honnappa et al. Figure 2B shows the side chains for the residues with the strongest detected contacts. These do not include S5477.

      (4) Figure 3C (ITC data): The stoichiometric ratios in the ITC data look strange. EBH vs KPSKIPVLLRKRK, is it 1:1?

      We repeated the ITC experiments using a new stock of the peptide and a new batch of the protein, checking the concentrations using UV spectroscopy. The new experiments produced a stoichiometry close to 1, as shown in the table.

      (5) Page 10, line 27: "The TPQ sequence of 11MACF is not optimal...": What is the meaning of "optimal"? The transient interaction between EB1 and its binding partner is responsible for the dynamics of the microtubule cytoskeleton. In a sense, the relatively weak interaction is "optimal" for the system. The authors should rephrase the word.

      We agree that weak interactions are optimal from a functional perspective, as they have been selected through evolution. In our case, 'optimal' refers to the hydrophobic interaction with the C-terminus. We replaced 'optimal' with 'ideal' to draw more attention to the second part of the sentence, which clarifies the context.

      (6) Page 11, line 2: "small number of comets enriched in the peptide that were too faint for the quantitative analysis, comparable to the reported previously (Honnappa, Gouveia et al. 2009)." Honnappa et al. used EGFP-fusion constructs in their study: EGFP forms a weak dimer, which presumably gave different results from the authors' mTFP-constructs. The authors can note this point in the text.

      We are grateful to the reviewer for highlighting this. This aligns well with our conclusion that dimerisation is important for localisation to comets. We have added this point to the text.

      (7) Page 10, line 21: The authors calculate the free energy of complex formation between EBH and MACF peptide and explain in the text, but it is hard to follow.

      We simplified and clarified the description of the energy contributions by focusing on the SxIP and non-SxIP regions of the peptide, as well as the EBH C-terminus.

      Minor points:

      Page 2, line 9: IP motifs are not usually located in the C-terminus. For example, SxIP in Tastin is located in the N-terminal region, and SxIPs in CLASP are in the middle.

      We corrected this statement, removing C-terminal.

      Page 3, line 4: The authors should note the residue numbers of SKIP.

      We think that in this context the residue numbers of the SxIP region are not important and would be distracting.

      Figure 3D and Figure S3F: Make the colors and the order the same between the two figures.

      We changed the colour scheme and the order of ITC parameters in S3F to match the main figure.

      Figure 1A, 2B, Figure S5: Change the color of SKIP from other residues in the same chain, otherwise the readers cannot distinguish. Likewise, change the color of FVIP in Figure 2B.

      We think that changing the colours would complicate the figures unnecessarily. The corresponding residues are clearly labelled in the figures.

      Figure 3, Figure S5, S6, S7: Box the letters of SKIP for clarity.

      We boxed the SxIP region in S5 (new S6) and underlined it in S6 (new S7). In S7 (new S8) the location of SxIP is very clear from the homology.

      Figure 3B; Figure S2: Hard to recognize the peptide (MACF in green).

      We increased the size of 3D and S2, making it easier to see the peptide.

      Figure 1C and D: Make the residual numbers of the x-axes the same between the two graphs.

      We made new plots with a linear scale for the residue numbers.

      Figure 2A: The structures shown are not EB1. It should be described as EBH or EB1(191-260 a.a.).

      Corrected.

      Page 5, line 17: "the S2 values of the C-terminus" should be "the S2 values of the C-terminal loop in EBH", otherwise it is confusing.

      Corrected.

      Page 6, line 27; Figure S3C and S6: Please indicate the assignments of the resonances from "253FVI255" in the Figures.

      We labelled the peaks corresponding to the 253FVI255 region in figure S6 (new S7). Figure S3 shows EBH-ΔC that does not include this region.

      Page 7, line 25: Figure S7 should be S8.

      Corrected.

      Page 12, line 6: "sulfatrahsferases" must be a typo.

      Corrected.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      This paper presents a computational model of the evolution of two different kinds of helping ("work," presumably denoting provisioning, and defense tasks) in a model inspired by cooperatively breeding vertebrates. The helpers in this model are a mix of previous offspring of the breeder and floaters that might have joined the group, and can either transition between the tasks as they age or not. The two types of help have differential costs: "work" reduces "dominance value," (DV), a measure of competitiveness for breeding spots, which otherwise goes up linearly with age, but defense reduces survival probability. Both eventually might preclude the helper from becoming a breeder and reproducing. How much the helpers help, and which tasks (and whether they transition or not), as well as their propensity to disperse, are all evolving quantities. The authors consider three main scenarios: one where relatedness emerges from the model, but there is no benefit to living in groups, one where there is no relatedness, but living in larger groups gives a survival benefit (group augmentation, GA), and one where both effects operate. The main claim is that evolving defensive help or division of labor requires the group augmentation; it doesn't evolve through kin selection alone in the authors' simulations.

      This is an interesting model, and there is much to like about the complexity that is built in. Individual-based simulations like this can be a valuable tool to explore the complex interaction of life history and social traits. Yet, models like this also have to take care of both being very clear on their construction and exploring how some of the ancillary but potentially consequential assumptions affect the results, including robust exploration of the parameter space. I think the current manuscript falls short in these areas, and therefore, I am not yet convinced of the results. Much of this is a matter of clearer and more complete writing: the Materials and Methods section in particular is incomplete or vague in some important junctions. However, there are also some issues with the assumptions that are described clearly.

      Below, I describe my main issues, mostly having to do with model features that are unclear, poorly motivated (as they stand), or potentially unrealistic or underexplored.

      We would like to thank the reviewer for the thoughtful comments that helped us to greatly improve the clarity of our paper.  

      One of the main issues I have is that there is almost no information on what happens to dispersers in the model. Line 369-67 states dispersers might join another group or remain as floaters, but gives no further information on how this is determined. Poring through the notation table also comes up empty as there is no apparent parameter affecting this consequential life history event. At some point, I convinced myself that dispersers remain floaters until they die or become breeders, but several points in the text contradict this directly (e.g., l 107). Clearly this is a hugely important model feature since it determines fitness cost and benefits of dispersal and group size (which also affects relatedness and/or fitness depending on the model). There just isn't enough information to understand this crucial component of the model, and without it, it is hard to make sense of the model output.

      We use the same dispersal gene β to represent the likelihood an individual will either leave or join a group, thereby quantifying both dispersal and immigration using the same parameter. Specifically, individuals with higher β are more likely to remain as floaters (i.e., disperse from their natal group to become a breeder elsewhere), whereas those with lower β are either more likely to remain in their natal group as subordinates (i.e., queue in a group for the breeding position) or join another group if they dispersed.  

      We added in the text “Dispersers may migrate to another group to become subordinates or remain as floaters waiting for breeding opportunities, which is also controlled by the same genetic dispersal propensity as subordinates” to clarify this issue. We also added in Table 1 that β is the “genetic predisposition to disperse versus remain in a group”, and to Figure 1 that “subordinates in the group (natal and immigrants) […]” after we already clarified that “Dispersers/floaters may join a random group to become subordinates.”

      Related to that, it seems to be implied (but never stated explicitly) that floaters do not work, and therefore their DV increases linearly with age (H_work in eq.2 is zero). That means any floaters that manage to stick around long enough would have higher success in competition for breeding spots relative to existing group members. How realistic is this? I think this might be driving the kin selection-only results that defense doesn't evolve without group augmentation (one of the two main ways). Any subordinates (which are mainly zero in the no GA, according to the SI tables; this assumes N=breeder+subordinates, but this isn't explicit anywhere) would be outcompeted by floaters after a short time (since they evolve high H and floaters don't), which in turn increases the benefit of dispersal, explaining why it is so high. Is this parameter regime reasonable? My understanding is that floaters often aren't usually high resource holding potential individuals (either b/c high RHP ones would get selected out of the floater population by establishing territories or b/c floating isn't typically a thriving strategy, given that many resources are tied to territories). In this case, the assumption seems to bias things towards the floaters and against subordinates to inherit territories. This should be explored either with a higher mortality rate for floaters and/or a lower DV increase, or both.

      When it comes to floaters replacing dead breeders, the authors say a bit more, but again, the actual equation for the scramble competition (which only appears as "scramble context" in the notation table) is not given. Is it simply proportional to R_i/\sum_j R_j ? Or is there some other function used? What are the actual numbers of floaters per breeding territory that emerge under different parameter values? These are all very important quantities that have to be described clearly.

      Although it is true that dispersers do not work when they are floaters, they may later help if they immigrate into a group as a subordinate. Consequently, immigrant subordinates have no inherent competitive advantage over natal subordinates (as step 2.2. “Join a group” is followed by step 3. “Help”, which occurs before step 5. “Become a breeder”). Nevertheless, floaters can potentially outcompete subordinates of the same age if they attempt to breed without first queuing as a subordinate (step 5) when subordinates are engaged in work tasks. We believe that this assumption is realistic and constitutes part of the costs associated with work tasks. However, floaters are at a disadvantage for becoming a breeder because: (1) floaters incur higher mortality than individuals within groups (Eq. 3); and (2) floaters may only attempt to become breeders in some breeding cycles (versus subordinate group members, who are automatically candidates for an open breeding position in the group in each cycle). Therefore, due to their higher mortality, floaters are rarely older than individuals within groups, which heavily influences their dominance value and competitiveness. Additionally, any competitive advantage that floaters might have over other subordinate group members is unlikely to drive the kin selection-only results because subordinates would preferably choose defense tasks instead of work tasks so as not to be at a competitive disadvantage compared to floaters.

      Regarding the concern that floaters are not usually high resource holding potential (RHP) individuals and that our assumptions might therefore be unrealistic, empirical work in a number of species has shown that dispersers are not necessarily those of lower RHP or of lower quality. In fact, according to the ecological constraints hypothesis, one might predict that high quality individuals are the ones that disperse because only individuals in good condition (e.g., larger body size, better energy reserves) can afford the costs associated with dispersal (Cote et al., 2022). To allow differences in dispersal propensity depending on RHP, we extended our model in the Supplemental Materials by incorporating a reaction norm of dispersal based on their rank (D = 1 / (1 + exp(β_R * R - β_0))) under the section “Dominance-dependent dispersal propensities” and now referenced in L195. This approach allows individuals to adjust their dispersal strategy to their competitiveness and to avoid kin competition by remaining as a subordinate in another group. Results show that the addition of the reaction norm of dispersal to rank did not qualitatively influence the results described in the main text.
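
      A minimal sketch of such a rank-dependent dispersal reaction norm, assuming the exponent reads β_R * R - β_0 and that R is the dominance rank. The function and parameter names are hypothetical and are not taken from the model code; the clamp on the exponent is an added numerical safeguard.

```python
import math


def dispersal_probability(rank, beta_r, beta_0):
    """Logistic reaction norm D = 1 / (1 + exp(beta_r * rank - beta_0)).

    Assumed form: beta_r is the slope on dominance rank and beta_0 the
    intercept (hypothetical names, not the authors' identifiers).
    """
    z = beta_r * rank - beta_0
    # Clamp the exponent so math.exp cannot overflow for extreme inputs.
    z = max(-700.0, min(700.0, z))
    return 1.0 / (1.0 + math.exp(z))
```

      Under this assumed sign convention, a positive beta_r makes the probability of dispersing decline with rank, and beta_r = 0 gives a rank-independent probability set by beta_0 alone.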

      We also added “number of floaters” present in the whole population to the summary tables as requested.  

      As a side note, the “scramble context” we mention was an additional implementation in which we made rank independent of age. However, since the main conclusions remained unchanged, we decided to remove it for simplicity from the final manuscript, but we forgot to remove it from Table 1 before submission.  

      I also think the asexual reproduction with small mutations assumption is a fairly strong one that also seems to bias the model outcomes in a particular way. I appreciate that the authors actually measured relatedness within groups (though if most groups under KS have no subordinates, that relatedness becomes a bit moot), and also eliminated it with their ingenious swapping-out-subordinates procedure. The fact remains that unless they eliminate relatedness completely, average relatedness, by design, will be very high. (Again, this is also affected by how the fate of the dispersers is determined, but clearly there isn't a lot of joining happening, just judging from mean group sizes under KS only.) This is, of course, why there is so much helping evolving (even if it's not defensive) unless they completely cut out relatedness.

      As we showed in the Supplementary Tables and the section on relatedness in the SI (“Kin selection and the evolution of division of labor”), high relatedness does not appear to explain our results. In evolutionary biology generally and in game theory specifically (with the exception of models on sexual selection or sex-specific traits), asexual reproduction is often modelled because it reduces unnecessary complexity. To further study the effect of relatedness on kin structures more closely resembling those of vertebrates, however, we created an additional “relatedness structure level”, where we shuffled half of the philopatric offspring using the same method used to remove relatedness completely, effectively reducing within-group relatedness structure by half. As shown in the new Figure S3, the conclusions of the model remain unchanged.
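
      One way such a partial shuffle could be implemented is sketched below. This is an illustration under stated assumptions, not the procedure in the model code: groups are represented as lists of offspring identifiers, a fixed fraction of all philopatric offspring is drawn at random, and the drawn individuals are permuted among the positions they vacated so that group sizes stay unchanged. The function name and data layout are hypothetical.

```python
import random


def shuffle_philopatric(groups, fraction=0.5, rng=None):
    """Permute a random fraction of philopatric offspring across groups.

    groups: list of lists of offspring identifiers (hypothetical layout).
    Returns new lists; group sizes are preserved and the input is not mutated.
    """
    if rng is None:
        rng = random.Random()
    # Every (group index, position) pair occupied by a philopatric offspring.
    slots = [(g, i) for g, members in enumerate(groups)
             for i in range(len(members))]
    chosen = rng.sample(slots, int(fraction * len(slots)))
    moved = [groups[g][i] for g, i in chosen]
    rng.shuffle(moved)  # reassign the chosen individuals at random
    result = [list(members) for members in groups]
    for (g, i), individual in zip(chosen, moved):
        result[g][i] = individual
    return result
```

      In this sketch a shuffled individual can land back in its natal group by chance, so the realized reduction in relatedness is at most the requested fraction; fraction=1.0 corresponds to shuffling all philopatric offspring.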

      Finally, the "need for division of labor" section is also unclear, and its construction also would seem to bias things against division of labor evolving. For starters, I don't understand the rationale for the convoluted way the authors create an incentive for division of labor. Why not implement something much simpler, like a law of minimum (i.e., the total effect of helping is whatever the help amount for the lowest value task is) or more intuitively: the fecundity is simply a function of "work" help (draw Poisson number of offspring) and survival of offspring (draw binomial from the fecundity) is a function of the "defense" help. As it is, even though the authors say they require division of labor, in fact, they only make a single type of help marginally less beneficial (basically by half) if it is done more than the other. That's a fairly weak selection for division of labor, and to me it seems hard to justify. I suspect either of the alternative assumptions above would actually impose enough selection to make division of labor evolve even without group augmentation.

      In nature, multiple tasks are often necessary to successfully rear offspring. We simplify this principle in the model by maximizing reproductive output when both tasks are carried out to a similar extent, allowing for some flexibility from the mean. We added to the manuscript “For example, in many cooperatively breeding birds, the primary reasons that individuals fail to produce offspring are (1) starvation, which is mitigated by the feeding of offspring, and (2) nest depredation, which is countered by defensive behavior. Consequently, both types of tasks are necessary to successfully produce offspring, and focusing solely on one while neglecting the other is likely to result in lower reproductive success than if both tasks are performed by individuals within the group.”

      Regarding making fecundity a function of work tasks and offspring survival as a function of defensive tasks, these are actually equivalent in model terms, as it’s the same whether breeders produce three offspring and two die, or if they only produce one. This represents, of course, an oversimplification of the natural context, where breeding unsuccessfully is more costly (in terms of time and energy investment) than not breeding at all.
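
      The counting part of this argument can be checked for the reviewer's proposed scheme, under the assumption that fecundity is a Poisson draw with mean lam and that each offspring survives independently with probability p. The number of survivors is then itself Poisson with mean lam * p (Poisson thinning). The sketch below compares the two distributions numerically; the function names are illustrative, and the sum is truncated at n_max, which is adequate only for moderate lam.

```python
import math


def poisson_pmf(k, lam):
    """P(K = k) for K ~ Poisson(lam), computed on the log scale (lam > 0)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def survivors_pmf(k, lam, p, n_max=100):
    """P(k survivors) when N ~ Poisson(lam) offspring are produced and each
    survives independently with probability p (sum truncated at n_max)."""
    total = 0.0
    for n in range(k, n_max + 1):
        binom = math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        total += poisson_pmf(n, lam) * binom
    return total


# Producing three offspring on average with one-third survival gives the same
# distribution of survivors as producing one offspring on average.
difference = abs(survivors_pmf(1, 3.0, 1.0 / 3.0) - poisson_pmf(1, 1.0))
```

      The check concerns only the number of surviving offspring. It does not capture the additional time and energy costs of unsuccessful breeding, nor whether work and defense enter the mean multiplicatively or through some other function.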

      Overall, this is an interesting model, but the simulation is not adequately described or explored to have confidence in the main conclusions yet. Better exposition and more exploration of alternative assumptions and parameter space are needed.

      We hope that our clarifications and extension of the model satisfy your concerns.  

      Reviewer #2 (Public review):

      Summary:

      This paper formulates an individual-based model to understand the evolution of division of labor in vertebrates. A main conclusion of the paper is that direct fitness benefits are the primary factor causing the evolution of vertebrate division of labor, rather than indirect fitness benefits.

      Strengths:

      The paper formulates an individual-based model that is inspired by vertebrate life history. The model incorporates numerous biologically realistic details, including the possibility to evolve age polyethism where individuals switch from work to defence tasks as they age or vice versa, as well as the possibility of comparing the action of group augmentation alone with that of kin selection alone.

      Weaknesses:

      The model makes assumptions that restrict the possibility that kin selection leads to the evolution of helping. In particular, the model assumes that in the absence of group augmentation, subordinates can only help breeders but cannot help non-breeders or increase the survival of breeders, whereas with group augmentation, subordinates can help both breeders and non-breeders and increase the survival of breeders. This is unrealistic as subordinates in real organisms can help other subordinates and increase the survival of non-breeders, even in the absence of group augmentation, for instance, with targeted helping to dominants or allies. This restriction artificially limits the ability of kin selection alone to lead to the evolution of helping, and potentially to division of labor. Hence, the conclusion that group augmentation is the primary factor driving vertebrate division of labor appears forced by the imposed restrictions on kin selection. The model used is also quite particular, and so the claimed generality across vertebrates is not warranted.

      We would like to thank the reviewer for the in-depth review. We respond to these and other comments below.  

      I describe some suggestions for improving the paper below, more or less in the paper's order.

      First, the introduction goes to great lengths trying to convince the reader that this model is the first in this or another way, particularly in being only for vertebrates, as illustrated in the abstract where it is stated that "we lack a theoretical framework to explore the conditions under which division of labor is likely to evolve" (line 13). However, this is a risky and unnecessary motivation. There are many models of division of labor and some of them are likely to be abstract enough to apply to vertebrates even if they are not tailored to vertebrates, so the claims for being first are not only likely to be wrong but will put many readers in an antagonistic position right from the start, which will make it harder to communicate the results. Instead of claiming to be the first or that there is a lack of theoretical frameworks for vertebrate division of labor, I think it is enough and sufficiently interesting to say that the paper formulates an individual-based model motivated by the life history of vertebrates to understand the evolution of vertebrate division of labor. You could then describe the life history properties that the model incorporates (subordinates can become reproductive, low relatedness, age polyethism, etc.) without saying this has never been done or that it is exclusive to vertebrates; indeed, the paper states that these features do not occur in eusocial insects, which is surprising as some "primitively" eusocial insects show them. So, in short, I think the introduction should be extensively revised to avoid claims of being the first and to make it focused on the question being addressed and how it is addressed. I think this could be done in 2-3 paragraphs without the rather extensive review of the literature in the current introduction.

      We have revised the novelty statements in the Introduction by more clearly emphasizing how our model addresses gaps in the existing literature. More details are provided in the comments below.

      Second, the description of the model and results should be clarified substantially. I will give specific suggestions later, but for now, I will just say that it is unclear what the figures show. First, it is unclear what the axes in Figure 2 show, particularly for the vertical one. According to the text in the figure axis, it presumably refers to T, but T is a function of age t, so it is unclear what is being plotted. The legend explaining the triangle and circle symbols is unintelligible (lines 227-230), so again it is unclear what is being plotted; part of the reason for this unintelligibility is that the procedure that presumably underlies it (section starting on line 493) is poorly explained and not understandable (I detail why below). Second, the axes in Figure 3 are similarly unclear. The text in the vertical axis in panel A suggests this is T, however, T is a function of t and gamma_t, so something else must be being done to plot this. Similarly, in panel B, the horizontal axis is presumably R, but R is a function of t and of the helping genotype, so again some explanation is lacking. In all figures, the symbol of what is being plotted should be included.

      We added the symbols of the variables to the Figure axes to increase clarity. In Figure 3A, we corrected the subindex t in the x-axis; it should be subindex R (reaction norm to dominance rank instead of age). As described in Table 1, all values of T, H and R are phenotypically expressed values. For instance, T values are the phenotypically expressed values from the individuals in the population according to their genetic gamma values and their current dominance rank at a given time point.  

      Third, the conclusions sound stronger than the results are. A main conclusion of the paper is that "kin selection alone is unlikely to select for the evolution of defensive tasks and division of labor in vertebrates" (lines 194-195). This conclusion is drawn from the left column in Figure 2, where only kin selection is at play, and the helping that evolves only involves work rather than defense tasks. This conclusion follows because the model assumes that without group augmentation (i.e., xn=0, the kin selection scenario), subordinates can only help breeders to reproduce but cannot help breeders or other subordinates to survive, so the only form of help that evolves is the least costly, not the most beneficial as there is no difference in the benefits given among forms of helping. This assumption is unrealistic, particularly for vertebrates where subordinates can help other group members survive even in the absence of group augmentation (e.g., with targeted help to certain group members, because of dominance hierarchies where the helping would go to the breeder, or because of alliances where the helping would go to other subordinates). I go into further details below, but in short, the model forces a narrow scope for the kin selection scenario, and then the paper concludes that kin selection alone is unlikely to be of relevance for the evolution of vertebrate division of labor. This conclusion is particular to the model used, and it is misleading to suggest that this is a general feature of such a particular model.

      The scope of this paper was to study division of labor in cooperatively breeding species with fertile workers (i.e., primarily vertebrates), in which help is exclusively directed towards breeders to enhance offspring production (i.e., alloparental care). Our focus is in line with previous work in most other social animals, including eusocial insects and humans, which emphasizes how division of labor maximizes group productivity. Other forms of “general” help are not considered in the paper, and such forms of help are rarely considered in cooperatively breeding vertebrates or in the division of labor literature, as they do not result in task partitioning to enhance productivity.

      Overall, I think the paper should be revised extensively to clarify its aims, model, results, and scope of its conclusions.

      Recommendations for the authors: 

      Reviewer #1 (Recommendations for the authors):

      I reserved this section for more minor comments, relating to clarity and a general admonition to give us more detail and exploration of some basic population genetic quantities.

      Another minor point, although depending on whether I assume right or wrong, it could be major: I am not entirely sure that dispersers help in the groups they join as helpers, because of line 399, which states specifically that individuals who do remain in natal territories do. But I assume dispersers help (elsewhere, the authors state helping is not conditional on relatedness to the breeder). Otherwise, this model becomes even weirder for me. Either way, please clarify.

      Apologies if this was not clear. Immigrants that join a group as subordinates (that is, dispersers from another group) help and queue for a breeding position, as does any natal subordinate born into the group. We rephrased the sentence to “Subordinate group members, either natal or immigrants to the group, […]”

      More generally, in simulation studies like this, there can be interactions between the strength of selection (which affects overall genetic variation maintained in the population), population size, and mutation rate/size, which can affect, for example, relatedness values. None of these quantities is explored here (and their interactions are not quantified), so it is not possible to evaluate the robustness of any of these results.

      Thank you for your comments about the parameter landscape. It is important to point out that variations in the mutation rate do not qualitatively affect our results, as this is something we explored in previous versions of the model (not shown). Briefly, we find that variations in the mutation rates only alter the time required to reach equilibrium. Increasing the step size of mutation diminishes the strength of selection by adding stochasticity and reducing the genetic correlation between offspring and their parents. Population size could, in theory, affect our results, as small populations are more prone to extinction. Since this was not something we planned to explore in the paper directly, we specifically chose a large population size, or better said, a large number of territories (i.e. 5000) that can potentially host a large population.  

      The authors also never say how the amount of help is actually determined. There is the evolved helping variable, and there is also the evolved reaction norm. I assume that the actual amount of help of each type is given by the product of T (equation 1) and H (for defense) and (1-T) and H (for work), but this should be stated explicitly.

      Help provided is an interaction between H (total effort) and T (proportion of total effort invested in each type of task). To clarify the distinction between these two processes, we have now added “Hence, the gene α regulates the amount of help expressed, while the genes γ determine which specific helping tasks are performed at different time points in the breeding cycle”.  
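
      A minimal sketch of this partition, assuming the interaction is the simple product the reviewer describes, with T the share of total effort devoted to defense. The function and argument names are hypothetical and are not taken from Individual.cpp.

```python
def partition_help(total_help, task_share):
    """Split total helping effort H between the two task types.

    total_help: H, the total amount of help expressed (non-negative).
    task_share: T, the proportion of H invested in defense (0 to 1);
                the remainder, 1 - T, goes to work.
    Returns (defense_help, work_help).
    """
    if total_help < 0.0:
        raise ValueError("total_help must be non-negative")
    if not 0.0 <= task_share <= 1.0:
        raise ValueError("task_share must lie between 0 and 1")
    defense_help = task_share * total_help
    work_help = (1.0 - task_share) * total_help
    return defense_help, work_help
```

      For example, partition_help(2.0, 0.25) assigns 0.5 units of effort to defense and 1.5 units to work; a subordinate with H = 0 contributes nothing to either task regardless of T.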

      It is also weird that after introducing the T variable as a function of age, Figure 3 actually depicts it as a function of dominance value.

      Thank you for pointing out an error in Eq. 1. This inequality was indeed written incorrectly in the paper (but is correct in the model code); it is dominance rank instead of age (see code in Individual.cpp lines 99-119). We corrected this mistake throughout the manuscript.

      What is "scramble context"?

      “Scramble context” was an additional implementation that we decided to remove from the final manuscript, but we forgot to remove from Table 1 before submission. We have now removed it from the table.

      Reviewer #2 (Recommendations for the authors):

      Some specific comments:

      (1) L 31: "All theoretical..." These absolute statements are risky and unnecessary.

      Rephrased to “To date, most theoretical and empirical work…”

      (2) L 46: I believe Tom Wenseleers has published on the evolution of division of labor with reproductive workers and high within-colony conflict.

      Tom Wenseleers has indeed produced some models on the evolution of cooperation in social insects where some workers may reproduce. However, these models focus on the relevance of relatedness and policing selecting for a reduction in within-group conflict and the evolution of reproductive division of labor. Our model focuses instead on division of labor among workers (helpers). We have rephrased this section to “task specialization is linked to sterility and where conflict of interest is generally low” to account for species of social insects in which variation in relatedness between group members and higher levels of reproductive conflict may arise. We also cited one of his papers.

      (3) L 57: Again, unnecessary categorical statements.

      Rephrased to “Although a great deal of recent empirical work highlights the importance of direct benefits in the evolution of cooperative breeding behavior in vertebrates [21–24], we lack understanding on the joint influence of direct and indirect fitness benefits in the evolution of division of labor.”

      (4) L 67: This is said to be a key distinction, but in the paper, such a key role is not clearly shown. This and other tangential points are unnecessary to keep the introduction to the point.

      The different fitness costs of different tasks is the basis of our model on division of labor. Therefore, this is a key distinction and basis from which to describe different tasks in the model. We have left this sentence unchanged.

      (5) L 61-73: "In vertebrates, however, helpers may obtain fitness benefits directly via reproduction..." Some social insects may do so as well. It seems unnecessary and incorrect to say that vertebrate sociality is fundamentally different from invertebrate one. I think it is sufficiently interesting to say this work aims to understand vertebrate division of labor, by explicitly modeling aspects of its life history, without saying this can't happen in invertebrates or that no other model has ever done anything like it.

      Our point is not that workers in some social insects cannot obtain direct fitness benefits, but that previous models focusing on the colony's reproductive outcome are a good approximation only for eusocial insects with sterile workers. However, to make this clearer we have added “In vertebrates and social insect with fertile workers, however, helpers may obtain fitness benefits directly via […]”.

      (6) L 74-86: By this point, the introduction reads like a series of disconnected comments without a clear point.

      In L60 we added: “Understanding how direct and indirect benefits interact is particularly important in systems where individuals may differentially bear the fitness costs of cooperation”. By adding this sentence, we emphasize our focus on the largely unexplored direct fitness benefits and costs, as well as their interaction with indirect fitness. We then proceed to explain why it is crucial to consider that tasks have varying direct fitness costs and how the fitness benefits derived from cooperation change with age and resource-holding potential. These elements are essential for studying the division of labour in species with totipotent workers.

      (7) L 87: This sentence gives a clear aim. It would be clearer if the introduction focused on this aim.

      With the new sentence added in L60 (see previous comment), we bring the focus to the main question that we are trying to address in this paper earlier in the Introduction.  

      (8) L 88: "stochastic model" should be changed to "individual-based model".

      Done.

      (9) L 104: "limited number" is unclear. Say a fixed finite number, or something specific.

      Done.

      (10) L 105: "unspecified number" is unclear. Say the number of subordinates emerges from the population dynamics.

      Changed to “variable number of subordinate helpers, the number of which is shaped by population dynamics, with all group members capable of reproducing during their lifetime”.

      (11) L 112: "Dispersers" is used, but in the previous lines 107-109, the three categories introduced used different terms. Those three terms introduced should be used consistently throughout the paper, without using two or more terms for one thing.

      We use the term “disperser” to describe individuals that disperse from their natal group.

      Dispersers can assume one of three roles: (1) they can join another group as "subordinates"; (2) they can join another group as "breeders" if they successfully outcompete others; or (3) they can remain as "floaters" if they fail to join a group. "Floaters" are individuals who persist in a transient state without access to a breeding territory, waiting for opportunities to join a group in an established territory. We rephrased the sentence to “Dispersers cannot reproduce without acquiring a territory (denoted here as floaters)”. This was also clarified in other instances where the term “dispersers” was used (e.g. L407). In other instances where this might not have been clear, we replaced “dispersers” with “floaters”.

      (12) L 112: "(floaters)" Unclear parenthesis.

      See previous comment.  

      (13) L 115: There should be a reference to Methods around here.

      Added a reference to Figure 1.

      (14) L 117: To be clearer, say instead that dominance value is a linearly increasing function of age as a proxy of RHP and a linearly decreasing function of help provided due to the costs of working tasks. And refer to equation 2.

      Rephrased to “We use the term dominance value to designate the competitiveness of an individual compared to other candidates in becoming a breeder, regardless of group membership, that increases as a function of age, serving as a proxy for resource holding potential (RHP), and decreases as a function of help provided, reflecting costs to body condition from performing working tasks (Eq. 2).” We did not include “linearly” to keep it simpler, since it is clear from Eq. 2, which is now referenced here.  

      (15) L 119: "Subordinate helpers". As all subordinates are helpers, the helper qualifier is confusing.

      Subordinates are not necessarily helpers, as they can evolve help values of 0, hence, why we make it explicit here.

      (16) L 119: "choose". This terminology may be misleading. The way things are implemented in the model is that individuals are assigned a task depending on their genetic traits gamma. Perhaps it would be better to use a less intentional term, like perform one of two tasks.

      We changed “choose between two” to “engage in one of two”, which has fewer connotations of intentionality.

      (17) L 124: "Subordinates can [...] exhibit task specialization that [...] varies with their dominance value". It should be that it varies with age.

      Apologies. The equation was wrong; it does vary with dominance value. We corrected it accordingly.

      (18) L 133: "maximised" This is apparently important for the modelling procedure, but it is completely unclear what it means. Equation 4 comes out of nowhere, and it is said that such an equation is the maximum amount of help that can affect fecundity. Why? What does this mean? If there is something that is maximised, this should be proven. This value is then used for something (line 507), but it is unclear why or what it is used for (it says "we use the value of Hmax instead" without saying what for, no justification for the listed inequalities are given, and the claimed maximisation of an unspecified variable at those H values is not proven). Moreover, the notation in this section is also unclear: what are the sums over? Also, Hdefence and Hwork should vary over the index that is summed over, but the notation suggests that those quantities don't vary.

      We changed “maximized” to “greatest”, and we added a clarification of the rationale behind maximizing the impact of help on the breeder’s productivity: “For example, in many cooperatively breeding birds, the primary reasons that breeders fail to produce offspring are (1) starvation, which is mitigated by the feeding of offspring, here considered as a work task, and (2) nest depredation, which is countered by defensive behavior. Consequently, both types of tasks are often necessary for successful reproduction, and focusing solely on one while neglecting the other is likely to result in lower reproductive success than if both tasks are performed by helpers within the group.”

      We now also clarify that the sums are for help given within a group (L 507), and added indexes to the equations.

      (19) L 152: "habitat saturation" How is this implemented? How is density dependence implemented? Or can the population size keep increasing indefinitely? It would be good to plot the population size over time, the group size over time, and the variance in group size over time. This could substantiate later statements about enhancing group productivity and could all be shown in the SI.

      Habitat saturation emerges from population dynamics due to the limited availability of territories and the fluctuating number of individuals, leading highly productive environments to experience habitat saturation. Because the number of group members is not restricted in our model, the population could theoretically increase indefinitely. However, this is not observed in the results presented here, as we selected parameter landscapes that stabilize population numbers. We confined our parameters to those where the population neither increased indefinitely nor collapsed, as we did not incorporate density-dependent mortality for simplicity. Consequently, the group size in the SI, where the standard deviation is already included, closely represents group size at any other given time during equilibrium.

      L 336: we changed “environments with habitat saturation” to “environments that lead to habitat saturation”, to increase clarity.

      (20) L 152: "lifecycle". Rather than the lifecycle, the figure describes the cycle of events in a single time step. The lifecycle (birth to death) goes over multiple time steps (as individuals live over multiple steps). So this figure shouldn't be called a life cycle.

      We changed “lifecycle” to “breeding cycle”.

      (21) L 156: "generation". This is not a generation but a time step.

      We changed “generation” to “breeding cycle”.

      (22) L 157: "previous life cycle" would mean that the productivity of a breeder depends on the number of helpers that its parents had, which is not what is meant.

      We changed “lifecycle” to “breeding cycle”.

      (23) L 158: "Maximum productivity is achieved when different helping tasks are performed to a similar extent." Again, unclear why that is the case.

      We added a clarification on this, see response to comment 18.  

      (24) L 160: "Dispersers/floaters". Use just one term for a single thing.

      See response to comment 11.   

      (25) L 162: "dispersal costs". I don't recall these being described in Methods.

      Individuals that disperse do not enjoy the protection of living in a territory and within a group of other individuals, so they have a higher mortality risk, described in Eq. 3.3 (negative values in the exponential part of the equation increase survival). The cost of dispersal is the same as that borne by individuals that remain as floaters at a given time step.

      (26) L 164: "generation" -> time step.

      We changed this to “breeding cycle”.  

      (27) L 170: "Our results show that division of labor initially emerges because of direct fitness benefits..." This is a general statement, but the results are only particular to the model. So this statement and others in the manuscript should be particular to the model. Also, Figure 2 doesn't say anything about what evolves "initially" as it only plots evolutionary equilibria.

      We rephrased this statement to “Our results suggest that voluntary division of labor involving tasks with different fitness costs is more likely to emerge initially because of direct fitness benefits”, to more accurately represent the conditions under which we modeled the division of labor.  

      Our reference to “initially” is regarding group formation (family groups versus aggregations of unrelated individuals or a mix). This is shown in the comparison between the different graphs at equilibrium. The initial state of the simulation is that all individuals disperse and do not cooperate.  

      (28) L 171: "but a combination of direct and indirect fitness benefits leads to higher rates and more stable forms of division of labor". What do you mean by "higher rates and more stable forms of division of labor"? Say how division of labor is shown in the figure (with intermediate T?).

      Yes, intermediate values of T show division of labor if γR ≠ 0. This is described under the section “The role of dominance in task specialization”. We added “with intermediate values suggesting a division of labor” to the Figure 2 legend.  

      (29) L173-175: "as depicted in Figure 2, intermediate values of task specialization indicate in all cases age/dominance-mediated task specialization (γt ≠ 0; Table 1) and never a lack of specialization (γt = 0; Table 1)". This sentence is unclear and imprecise. Does this sentence want to say that in Figure 2, all plots with intermediate values of T involve gamma t different from zero? If so, just say that.

      Rephrased to: “In Figure 2, all plots depicting intermediate values of T exhibit non-zero γR values and, hence, division of labor”.

      (30) L179-180: "forms of help that impact survival never evolve under any environmental condition when only kin selection occurs". This is misleading because under the KS scenario, help cannot positively impact survival in this model, so they never evolve.

      Help cannot affect survival but could potentially affect group persistence. If helpers increase breeder productivity and offspring remain philopatric and queue for the breeding position, then they will receive help from related individuals.   

      (31) L 210: "initially". What do you mean by that?

      Help only evolves in our model in family groups, which may then open the door for the evolution of help in mixed-kin groups. Therefore, we use “initially” to refer to the ancestral group structure that likely led to cooperation under benign environmental conditions. We rephrased this section to “in more benign (and often highly productive) environments that lead to habitat saturation, help likely evolved initially in family groups, and defensive tasks are favored because competition for the breeding position is lower under kin selection.”

      (32) L 212: "kin selection is achieved". What does that mean?

      Rephrased to “kin selection acts not only by selecting subordinates in their natal group to increase the productivity of a related breeder […]”

      (33) L 216: "division of labor seems to be more likely to evolve in increasingly harsh environments". Say in parentheses where this is shown.

      Added.  

      (34) L 218: "help evolves in benign environments". I don't see where this is shown. Figure 2 doesn't show that H is higher with lower m (e.g., in KS+GA column).

      Help does not evolve in benign environments under only direct fitness benefits derived from group augmentation (shown in Figure 2).  

      (35) L 225: "y-axis" should be "vertical axis", as y has another meaning in the model.

      Done.

      (36) L 226: "likelihood". Here and throughout, "likelihood" should be changed to probability. Likelihood means something else.

      Thank you for the advice. We have corrected this throughout the manuscript.

      (37) L 236: "the slope of the reaction norm for the dominance value in task specialization".

      Unclear. Clearer to say: the rate at which individuals shift from defense to work as they age.

      The important part is not so much the rate but the direction, that is, from work task to defense (or vice versa) as their rank increases. Changed to “the direction and rate of change in task specialization with dominance”.

      (38) L 257: "(task = 0; cost to dominance value)," This seems out of place.

      This aims to clarify that work tasks have a cost to dominance, while defense tasks have a cost to survival. This is particularly relevant in this model since different helping tasks are defined by their fitness costs.

      (39) L 258: "increase"-> "increase with age".

      Added “with dominance”.

      (40) L 262: "division of labor equilibria" What is that?

      Changed to “at equilibrium when division of labor evolves”

      (41) L 268: "Our findings suggest that direct benefits of group living play a driving role in the evolution of division of labor via task specialization in species with totipotent workers". This is a very general statement, but the results are much more circumscribed. First, the model is quite specific by assuming that, in the absence of group augmentation (xn=0), indirect fitness benefits can only be given to breeders (Equation 5) but not to other subordinates (Equations 2, 3.1). This is unrealistic, particularly for vertebrates, and reduces the possibility that indirect fitness benefits play a role.  

      As previously discussed, the scope of this paper was to study division of labor in cooperatively breeding species with fertile workers in which help is exclusively directed towards breeders to enhance offspring production through alloparental care. Other forms of “general” help do not result in task partitioning to enhance productivity.

      Second, the difference in costs of work and defense are what drive the evolution of "division of labor" (understood as intermediate T in case this is what the authors mean) in the KS scenario, but the functional forms of those two costs are quite specific and not of the same form, so these functions may bias the results found. Specifically, R is an unbounded linear function of work and the effect of this function becomes weaker as the individual ages due to the weakening force of selection with age (Equation 2) whereas Sh is a particular bounded nonlinear function of defense (Equation 3.1). These differences may tend to make the effect of Sh stronger due to the particular functions chosen.  

      The difference in costs is inherent to the nature of the different tasks (work versus defense): while survival is naturally bounded, with death as the lower bound, dominance costs are potentially unbounded, as they are influenced by dynamic social contexts and potential competitors. Therefore, we believe that the model’s cost structure is not too different from that in nature.  

      Third, no parameter sweep is given to see to what extent these results hold across the many parameters involved. So, in summary, the discussion should at least reflect that the results are of a restricted nature rather than giving the impression that they are of the suggested level of generality.

      During the exploratory phase of the model development, various parameters and values were assessed. However, the manuscript only details the ranges of values and parameters where changes in the behaviors of interest were observed, enhancing clarity and conciseness. For instance, variation in yh (the cost of help on dominance when performing “work tasks”) led to behavioral changes similar to those caused by changes in xh (the cost of help in survival when performing “defensive tasks”), as both are proportional to each other. Specifically, since an increase in defense costs raises the proportion of work relative to defense tasks, while an increase in the costs of work task has the opposite effect, only results for the variation of xh were included in the manuscript to avoid redundancy. Added to Table 1: “To maintain conciseness, further exploration of the parameter landscape was not included in the manuscript”.

      (42) L 270: "in eusocial insects often characterized by high relatedness and reproductive inhibition, sterile workers acquire fitness benefits only indirectly". This is misleading. Sterile workers of any taxa, be it insects or vertebrates, can only acquire fitness benefits indirectly as they are sterile, but eusocial insects involve not only sterile workers.

      Rephrased to “In contrast, in eusocial species characterized by high relatedness and permanent worker sterility, such as most eusocial insects, workers acquire fitness benefits only indirectly”. In any case, permanent sterility only occurs in eusocial invertebrates; in vertebrates with reproductive inhibition, sterility is only temporary and context-dependent. Therefore, in vertebrates, sterile workers may potentially obtain direct fitness benefits if the social context changes, as is the case in naked mole-rats.

      (43) L 273: "Group members in eusocial species are therefore predicted to maximize colony fitness due to the associated lower within-group conflict". Again, this is incorrect. Primitively eusocial insects have high conflict.

      We added “Group members in such eusocial species” to clarify that we are not referring here to primitively eusocial species but those with permanent sterile workers.  

      (44) L 277: "when the benefits of cooperation are evenly distributed among group members". In this model, the benefits of cooperation are not evenly distributed among group members: breeders reproduce, but subordinates don't.

      Subordinates may reproduce if they become breeders later in life. However, subordinates also benefit from cooperation as subordinates directly (greater survival in larger groups), and indirectly if they are related to the breeder. Here we refer to the first one, and we expand on that in the following sentence.  

      (45) L 280: "survival fitness benefits derived from living in larger groups seem to be key for the evolution of cooperative behavior in vertebrates [22, 63], and may also translate into low within-group conflict. This suggests that selection for division of labor in vertebrates is stronger in smaller groups". I don't see how the previous sentence suggests this. The paper does not present results to support this statement (i.e., no selection gradients in smaller vs larger groups are shown).

      The benefits of living in a larger group entail diminishing returns, so individuals living in smaller groups benefit more from an increase in productivity and group size than those in larger groups.

      (46) L 284: "Our model demonstrates that vertebrates evolve a more stable division of labor". Where is that shown? How is "more stable" measured?

      Rephrased to “vertebrates are more likely to evolve division of labor”. This is shown in Figure 2, which illustrates that division of labor evolves under a wider range of environmental conditions and to a higher degree (intermediate values of T).

      (47) L 287: "direct fitness benefits in the form of group augmentation select more strongly for defensive tasks". Where is that shown? Establishing this would entail comparing selection gradients with direct fitness benefits of group augmentation and without them.

      In Figure 2, when we compare the GA column to the KS+GA column, we see that at equilibrium, more helpers choose defense tasks, especially when they are free to choose their preferred task (circles).

      (48) L 288: "kin selection alone seems to select only for work tasks." Again, this may be an artifact of the model assuming that helpers cannot increase non-breeders' fitness components except via group augmentation, and that defense tasks are inherently more costly than work tasks.

      As stated previously, we are studying task specialization in cooperative breeders where help is in the form of alloparental care (from allofeeding and egg care to defense from predators). We also assume that the costs are different, but whether one or the other is more costly depends on the relative context (e.g., a task can be more costly if it affects competitiveness in a very competitive environment). It is important to note that we name these tasks “work” and “defense” for practical reasons, but the focus of the paper is on tasks with different fitness costs, which, given their characteristics, may not fit so well under this terminology. While we acknowledge that most tasks have both kinds of fitness costs to a degree, here we focus on the main fitness costs of each kind of task (L430-436).

      (49) L 290: "are comparatively large". This sounds as if the tasks are large, which is presumably not what is meant.

      Rephrased to “costs to dominance value and to the probability of attaining a breeding position are comparatively larger than survival costs.”

      (50) L 298: "helpers are predicted to increase defensive tasks with age or rank, whereas in harsh environments, work tasks are predicted to increase with age or rank." Add parentheses referring to where this is shown.

      This is shown in Figure 3, but since this is described in the discussion, we did not add a reference to the figure. If the editor would like us to refer to figures here, we can (see also comments below relating to the same issue).

      (51) L 308: "the role of age and environmental harshness on the evolution of division of labor". What is the prediction? Simply, the role of age is an assumption, not a prediction.

      Rephrased to “the role of environmental harshness on the evolution of division of labor via age-dependent task specialization”.

      (52) L 315: "individuals shifting from work tasks such as foraging for food, digging, and maintaining the burrow system, to defensive tasks such as guarding and patrolling as individuals grow older and larger". Say in parentheses where this is predicted.

      This prediction comes from Figure 3, we do not reference it here since we are in the Discussion section.  

      (53) L 320: "Under these conditions, our model predicts the highest levels of task partitioning and division of labor." Where is this predicted? Add parentheses referring to where this is shown. As it is, it is not possible to check the validity of the statement.

      This prediction comes from Figure 2 column KS+GA, we do not reference it here since we are in the Discussion section. The results with references to the figures are found under the Results section. In the discussion, we reiterate the results already described and add some examples from real data that seem to confirm our predictions.  

      (54) L 322: "In line with our model predictions, larger and older helpers of this species invest relatively more in territory maintenance, whereas younger/smaller helpers defend the breeding shelter of the dominant pair to a greater extent against experimentally exposed egg predators". These predictions are neat, but are now very difficult to understand from the figures. Maybe at the bottom of 3A, you could add a diagram work->defense for negative gamma_t and defense->work for positive gamma_t (or whatever order it is).

      Done.

      (55) L 325: "Territory maintenance has been shown to greatly affect routine metabolic rates and, hence, growth rates [80], which directly translates into a decrease in the likelihood of becoming dominant and attaining breeding status, as predicted by our model." This seems to be an assumption, not a prediction.

      That is true. We removed: “as predicted by our model”.  

      (56) L 352: "controlled". This means something else.

      Changed to “addressed”.

      (57) L 356: "summary, our study represents the first theoretical model aimed at elucidating the potential mechanisms underlying division of labor between temporal non-reproductives via task specialization in taxa beyond eusocial organisms". Again, claiming to be the first is risky and unnecessary.

      Rephrased to “our study helps to elucidate”.

      (58) L 358: "Harsh environments, where individuals can obtain direct fitness benefits from group living, favor division of labor, thereby enhancing group productivity and, consequently, group size." I'm not sure about this conclusion as harsh environments (large m in Figure 2) also involve the evolution of no division of labor (from the triangles and circles that are zero in the right bottom panel) and perhaps more so than with less harsh environments (intermediate m). Incidentally, in the bottom right panel of Figure 2, do the two separate clusters of triangles and circles mean that there is some sort of evolutionary branching?

      Yes, there are two different equilibria for the same set of conditions. Although it is true that for m=0.3 less division of labor evolves when kin selection and group augmentation act together, this is not the case when only group augmentation takes place. In addition, we qualify m=0.2 as harsh, as opposed to benign (m=0.1), in which we observe the rise of habitat saturation. m=0.3 is then an extremely harsh environment, in which, in several instances, different regions of the parameter landscape cause population collapse (see figures in the Supplemental Material).

      (59) L 360: "Variation in the relative fitness costs of different helping tasks with age favors temporal polyethism". I don't see that this has been shown. Temporal polyethism evolves here whenever gamma_t evolves non-zero values. Figure 3A shows that non-zero gamma_t evolves with harsher environments, but I don't see what the "variation in relative fitness costs of different helping tasks" refers to.

      The evolved reaction norms of the model are towards different fitness costs depending on the task performed, since this is how we define the different types of tasks in the model.  

      (60) L 382: "undefined". Say variable. Undefined is something else.

      Undefined is more accurate, since we did not define how many subordinates there were per group, while “variable” could have been defined within a range, which was not the case in this model.  

      (61) L 390: "each genetic locus". Say earlier that each genetic trait is controlled by a single locus.

      Added.  

      (62) L 395: "complete" and "consistent" -> "certain".

      We changed one to “certain” and another to “absolute” to avoid using the same adjective twice in a sentence.  

      (63) L 396: What determines whether dispersers become subordinates or floaters? A trait? Or a fixed probability?

      We added “which is also controlled by the same genetic dispersal predisposition as for subordinates”.

      (64) L 412-413: "cycle". This should be a breeding step.

      Changed to “season” instead.

      (65) L 418: Say negatively impacts (it could also be positively impacts, which I guess is not what you mean).

      Done.

      (66) L 425: "a sample of floaters". Chosen how?

      Added “randomly drawn”.

      (67) L 426-428. But the equation in Table 1 indicates that all floaters compete for breeding spots, not a sample of floaters. This is not clear.

      The number of floaters sampled to try to breed at a given group is N<sub>f,b</sub> = f × N<sub>f</sub> / N<sub>b</sub> (Table 1).

      Therefore, N<sub>f,b</sub> is the sample size of floaters for a given open breeding position, and f is how many groups on average a floater attempts to access in each time step.  
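
      The sampling rule above can be illustrated with a minimal sketch. It assumes N_f is the total number of floaters, N_b the number of groups with a breeding position, and f the mean number of groups a floater attempts to access per time step. The function names, the rounding to a whole number, and the cap at the number of available floaters are illustrative assumptions, not the authors' implementation.

```python
import random


def floater_sample_size(f, n_floaters, n_groups):
    """Return N_f,b = f * N_f / N_b as a whole number of floaters.

    Rounding and the cap at n_floaters are assumptions made for this sketch.
    """
    if n_groups <= 0:
        raise ValueError("n_groups must be positive")
    expected = f * n_floaters / n_groups
    # A group cannot be contested by more floaters than exist.
    return min(n_floaters, round(expected))


def draw_competitors(floaters, f, n_groups, rng=random):
    """Randomly draw the floaters that compete for one open breeding position."""
    k = floater_sample_size(f, len(floaters), n_groups)
    # Sampling without replacement: each floater appears at most once per position.
    return rng.sample(floaters, k)
```

      For example, with f = 2, 1000 floaters, and 500 groups, each open breeding position would be contested by 4 randomly drawn floaters under these assumptions.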

      (68) L 432. In the figure, the breeding cycle is called a step, but here it is called a cycle. There should be a single term used throughout. Breeding is not really a cycle here (it doesn't involve multiple steps that are repeated cyclically), so it seems more appropriate to call this breeding steps or breeding seasons.

      Taking into account the previous comments, we changed the terms “generation” and “life cycle” to “breeding cycle”. We added “or seasons”.

      (69) L 439: "generations". What are generations here, as generations are overlapping? You probably mean time steps or something else.

      Changed to “breeding cycles”.

      (70) L 439: "equilibrium was reached". Presumably, equilibrium is reached only asymptotically, so some cutoff is implemented in practice. So maybe say explicitly what cutoff was implemented.

      As mentioned, we ran the model for 200’000 time steps, and if equilibrium was not reached for the phenotypic values, we ran the model for longer, with 400’000 time steps being the maximum at which all simulations reached equilibrium. In some cases, genetic values did not reach equilibrium within ranges in which they had no impact on phenotypic values, so these were disregarded when assessing whether equilibrium was reached.
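
      The run-length rule described above can be sketched as follows. The response does not state how equilibrium was judged, so the criterion used here (the means of two consecutive windows of a phenotypic value differing by less than a tolerance) is purely hypothetical, as are the names `run_until_equilibrium`, `step`, `window`, and `tol`. Only the 200’000 and 400’000 step limits come from the text.

```python
def run_until_equilibrium(step, initial_steps=200_000, max_steps=400_000,
                          window=10_000, tol=1e-3):
    """Advance a simulation until a phenotypic value stabilizes.

    `step` is a callable that advances the model by one time step and returns
    the current mean phenotypic value. The stopping criterion (window means
    differing by less than `tol`) is a hypothetical stand-in.
    Returns (steps_run, reached_equilibrium).
    """
    prev_mean = None
    steps_done = 0
    while steps_done < max_steps:
        total = 0.0
        for _ in range(window):
            total += step()
        steps_done += window
        mean = total / window
        # Only accept equilibrium once the minimum run length has elapsed.
        if (steps_done >= initial_steps and prev_mean is not None
                and abs(mean - prev_mean) < tol):
            return steps_done, True
        prev_mean = mean
    return steps_done, False
```

      Under this sketch, genetic values that do not affect the phenotype would simply not be passed to the check, which mirrors the statement that they were disregarded.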

      (71) L 452: "Even though individuals are likely to change the total amount of help given throughout their lives". Do you mean in real organisms or in the model? Say which. If it is in the model, it is not clear how.

      We added “in nature” to clarify that this was not the case in the model.  

      (72) L 455: "For more details on how individuals may adapt their level of help with age and social and environmental conditions, see [63]." Do you mean real individuals or in the model? Again, if it is in the model, it is unclear how this is possible and should be explained in this paper at least briefly rather than citing another one.

      We rephrased it to “How individuals in the model may adapt their level of help with age and social and environmental conditions has been described elsewhere.” We do not go into detail here because it is not within the scope of the paper, and those results have been described elsewhere.  

      (73) L 475: "helpers". Make terminology consistent throughout.

      All helpers are subordinates, but not all subordinates are helpers, as they may evolve no help. Since here we are describing those subordinates that do help, we use that terminology. We added “subordinate helpers” to clarify this further.  

      (74) L 476: "proportional". The dependence in Equation 1 is not "proportional to". Say something like "a survival probability (not rate) that decreases with the amount of help provided".

      Done.

      (75) L 482: "environmental"-> baseline, as defined first.

      Done.

      (76) L 486: "benefits". Can you briefly say in parentheses what those benefits are in real organisms? As in line 475, where you reminded the reader of survival costs due to predator defense.

      Added “such as those offered by safety in numbers or increased resource defense potential”.

      (77) L 494. "we first outline a basic model in which individuals". It is not clear what this sentence says, and the remainder of this section does not clarify it.

      We made two models for comparison: one where individuals can freely choose which task they prefer to perform, and another in which there is an increase in productivity when both kinds of tasks are performed to a similar extent at the group level. In the latter model, individuals may choose an unpreferred task at certain times during their lives to increase the effect of the help provided on the breeder’s (and group’s) productivity.

      We rephrased this section to “we first outline a basic model where individuals evolve their preferred helping task. Then we compare this to another model in which the breeder’s reproductive outcome is maximized when the group’s helping effort in each kind of tasks is performed to a roughly equal degree.”

      (78) L 496: "by performing both tasks". Sounds as if the breeder performs both tasks, not helpers.

      We changed to “when the group’s helping effort in each kind of tasks”.

      (79) L 497: "the maximum amount of cumulative help of each type (sigma Hmax) that can affect fecundity is given by Eq. 4:" This statement is imprecise. Presumably, what is meant is that this level of help maximises breeder productivity, as stated earlier in the paper. However, there is no proof that this level of help maximises breeder productivity, so this expression seems unjustified and it is unclear how it is used.

      This is a description of the model setup. As described later in the same section, the cumulative help of each type that will influence the breeder’s fecundity is at most Hmax. Therefore, it does represent the maximum amount of cumulative help of each type that can affect the breeder’s fecundity.

      (80) L 500: "reproduced" -> "reproduce".

      Done.  

      (81) L 503. Say here what K is so that the reader knows what equation 5 is showing.

      Added “K” to the “The quantity of offspring produced (K)”.

      (82) L 503: "diminishing returns" -> "diminishing returns as help increases".

      Done.  

      (83) L 507: Why these inequalities?

      These inequalities explain the use of Hmax (response to comment 79). We rephrased it to “the cumulative defense effort is larger than or the cumulative work effort is larger than ”.

      (84) L 526: "removing the influence of relatedness from the model". It would be helpful to plot relatedness in this and the other scenario to check that it is indeed low here and high in the other.

      The actual values of relatedness are provided in the Supplemental Material Table S1. We added this reference to Figure 2.  

      (85) L 528: "It is possible that direct and indirect fitness benefits could have an additive effect on the evolution of alloparental care". This is technically incorrect. It is also unclear what the point of this sentence is.

      We have removed this sentence.  

      (86) Table 1: Say what are the allowed values for these genotypic traits (can they take negative values, be greater than one, are they continuous or discrete?): e.g., alpha \in [0,1] or alpha \in (-infinity, infinity). For phenotypic traits, it would be helpful if the third column lists the equation where the trait is defined. As the variables in the first column are scalars, they should not be bold face. Survival "rate" should be survival "probability" throughout.

      All genetic traits can take any real number (-infinity, infinity), but the phenotypic values are either constrained by the equation like for logistic formulas, or manually constrained like for dispersal propensity or help (only positive numbers allowed). We added “Each genetic trait is controlled by a single locus, and may take any real number” (L403), and added the boundaries for help and dominance value in Table 1. We decided against including the equations in the table due to space constraints. We removed the bold face as suggested. We changed all instances of “survival rate” to “survival probability”.

      (87) Figures S1, S2: I don't recall seeing references to these figures in the main text, but there should be, as well as for Tables S1-S3.

      Table S1 is now referenced in Figure 2. The other figures are now referenced in the main text when we reference the different sections in the Supplemental Materials (L190 and L198). Other Tables are referenced in their respective Figures in the SI.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, Chen et al. used cryo-ET and in vitro reconstituted system to demonstrate that the autoinhibited form of LRRK2 can also assemble into filaments that wrap around the microtubule, although the filaments are typically shorter and less regular compared to the previously reported active-LRRK2 filaments. The structure revealed a new interface involving the N-terminal repeats that were disordered in the previous active-LRRK2 filament structure. The autoinhibited-LRRK2 filament also has different helical parameters compared to the active form.

      Strengths:

      The structure obtained in this study is the highest resolution of LRRK2 filaments done by subtomogram averaging, representing a major technical advance compared to the previous Cell paper from the same group. Overall, I think the data are well presented with beautiful graphic rendering, and valuable insights can be gained from this structural study.

      Weaknesses:

      (1) There are only three main figures, together with 9 supplemental figures. The authors may consider breaking the currently overwhelming Figures 1 and 3 into smaller figures and moving some of the supplemental figures to the main figure, e.g., Figure S7.

      (2) The key analysis of this manuscript is to compare the current structure with the previous active-LRRK2 filament structure. Currently, such a comparison is buried in Figure 3H. It should be part of Figure 1.

      We thank the reviewer for this suggestion. As suggested, we have rearranged the figures, split Figure 1 and 3 into smaller Figures, and moved the comparison analysis in Figure 3H to the new Figure 1. Specifically, the old Figure 1 is separated into two figures, introducing the model-building process and describing the two symmetric axes. The old Figure 3 is also separated into two small figures, describing the geometric analysis and model comparison, respectively.

      Reviewer #2 (Public review):

      The authors of this paper have done much pioneering work to decipher and understand LRRK2 structure and function, to uncover the mechanism by which LRRK2 binds to microtubules, and to study the roles that this may play in biology. Their previous data demonstrated that LRRK2 in the active conformation (pathogenic mutation or Type I inhibitor complex) bound to microtubule filaments in an ordered helical arrangement. This they showed induced a "roadblock" in the microtubule impacting vesicular trafficking. The authors have postulated that this is a potentially serious flaw with Type 1 inhibitors and that companies should consider generating Type 2 inhibitors in which the LRRK2 is trapped in the inactive conformation. Indeed the authors have published much data that LRRK2 complexed to Type 2 inhibitors does not seem to associate with microtubules and cause roadblocks in parallel experiments to those undertaken with type 1 inhibitors published above.

      In the current study, the authors have undertaken an in vitro reconstitution of microtubule-bound filaments of LRRK2 in the inactive conformation, which surprisingly revealed that inactive LRRK2 can also interact with microtubules in its auto-inhibited state. The authors' data show that while the same interfaces are seen with both the active and inactive microtubule-bound forms of LRRK2, they identified a new interface that involves the WD40-ARM-ANK domains and that reportedly contributes to the ability of the inactive form of LRRK2 to bind to microtubule filaments. The structures of the inactive LRRK2 complexed to microtubules are of medium resolution and do not allow visualisation of side chains.

      This study is extremely well-written and the figures are incredibly clear and well-presented. The finding that LRRK2 in the inactive autoinhibited form can be associated with microtubules is an important observation that merits further investigation. This new observation makes an important contribution to the literature and builds upon the pioneering research that this team of researchers has contributed to the LRRK2 fields. However, in my opinion, there is still significant work that could be considered to further investigate this question and understand the physiological significance of this observation.

      We thank the reviewer for the positive comments, and we agree that more work can be done next to understand the physiological significance of the autoinhibited LRRK2 in cellular environments. We are actively working on understanding how the stability of autoinhibited full-length LRRK2 is regulated, especially how the transition between autoinhibited and active forms of LRRK2 can happen. Our in situ data (Watanabe et al. 2020) indicate that overexpressed hyperactive PD-mutant LRRK2 mainly adopts its active-like conformation in cells. Thus, learning how the state transition occurs will allow us to target autoinhibited LRRK2 specifically and efficiently in cells and study its structure and function in physiological conditions.

      Reviewer #3 (Public review):

      Summary:

      The manuscript by Chen et al. examines the structure of the inactive LRRK2 bound to microtubules using cryo-EM tomography. Mutations in this protein have been shown to be linked to Parkinson's Disease. It is already shown that the active-like conformation of LRRK2 binds to the MT lattice, but this investigation shows that full-length LRRK2 can oligomerize on MTs in its autoinhibited state with different helical parameters than were observed with the active-like state. The structural studies suggest that the autoinhibited state is less stable on MTs.

      Strengths:

      The protein of interest is very important biomedically and a novel conformational binding to microtubules is proposed.

      Weaknesses:

      (1) The structures are all low resolution.

      We thank the reviewer for the comments on both the strengths and weaknesses of the manuscript. We agree with the reviewer that higher resolution would provide more information about how LRRK2 interacts with microtubules and oligomerizes in its autoinhibited form. However, with the current resolution, our model-building benefited significantly from the published high-resolution models and the AlphaFold predictions. We used cryo-ET and subtomogram analysis to solve the structure because this filament is less regular than the right-handed active LRRK2 filament, preventing us from using conventional single-particle analysis. As highlighted by reviewer 1, being able to push the resolution to sub-nanometer is an important advance reflecting state-of-the-art subtomogram analysis, especially for a heterogeneous sample. Notably, the microtubule reconstruction reached higher resolution, comparable to our previous single-particle studies on LRRK2-RCKW (Snead and Matyszewski et al.), confirming the data quality.

      (2) There are no measurements of the affinity of the various LRRK2 molecules (with and without inhibitors) to microtubules. This should be addressed through biochemical sedimentation assay.

      We thank the reviewer for the suggestion and we agree that learning the binding affinity between LRRK2 and microtubules would be informative. We attempted to purify LRRK2 carrying mutations at the WD40:ARM/ANK interface we identified in the manuscript. Unfortunately, for both LRRK2 and LRRK2<sup>I2020T</sup> with the N-terminal mutations (R521A/F573A/E854K), the yield and purity of the final samples were significantly worse than in our routine LRRK2 prep. Our chromatography and gel electrophoresis results indicate that the proteins degrade during purification.

      Author response image 1.

      While we have attached the results here, and it would be interesting to investigate why N-terminal mutations destabilize LRRK2, we anticipate that significant efforts would be required for further experiments, which we respectfully consider outside of the scope of this manuscript. 

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) In Figure S9, the graphic definition of "chain length" in panel A is misleading. The authors can simply note in the figure legend that "chain length is the number of asymmetric units in a continuous chain".

      We thank the reviewer for the suggestion. The updated figure and legend have incorporated the changes.

      (2) In Figure S7B, the conformation changes of the 'G-loop' and the 'DYG' motifs are not so convincing at the current resolution.

      We thank the reviewer for pointing it out. We agree that our model resolution is not high enough to support the unbiased observation of the conformation changes of the key kinase motifs. In the revised manuscript, we avoided emphasizing the comparison between the two models. Instead, we state that for both the MLi-2 bound map and the GZD-824 bound map, the corresponding published high-resolution models fit into each kinase map, but the MLi-2 bound model doesn’t fit as well in the GZD-824 bound map, with the correlation value dropping from 0.44 to 0.4, supporting our statement that “full-length LRRK2 bound to microtubules is in its autoinhibited state in our reconstituted system”.

      Reviewer #2 (Recommendations for the authors):

      (1) Are there any cellular experiments that could be done to demonstrate that inactive LRRK2 associates with microtubules in cells?

      We thank the reviewer for pointing out this direction for future studies. We are studying the physiological significance of the autoinhibited LRRK2 in cells, but haven’t yet been successful at demonstrating physiological binding to microtubules. Further, as noted in our response to reviewer #3, we are also actively working on understanding how the stability of autoinhibited full-length LRRK2 is regulated, especially how the transition between autoinhibited and active forms of LRRK2 can happen. Our in situ data (Watanabe et al. 2020) indicate that hyperactive PD-mutant overexpressed LRRK2 mainly adopts its active-like conformation in cells. Thus, learning how the state transition occurs will allow us to target autoinhibited LRRK2 specifically and efficiently in cells and study its structure and function in physiological conditions.

      (2) Previous work that the authors and others have undertaken has suggested that only LRRK2 in its active conformation can associate with microtubule filaments and the authors have shown that this leads to a roadblock in vesicular transport only when LRRK2 is complexed with Type 1 but not Type 2 inhibitors. There seems to be some discrepancy here that is not addressed in the paper as based on the current results one would also expect LRRK2 bound to Type 2 inhibitors to induce roadblocks in microtubule filaments. How can this be explained?

      We thank the reviewer for raising this important question. Taking all of our published data together, we believe that LRRK2 can introduce roadblocks with Type 1 inhibitor bound in the active-like conformation, where N-terminus LRRK2 domains are flexible and don’t block the kinase active site. In other words, full-length LRRK2 can form roadblocks when it behaves more like the truncated LRRK2<sup>RCKW</sup> variant. The autoinhibited LRRK2 forms shorter and less stable oligomers on microtubules, making it harder to block transport. Consistent with this, our in situ LRRK2-microtubule structure was observed in cells where LRRK2 is in an active-like conformation, and the LRRK2 N-terminus appeared to be flexible and away from the microtubule when forming right-handed filaments.

      (3) Does the finding that inactive LRRK2 only binds to microtubules as a short filament, explain the differences between the inactive and active forms of LRRK2 binding to microtubules and causing roadblocks?

      We thank the reviewer for discussing this point with us and asking the question. As we replied in the previous comment, the reviewer’s conclusion explains how the roadblock phenomenon occurs only under certain circumstances. We expanded our discussion to add the following and address the question:

      “Notably, we previously demonstrated that active‐like LRRK2, when bound to a Type I inhibitor, can form roadblocks that impair vesicular transport. Since autoinhibited LRRK2 assembles into shorter, less stable oligomers on microtubules, we anticipate it will exert reduced road‐blocking effects in cells, regardless of the inhibitor bound.”

      (4) Could the authors undertake further characterization of the new WD40-ARM-ANK interface that they have identified? Is this important for the binding of the autoinhibited mutant? Could mutants be made in this interface to see if this prevents the autoinhibited but not the active conformation of LRRK2 binding to microtubules?

      We thank the reviewer for the comment. As mentioned in our response to Reviewer #2, public comment #2, we attempted multiple times to purify LRRK2 carrying mutations at the WD40:ARM/ANK interface we identified in the manuscript. Unfortunately, for both LRRK2 and LRRK2<sup>I2020T</sup> with the N-terminal mutations (R521A/F573A/E854K), the yield and purity of the final samples were significantly worse than in our routine LRRK2 prep. Our chromatography and gel electrophoresis results indicate that the proteins degrade during purification.

      (5) The authors identify several disease-relevant missense mutations that appear to lie within the novel interface that the authors have characterised in this study. Although this is discussed in the Discussion, some experimental data demonstrating how these missense mutations impact the ability of inactive LRRK2 to bind to microtubule filaments in the presence or absence of Type 1 and Type 2 compounds could provide further experimental data that emphasises the physiological importance of the results presented in this study.

      We thank the reviewer for discussing this interesting direction. The disease-relevant missense mutations can have a direct or indirect impact on the binding of autoinhibited LRRK2 to microtubules, and we agree that it would be interesting to test it out in the future. However, we anticipate that significant effort would be required for further experiments. Alas, our funding for this project ended suddenly and we want to report our results to the community.

      (6) For the data that is shown in Figure 1, could the authors explain how this differs from results in previous papers of the authors showing that the active form of LRRK2 binds microtubules? How does the binding observed here differ from that observed in the previous studies? To a non-specialist reader, the data looks fairly like what has previously been reported.

      We thank the reviewer for asking the question. As mentioned in the response to the public review, the detailed comparison between the data and the previous papers is described in Figure 3, and we agree that it is helpful to incorporate this information in Figure 1. In the revised manuscript, we have incorporated the comparison panel in Figure 1.

      (7) The finding that the autoinhibited LRRK2 forms short and sparse oligomers on microtubules raises the question of how physiological this observation is. Having some data that suggests that this is physiologically relevant would boost the impact of this study.

      We agree with the reviewer on this comment. As discussed in the response to the first comment from the reviewer, we have not been able to assess the physiological relevance of LRRK2 binding to microtubules in either active or inactive state, but continue to pursue this line of research. We are aware and regret that this lessens the impact of this work.

      (8) For the more general reader the authors could potentially better highlight why the key finding in this paper is important.

      We thank the reviewer for the suggestion. To further address the significance of the key findings, especially how it can open up more possibilities for inhibitor-based drug development, we expand our discussion section to include the following:

      “Understanding how Type I and Type II inhibitors’ binding to LRRK2 affects its mechanism is vital to the design of inhibitor-based PD drug development strategies. Our findings revealed that different LRRK2 kinase inhibitors bind to autoinhibited LRRK2 similarly either in solution or on microtubules. Furthermore, the observation of autoinhibited LRRK2 forming short, less stable oligomers on microtubules opens new possibilities to inhibit LRRK2 activity in PD patients. A Type I inhibitor specifically targeting autoinhibited LRRK2 may alleviate the effect of LRRK2 roadblocks on microtubules. Alternatively, a promising strategy of LRRK2 inhibitor design can focus on the stabilization of allosteric N-terminus blocking on the kinase domain, which favors the formation of autoinhibited LRRK2 oligomers on microtubules and causes fewer side effects.”

      Reviewer #3 (Recommendations for the authors):

      In the third paragraph of the introduction, expand on whether type-1 inhibitors which "capture kinases in a closed, "active-like" conformation still inhibit the kinase activity.

      We thank the reviewer for the request to expand this paragraph. We added the following explanation for better understanding in the third paragraph:

      “Type-I inhibitors bind to the ATP binding site and target the kinase in its ‘active-like’ conformation, inhibiting its kinase activity.”

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      This study examined the effect of blood pressure variability on brain microvascular function and cognitive performance. By implementing a model of blood pressure variability using an intermittent infusion of AngII for 25 days, the authors examined different cardiovascular variables, cerebral blood flow, and cognitive function during midlife (12-15-month-old mice). Key findings from this study demonstrate that blood pressure variability impairs baroreceptor reflex and impairs myogenic tone in brain arterioles, particularly at higher blood pressure. They also provide evidence that blood pressure variability blunts functional hyperemia and impairs cognitive function and activity. Simultaneous monitoring of cardiovascular parameters, in vivo imaging recordings, and the combination of physiological and behavioral studies reflect rigor in addressing the hypothesis. The experiments are well-designed, and the data generated are clear. I list below a number of suggestions to enhance this important work:

      (1) Figure 1B: It is surprising that the BP circadian rhythm is not distinguishable in either group. Figure 2, however, shows differences in circadian rhythm at different timepoints during infusion. Could the authors explain the lack of circadian effect in the 24-h traces?

      The circadian rhythm pattern is apparent in Figure 2 (Active BP higher than Inactive BP), where BP is presented as 12-hour averages. When the BP data are expressed as one-hour averages (rather than minute-to-minute) over 24 hours, now included in the revised manuscript as Supplemental Figure 3C-D, the circadian rhythm becomes noticeable. In addition, we have included one-hour average BP data for all mice in the control and BPV groups, Supplemental Figure 3A-B.

      Notably, the Ang II-induced pulsatile BP pattern remains evident in the one-hour averages for the BPV group, Supplemental Figure 3B. To minimize bias and validate variability, pump administration start times were randomized for both control and BPV groups, Supplemental Figure 3A-B. Despite these adjustments, the circadian rhythm profile of BP is consistently maintained across individual mice and in the collective dataset, Supplemental Figure 3C-D.
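
      The effect of the averaging window described in this response can be illustrated with a minimal sketch. It is not the authors' analysis pipeline: the function name `block_average` and the synthetic trace are assumptions made for illustration, and the numbers are not study data. The sketch shows how a slow day/night offset that is hidden by large minute-to-minute swings becomes visible once the trace is reduced to one-hour or 12-hour averages.

```python
from statistics import mean


def block_average(samples, block_size):
    """Average consecutive, non-overlapping blocks of `block_size` samples."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        mean(samples[i:i + block_size])
        for i in range(0, len(samples), block_size)
    ]


# Hypothetical minute-by-minute pressure trace (mmHg) for one 24-hour day:
# a slow 10 mmHg offset between the two halves of the day, buried under a
# fast +/-15 mmHg fluctuation. These values are illustrative only.
minute_bp = [
    (110.0 if minute < 720 else 100.0) + (15.0 if minute % 2 == 0 else -15.0)
    for minute in range(1440)
]

hourly_bp = block_average(minute_bp, 60)     # 24 one-hour averages
half_day_bp = block_average(minute_bp, 720)  # two 12-hour averages
```

      In the raw trace the fast fluctuation (30 mmHg peak to peak) is three times the slow offset, whereas the hourly and 12-hour averages contain only the slow component.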

      (2) While saline infusion does not result in elevation of BP when compared to Ang II, there is an evident "and huge" BP variability in the saline group, at least 40mmHg within 1 hour. This is a significant physiological effect to take into consideration, and therefore it warrants discussion.

      Thank you for this comment. The large variations in BP in the raw traces during saline infusion reflect transient BP changes induced by movement/activity, which is now included in Figure 1B (maroon trace). The revised manuscript now includes Line 222 “Note that dynamic activity-driven BP changes were apparent during both saline- and Ang II infusions, Figure 1B”.

      (3) The decrease in DBP in the BPV group is very interesting. It is known that chronic Ang II increases cardiac hypertrophy, are there any changes to heart morphology, mass, and/or function during BPV? Can the decrease in DBP in BPV be attributed to preload dysfunction? This observation should be discussed.

      The lower DBP in the BPV group was already present at baseline, while both groups were still infused with saline, and was a difference beyond our control. However, this is an important and valid consideration, particularly considering the minimal yet significant increase in SBP within the BPV group (Figure 1D). Our goal was to induce significant transient blood pressure responses (BPV) and investigate the impact on cardiovascular and neurovascular outcomes in the absence of hypertension. We did not anticipate any major cardiac remodeling at this early time point (considering the absence of overt hypertension) and thus cardiac remodeling was not assessed and this is now discussed in the revised manuscript (Line 443-453).

      (4) Examining the baroreceptor reflex during the early and late phases of BPV is quite compelling. Figures 3D and 3E clearly delineate the differences between the two phases. For clarity, I would recommend plotting the data as is shown in panels D and E, rather than showing the mathematical ratio. Alternatively, plotting the correlation of ∆HR to ∆SBP and analyzing the slopes might be more digestible to the reader. The impairment in baroreceptor reflex in the BPV during high BP is clear, is there any indication whether this response might be due to loss of sympathetic or gain of parasympathetic response based on the model used?

      We appreciate the reviewer’s suggestion and have accordingly generated new figures displaying scatter plots of SBP vs HR with linear regression analysis (Figure 3D-G). Our goal is to further investigate which branch of the autonomic nervous system is affected in this model. The loss of a bradycardic response suggests either an enhancement of sympathetic activity, a reduction in parasympathetic activity, or a combination of both. This is briefly discussed in the revised manuscript (Line 486-496).

      Heart rate variability (HRV) serves as an index of neurocardiac function and dynamic, non-linear autonomic nervous system processes, as described in Shaffer and Ginsberg[1]. However, given that our data were limited to BP and HR readings collected at one-minute intervals, our primary assessment of autonomic function is limited to the bradycardic response. Further studies will be necessary to fully characterize the autonomic parameters influenced by chronic BPV.
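
      The SBP-versus-HR regression mentioned in this response can be sketched as an ordinary least-squares fit, with the slope read as a simple index of the bradycardic response. This is a minimal illustration under stated assumptions: the function `linear_fit` and the paired readings are hypothetical, and the authors' actual regression settings are not given in this response.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical paired readings (not study data): systolic pressure in mmHg
# and heart rate in beats per minute during a pressor pulse.
sbp = [120.0, 130.0, 140.0, 150.0, 160.0]
hr_intact = [600.0, 580.0, 560.0, 540.0, 520.0]   # HR falls steeply with SBP
hr_blunted = [600.0, 595.0, 590.0, 585.0, 580.0]  # HR barely falls with SBP

slope_intact, _ = linear_fit(sbp, hr_intact)
slope_blunted, _ = linear_fit(sbp, hr_blunted)
```

      A slope closer to zero in the second series corresponds to a smaller fall in heart rate per mmHg rise in pressure, which is how a blunted bradycardic response would appear in such a plot.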

      (5) Figure 3B shows a drop in HR when the pump is ON irrespective of treatment (i.e., independent of BP changes). What is the underlying mechanism?

      We apologize for any lack of clarity. These observed heart rate (HR) changes occurred during Ang II infusion, when blood pressure (BP) was actively increasing. In the control group, the pump solution was switched to Ang II during specific periods (days 3-5 and 21-25 of the treatment protocol) to induce BP elevations and a baroreceptor response, allowing direct comparisons between the control and BPV group.

      To clarify this point, we have revised Line 260-263 of the manuscript: “To compare pressure-induced bradycardic responses between BPV and control mice at both early and later treatment stages, a cohort of control mice received Ang II infusion on days 3-5 (early phase) (Supplemental Figure 4) and days 21-25 (late phase) thereby transiently increasing BP”.

      Additionally, a detailed description has been added to the Methods section (Line 96-101): “Controls receiving Ang II: To facilitate between-group comparisons (control vs BPV), a separate cohort of control mice were subjected to the same pump infusion parameters as BPV mice but for a brief period receiving Ang II infusions on days 3-5 and 21-25 for experiments assessing pressure-evoked responses, including bradycardic reflex, myogenic response, and functional hyperemia at high BP.”

      (6) The correlation of ∆diameter vs MAP during low and high BP is compelling, and the shift in the cerebral autoregulation curve is also a good observation. I would strongly recommend that the authors include a schematic showing the working hypothesis that depicts the shift of the curve during BPV.

      Thank you for this insightful comment. The increase in vessel reactivity to BP elevations in parenchymal arterioles of BPV mice suggests that chronic BPV induces a leftward shift and a potential narrowing of the cerebral autoregulation range (lower BP thresholds for both the upper and lower limits of autoregulation). This has been incorporated (and discussed) into the revised manuscript (see Figure 5N).

      One potential explanation for these changes is that the absence of sustained hypertension, a prominent feature in most rodent models of hypertension, limits adaptive processes that protect the cerebral microcirculation from large BP fluctuations (e.g., vascular remodeling). While this study does not specifically address arteriole remodeling, the lack of such adaptation may reduce pressure buffering by upstream arterioles, thereby rendering the microcirculation more vulnerable to significant BP fluctuations.

      The unique model allows for measurements of parenchymal arteriole reactivity to acute dynamic changes in BP (both an increase and decrease in MAP). Our findings indicate that chronic BPV enhances the reactivity of parenchymal arterioles to BP changes—both during an increase in BP and upon its return to baseline, Supplemental Figure 5C, F. The data suggest an increased myogenic response to pressure elevation, indicative of heightened contractility, a common adaptive process observed in rodent models of hypertension[2-4]. However, our model also reveals a notable tendency for greater dilation when the BP drops, Supplemental Figure 5F. This intriguing observation may suggest ischemia during the vasoconstriction phase (at higher BP), leading to enhanced release of dilatory signals, which subsequently manifest as a greater dilation upon BP reduction. This phenomenon bears similarities to chronic hypoperfusion models[5,6], where vasodilatory mechanisms become more pronounced in response to sustained ischemic conditions. Future studies investigating the effects of BPV on myogenic responses and brain perfusion will be a priority for our ongoing research.

      (7) Functional hyperemia impairment in the BPV group is clear and well-described. Pairing this response with the kinetics of the recovery phase is an interesting observation. I suggest elaborating on why BPV group exerts lower responses and how this links to the rapid decline during recovery.

      Based on the heightened reactivity of BPV parenchymal arterioles to intravascular pressure (Figure 5), we anticipate that the reduction of sensory-evoked dilations results from an increased vasoconstrictive activity and/or a decreased availability of vasodilatory signaling pathways (NO, EETs, COX-derived prostaglandins)[7,8]. Consequently, the magnitude of the FH response is blunted during periods of elevated BP in BPV mice.

      Additionally, upon termination of the stimulus-induced response−when vasodilatory signals would typically dominate−vasoconstrictive mechanisms are rapidly engaged (or unmasked), leading to quicker return to baseline. This shift in the balance between vasodilatory and vasoconstrictive forces favors vasoconstriction, contributing to the altered recovery kinetics observed in BPV mice. This has been included in the Discussion section of the revised manuscript.

      (8) The experimental design for the cognitive/behavioral assessment is clear and it is a reasonable experiment based on previous results. However, the discussion associated with these results falls short. I recommend that the authors describe the rationale to assess recognition memory, short-term spatial memory, and mice activity, and explain why these outcomes are relevant in the BPV context. Are there other studies that support these findings? The authors discussed that no changes in alternation might be due to the age of the mice, which could already exhibit cognitive deficits. In this line of thought, what is the primary contributor to behavioral impairment? I think that this sentence weakens the conclusion on BPV impairing cognitive function and might even imply that age per se might be the factor that modulates the various physiological outcomes observed here. I recommend clarifying this section in the discussion.

      We thank the reviewer for this comment. Clinical studies have demonstrated that patients with elevated BPV exhibit impairments across multiple cognitive domains, including declines in processing speed[9] and episodic memory[10]. To evaluate memory function, we utilized behavioral tests: the novel object recognition (NOR) task to assess episodic memory[11] and the spontaneous Y-maze to evaluate short-term spatial memory[12].

      Previous research indicates that older C57Bl6 mice (14-month-old) exhibit cognitive deficits compared to younger counterparts (4- and 9-month-old)[13]. To ensure rigorous selection for behavioral testing, we conducted a preliminary NOR assessment, evaluating recognition memory at the one-hour delay but observing failures at the four- and 24-hour delays, indicating age-related deficits. Based on these results, animals failing recognition criteria were excluded from subsequent behavioral assessment. However, because no baseline cognitive testing was conducted for the spontaneous Y-maze, it is possible that some mice with age-related deficits were included in this test, which may have influenced data interpretation.

      Additionally, the absence of differences in the Y-maze performance may suggest that short-term spatial memory remains intact following 25 days of BPV, a point that is now discussed in the revised manuscript.

      (9) Why were only male mice used?

      We appreciate this comment and acknowledge the importance of conducting experiments in both male and female mice. Studies involving female mice are currently ongoing, with telemetry data collection approximately halfway completed and two-photon imaging studies on functional hyperemia also partially completed. However, using middle-aged mice for these experiments has proven challenging due to high mortality rates following telemetry surgeries. As a result, we initially limited our first cohort to male mice.

      (10) In the results for Figure 3: "Ang II evoked significant increases in SBP in both control and BPV groups;...". Also, in the figure legend: "B. Five-minute average HR when the pump is OFF or ON (infusing Ang II) for control and BPV groups...." The authors should clarify this as the methods do not state a control group that receives Ang II.

      Please refer to response to comment 5.

      Reviewer #2 (Public review):

      Summary:

      Blood pressure variability has been identified as an important risk factor for dementia. However, there are no established animal models to study the molecular mechanisms of increased blood pressure variability. In this manuscript, the authors present a novel mouse model of elevated BPV produced by pulsatile infusions of high-dose angiotensin II (3.1ug/hour) in middle-aged male mice. Using elegant methodology, including direct blood pressure measurement by telemetry, programmable infusion pumps, in vivo two-photon microscopy, and neurobehavioral tests, the authors show that this BPV model resulted in a blunted bradycardic response and cognitive deficits, enhanced myogenic response in parenchymal arterioles, and a loss of the pressure-evoked increase in functional hyperemia to whisker stimulation.

      Strengths:

      As the presentation of the first model of increased blood pressure variability, this manuscript establishes a method for assessing molecular mechanisms. The state-of-the-art methodology and robust data analysis provide convincing evidence that increased blood pressure variability impacts brain health.

      Weaknesses:

      One major drawback is that there is no comparison with another pressor agent (such as phenylephrine); therefore, it is not possible to conclude whether the observed effects are a result of increased blood pressure variability or caused by direct actions of Ang II.

      We acknowledge this limitation and have attempted to address the concern by introducing an alternative vasopressor, norepinephrine (NE), Figure 4. A subcutaneous dose of 45 µg/kg/min was titrated to match Ang II-induced transient BP pulse (Systolic BP ~150-180 mmHg), Figure 4A. Similar to Ang II treated mice, NE-treated mice exhibited no significant changes in average mean arterial pressure (MAP) throughout the 20-day treatment period (Figure 4B). Although there was a trend (P=0.08) towards increased average real variability (ARV) (Figure 4C left), it did not reach statistical significance. The coefficient of variation (CV) (Figure 4C right) was significantly increased by day 3-4 of treatment (P=0.02).
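
      For readers unfamiliar with the two variability indices named above, the sketch below uses their common textbook definitions: average real variability (ARV) as the mean absolute difference between successive readings, and coefficient of variation (CV) as the standard deviation divided by the mean, in percent. Whether the authors used exactly these forms (for example, sample versus population standard deviation, or which sampling interval) is an assumption here, and the readings are illustrative, not study data.

```python
from statistics import mean, stdev


def average_real_variability(values):
    """Mean absolute difference between successive readings."""
    if len(values) < 2:
        raise ValueError("need at least two readings")
    return mean([abs(b - a) for a, b in zip(values, values[1:])])


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    return 100.0 * stdev(values) / mean(values)


# Two hypothetical pressure series (mmHg) with the same mean of 120 mmHg.
steady = [118.0, 122.0, 118.0, 122.0]
pulsed = [100.0, 140.0, 100.0, 140.0]

arv_steady = average_real_variability(steady)
arv_pulsed = average_real_variability(pulsed)
cv_steady = coefficient_of_variation(steady)
cv_pulsed = coefficient_of_variation(pulsed)
```

      Both series have the same average pressure, yet ARV and CV are tenfold larger for the pulsed series, which is the sense in which variability can rise without a change in mean arterial pressure.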

      Notably, unlike the bradycardic response observed during Ang II-induced BP elevations, NE infusions elicited a tachycardic response (Figure 4A), likely due to β-1 adrenergic receptor activation. However, significant mortality was observed within the NE cohort: three of six mice died prematurely during the second week of treatment, and two additional mice required euthanasia on days 18 and 20 due to lethargy, impaired mobility, and tachypnea.

      While we recognize the importance of comparing results across vasopressors, further investigation using additional vasopressors would require a dedicated study, as each agent may induce distinct off-target effects, potentially generating unique animal models. Alternatively, a mechanical approach−such as implanting a tethered intra-aortic balloon[14] connected to a syringe pump−could be explored to modulate blood pressure variability without pharmacological intervention. However, such an approach falls beyond the scope of the present study.

      Ang II is known to have direct actions on cerebrovascular reactivity, neuronal function, and learning and memory. Given that Ang II is increased in only 15% of human hypertensive patients (and an even lower percentage of non-hypertensive), the clinical relevance is diminished. Nonetheless, this is an important study establishing the first mouse model of increased BPV.

      We agree that high Ang II levels are not a predominant cause of hypertension in humans, which is why it is critical that our pulsatile Ang II dosing did not cause overt hypertension (no increase in 24-hour MAP). Ang II was solely a tool to produce controlled, transient increases in BP to yield a significant increase in BPV.

      Regarding BPV specifically, prior studies indicate that primary hypertensive patients with elevated urinary angiotensinogen-to-creatinine ratio exhibit significantly higher mean 24-hour systolic ARV compared to those with lower ratios[15]. However, the fundamental mechanisms driving these harmful increases in BPV remain poorly defined. A central theme across clinical BPV studies is impaired arterial stiffness, which has been proposed to contribute to BPV through reduced arterial compliance and diminished baroreflex sensitivity. Moreover, increased BPV can exert mechanical stress on arterial walls, leading to arterial remodeling and stiffness−ultimately perpetuating a detrimental feed-forward cycle[16].

      In our model, male BPV mice exhibited a minimal yet significant elevation in SBP without corresponding increases in DBP, potentially reflecting isolated systolic hypertension, which is strongly associated with arterial stiffness[17,18]. Our initial goal was to establish controlled rapid fluctuations in BP, and Ang II was selected as the pressor due to its potent vasoconstrictive properties and short half-life[19].

      We appreciate the reviewer’s insightful comment and acknowledge the necessity of exploring alternative mechanisms underlying BPV that are independent of Ang II. It is our long-term goal to investigate these factors in further studies.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) How was the dose of Ang II determined? It seems that this dose (3.1ug/hr) is quite high.

      The Ang II dose was titrated in a preliminary study to one that induced a significant and transient BP response without increasing 24-hour blood pressure (i.e. no hypertension).

      Ang II was delivered subcutaneously at 3.1 μg/hr, a concentration comparable to high-dose Ang II administration via mini-osmotic pumps (~1700 ng/kg/min)[20], with one-hour pulses occurring every 3-4 hours. With 6 pulses per day, the total daily dose equates to 18.6 µg/day in a ~30 gram mouse.

      For comparison, if the same 18.6 µg/day dose were administered continuously via a mini-osmotic pump (18.6 µg/0.03kg/1440min), the resulting dosage would be approximately 431 ng/kg/min[21,22], aligning with subpressor dose levels. Thus, while the total dose may appear high, it is not delivered in a constant manner but rather intermittently, allowing for controlled, rapid variations in blood pressure.
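
      The dose arithmetic in this response can be checked with a short sketch. The infusion rate, number of pulses, pulse duration, and body mass are taken from the text above; the function and variable names are illustrative assumptions.

```python
# Values stated in the response above.
PULSE_RATE_UG_PER_HR = 3.1  # Ang II delivery rate while the pump is on
PULSES_PER_DAY = 6          # one-hour pulses every 3-4 hours
PULSE_DURATION_HR = 1.0
BODY_MASS_KG = 0.03         # ~30 gram mouse


def ng_per_kg_per_min(dose_ug, body_mass_kg, minutes):
    """Convert a dose in micrograms over `minutes` to ng/kg/min."""
    return dose_ug * 1000.0 / body_mass_kg / minutes


# Total daily dose: 3.1 ug/hr x 1 hr x 6 pulses, about 18.6 ug/day.
daily_dose_ug = PULSE_RATE_UG_PER_HR * PULSE_DURATION_HR * PULSES_PER_DAY

# Rate while a pulse is running (3.1 ug over 60 min), roughly 1722 ng/kg/min.
rate_during_pulse = ng_per_kg_per_min(PULSE_RATE_UG_PER_HR, BODY_MASS_KG, 60)

# Same daily dose spread evenly over 1440 min, roughly 431 ng/kg/min.
rate_if_continuous = ng_per_kg_per_min(daily_dose_ug, BODY_MASS_KG, 1440)
```

      The roughly fourfold gap between the two rates follows directly from the pump being on for 6 of every 24 hours.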

      (2) Were behavioral studies performed on the same mice that were individually housed? Individual housing causes significant stress in mice that can affect learning and memory tasks (PMC6709207). It's not a huge issue since the control mice would have been housed the same way, but it is something that could be mentioned in the discussion section.

      Behavioral studies were performed on mice that were individually housed following the telemetry surgery. The study was started once BP levels stabilized, as mice required several days to achieve hemodynamic stability post-surgery. Consequently, all mice were individually housed for several days before undergoing behavioral assessment.

      To account for potential cognitive variability, earlier novel object recognition (NOR) tests were conducted to establish cognitive capacity, and mice that did not meet criteria were excluded from further behavioral testing. However, we acknowledge that individual housing induces stress, which can influence learning and memory, and this is a factor we were unable to fully control. Given that both experimental and control groups experienced the same housing conditions, this stress effect should be comparable across cohorts. A discussion on this limitation is now included in the text.

      (3) It looks like one control mouse was included in both Figures 1 and 2 (control n=12) but excluded from Table 1 (control n=11). This isn't mentioned in the text; please include the exclusion criteria in the manuscript.

      We apologize for the typo: 12 control animals were consistently used across Figures 1-2, Table 1, Supplemental Table 1, Figure 6C, and Supplemental Figure 2B. Since the initial submission, one additional control mouse completed the protocol and was included in the telemetry control cohort. Thus, in the updated manuscript, we have corrected the control sample size to 13 mice across these figures, ensuring consistency.

      Additionally, exclusion criteria have now been explicitly included in the manuscript (Lines 173-175). Mice were excluded from the study if they died prematurely (prior to treatment onset) or exhibited abnormally elevated pressure while receiving saline, likely due to complications from telemetry surgery.

      (4) Please include a statement on why female mice were not included in this study.

      As discussed in our response to Reviewer #1, our initial intention was to include both male and female mice in this study. However, high mortality rates following telemetry surgeries significantly constrained our ability to advance all aspects of the study. As a result, we limited our first cohort to males to establish the basics of the model. A statement is now included in the manuscript, Lines 50-53: “Female mice were not included in the present study due to high post-surgery mortality observed in 12-14-month-old mice following complex procedures. To minimize confounding effects of differential survival and to establish foundational data for this model, we restricted the investigation to male mice.”

      Potential sex differences may be complex and warrant separate studies to comprehensively assess sex as a biological variable; these studies are currently ongoing.

      (5) On page 14, "experiments from control vs experimental mice were not equally conducted in the same season raising the possibility for a seasonal effect" - does this mean that control experiments were not conducted at the same time as the Ang II infusions in BPV mice? This has huge implications on whether the effects observed are induced by treatment or just batch seasonal effects.

      We fully acknowledge the reviewer’s concern, and our statement aims to provide transparency regarding the study’s limitations. Several challenges contributed to this outcome, including high mortality rates following surgeries (primarily telemetry implantation) and technical issues related to instrumentation, particularly telemetry functionality.

      Timing differences between BPV and saline mice arose primarily from mortality or telemetry failures: some mice did not survive surgery, while others remained healthy but had non-functional telemeters. This issue was particularly pronounced in 14-month-old mice, as their fragile vasculature occasionally prevented proper BP readings.

      Each experiment required a minimum of two and a half months per mouse to complete, with a cost (also per mouse) exceeding $1500 USD ($300 pump, $175 mouse, $900 telemeters, per diem, drugs, reagents etc.). Despite our best effort to ensure comparable seasonal/batch data, these logistical and technical constraints prevented perfect synchronization.

      To evaluate whether seasonal differences influenced our results, we incorporated additional telemetry data into the control cohort. Of the seven included control mice, six underwent the same treatment but were allocated to a separate branch of the study, whose endpoints did not require a chronic cranial window. We found no significant differences in 24-hour average MAP during the baseline period between control mice with or without a cranial window (Supplemental Figure 2A). Additionally, we grouped mice into seasonal categories based on Georgia’s climate, “Spring-Summer” (May-September) and “Fall-Winter” (October-April), but observed no BP differences between these periods (Supplemental Figure 2B).
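
      The seasonal grouping described above amounts to a simple month-based rule. The following is a minimal sketch of that rule only; the function name and the use of the experiment date as input are assumptions for illustration, not details taken from the manuscript.

```python
from datetime import date


def season_category(experiment_date: date) -> str:
    """Assign a date to the two seasonal bins described in the response.

    May through September maps to "Spring-Summer"; October through April
    maps to "Fall-Winter". The helper name is hypothetical.
    """
    if 5 <= experiment_date.month <= 9:
        return "Spring-Summer"
    return "Fall-Winter"


# Example dates chosen arbitrarily to show both bins and the boundaries.
examples = [date(2023, 4, 30), date(2023, 5, 1), date(2023, 9, 30), date(2023, 10, 1)]
for d in examples:
    print(d.isoformat(), season_category(d))
```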

      Given the absence of seasonal effects on BP and the fact that mice were sourced from two independent suppliers (Jackson Laboratory and NIA), we consider it likely that the observed results are driven by treatment rather than seasonal or batch effects.

      (6) Methods, two-photon imaging: did the authors mean "retro-orbital" instead of "intra-orbital" injection of the Texas red dye? Also, is this a Texas red-dextran? If so, what molecular weight?

      Thank you for this comment. The correct terminology is “retro-orbital” rather than “intra-orbital” injection. Additionally, we utilized Texas Red-dextran (70 kDa, 5% [wt/vol] in saline) for the imaging experiments. These details have now been incorporated into the Methods section.

      (1) Shaffer F, Ginsberg JP. An Overview of Heart Rate Variability Metrics and Norms. Front Public Health. 2017;5:258. doi: 10.3389/fpubh.2017.00258

      (2) Pires PW, Jackson WF, Dorrance AM. Regulation of myogenic tone and structure of parenchymal arterioles by hypertension and the mineralocorticoid receptor. Am J Physiol Heart Circ Physiol. 2015;309:H127-136. doi: 10.1152/ajpheart.00168.2015

      (3) Iddings JA, Kim KJ, Zhou Y, Higashimori H, Filosa JA. Enhanced parenchymal arteriole tone and astrocyte signaling protect neurovascular coupling mediated parenchymal arteriole vasodilation in the spontaneously hypertensive rat. J Cereb Blood Flow Metab. 2015;35:1127-1136. doi: 10.1038/jcbfm.2015.31

      (4) Diaz JR, Kim KJ, Brands MW, Filosa JA. Augmented astrocyte microdomain Ca(2+) dynamics and parenchymal arteriole tone in angiotensin II-infused hypertensive mice. Glia. 2019;67:551-565. doi: 10.1002/glia.23564

      (5) Kim KJ, Diaz JR, Presa JL, Muller PR, Brands MW, Khan MB, Hess DC, Althammer F, Stern JE, Filosa JA. Decreased parenchymal arteriolar tone uncouples vessel-to-neuronal communication in a mouse model of vascular cognitive impairment. GeroScience. 2021. doi: 10.1007/s11357-020-00305-x

      (6) Chan SL, Nelson MT, Cipolla MJ. Transient receptor potential vanilloid-4 channels are involved in diminished myogenic tone in brain parenchymal arterioles in response to chronic hypoperfusion in mice. Acta Physiol (Oxf). 2019;225:e13181. doi: 10.1111/apha.13181

      (7) Tarantini S, Hertelendy P, Tucsek Z, Valcarcel-Ares MN, Smith N, Menyhart A, Farkas E, Hodges EL, Towner R, Deak F, et al. Pharmacologically-induced neurovascular uncoupling is associated with cognitive impairment in mice. J Cereb Blood Flow Metab. 2015;35:1871-1881. doi: 10.1038/jcbfm.2015.162

      (8) Ma J, Ayata C, Huang PL, Fishman MC, Moskowitz MA. Regional cerebral blood flow response to vibrissal stimulation in mice lacking type I NOS gene expression. Am J Physiol. 1996;270:H1085-1090. doi: 10.1152/ajpheart.1996.270.3.H1085

      (9) Sible IJ, Nation DA. Blood Pressure Variability and Cognitive Decline: A Post Hoc Analysis of the SPRINT MIND Trial. Am J Hypertens. 2023;36:168-175. doi: 10.1093/ajh/hpac128

      (10) Epstein NU, Lane KA, Farlow MR, Risacher SL, Saykin AJ, Gao S. Cognitive dysfunction and greater visit-to-visit systolic blood pressure variability. Journal of the American Geriatrics Society. 2013;61:2168-2173. doi: 10.1111/jgs.12542

      (11) Antunes M, Biala G. The novel object recognition memory: neurobiology, test procedure, and its modifications. Cognitive processing. 2012;13:93-110. doi: 10.1007/s10339-011-0430-z

      (12) Kraeuter AK, Guest PC, Sarnyai Z. The Y-Maze for Assessment of Spatial Working and Reference Memory in Mice. Methods Mol Biol. 2019;1916:105-111. doi: 10.1007/978-1-4939-8994-2_10

      (13) Singhal G, Morgan J, Jawahar MC, Corrigan F, Jaehne EJ, Toben C, Breen J, Pederson SM, Manavis J, Hannan AJ, et al. Effects of aging on the motor, cognitive and affective behaviors, neuroimmune responses and hippocampal gene expression. Behav Brain Res. 2020;383:112501. doi: 10.1016/j.bbr.2020.112501

      (14) Tediashvili G, Wang D, Reichenspurner H, Deuse T, Schrepfer S. Balloon-based Injury to Induce Myointimal Hyperplasia in the Mouse Abdominal Aorta. J Vis Exp. 2018. doi: 10.3791/56477

      (15) Ozkayar N, Dede F, Akyel F, Yildirim T, Ates I, Turhan T, Altun B. Relationship between blood pressure variability and renal activity of the renin-angiotensin system. J Hum Hypertens. 2016;30:297-302. doi: 10.1038/jhh.2015.71

      (16) Kajikawa M, Higashi Y. Blood pressure variability and arterial stiffness: the chicken or the egg? Hypertens Res. 2024;47:1223-1224. doi: 10.1038/s41440-024-01589-8

      (17) Laurent S, Boutouyrie P. Arterial Stiffness and Hypertension in the Elderly. Front Cardiovasc Med. 2020;7:544302. doi: 10.3389/fcvm.2020.544302

      (18) Wallace SM, Yasmin, McEniery CM, Maki-Petaja KM, Booth AD, Cockcroft JR, Wilkinson IB. Isolated systolic hypertension is characterized by increased aortic stiffness and endothelial dysfunction. Hypertension. 2007;50:228-233. doi: 10.1161/HYPERTENSIONAHA.107.089391

      (19) Al-Merani SA, Brooks DP, Chapman BJ, Munday KA. The half-lives of angiotensin II, angiotensin II-amide, angiotensin III, Sar1-Ala8-angiotensin II and renin in the circulatory system of the rat. J Physiol. 1978;278:471-490. doi: 10.1113/jphysiol.1978.sp012318

      (20) Zimmerman MC, Lazartigues E, Sharma RV, Davisson RL. Hypertension caused by angiotensin II infusion involves increased superoxide production in the central nervous system. Circ Res. 2004;95:210-216. doi: 10.1161/01.RES.0000135483.12297.e4

      (21) Gonzalez-Villalobos RA, Seth DM, Satou R, Horton H, Ohashi N, Miyata K, Katsurada A, Tran DV, Kobori H, Navar LG. Intrarenal angiotensin II and angiotensinogen augmentation in chronic angiotensin II-infused mice. Am J Physiol Renal Physiol. 2008;295:F772-779. doi: 10.1152/ajprenal.00019.2008

      (22) Nakagawa P, Nair AR, Agbor LN, Gomez J, Wu J, Zhang SY, Lu KT, Morgan DA, Rahmouni K, Grobe JL, et al. Increased Susceptibility of Mice Lacking Renin-b to Angiotensin II-Induced Organ Damage. Hypertension. 2020;76:468-477. doi: 10.1161/HYPERTENSIONAHA.120.14972

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      Proposed revision plan

      Based on the below reviews, we propose the following revision plan. Briefly:

      • We will remove the functional data on TGFβ signaling and mechanical loading/mechanosensing. We agree with the reviewers that we would need to generate additional histological and molecular data from conditional knockout mice, antibody and (ant)agonist treatments and the optogenetic model to determine their exact involvement in lining macrophage maturation. These experiments require significant time and other resources.
      • We would therefore like to uncouple this question for a follow-on manuscript. We will re-focus the manuscript on the developmental data providing a molecular and cellular blueprint of lining macrophage development. This will include our data on CSF1 as a key signal. The novelty and relevance of our developmental data have been highlighted by all three reviewers, and they have also praised the rigor of these experiments and their interpretation. We thus believe that this re-focus will improve the manuscript's message.
      • To further enhance this, we propose to include additional data delineating the developmental dynamics of synovial fibroblasts. We have generated an in-depth single-cell RNA sequencing dataset but did not include fibroblast-specific analyses in the original manuscript. This change was not requested by the reviewers, but we believe it would be an impactful addition to a revised version of our study, also providing data on the maturation of the synovial (lining) macrophage niche.
      • We will otherwise respond to all individual reviewer comments and implement the requested changes, unless technically not possible. Please find below detailed point-by-point answers.

      Reviewer #1

      Evidence, reproducibility and clarity

      In their manuscript entitled "The synovial lining macrophage layer develops in the first weeks of life in a CSF1- and TGFβ-dependent but monocyte-independent process," the authors explore the developmental trajectory of synovial lining macrophages. They demonstrate that the formation of this specialized macrophage layer is age-dependent and governed by a distinct developmental program that proceeds independently of circulating monocytes. Through scRNA-Seq, the authors show that synovial lining macrophages originate locally from Aqp1⁺ macrophages and are marked by the expression of Csf1r, Tgfbr, and Piezo1. Notably, genetic ablation of each of these factors impaired the development of lining macrophages to varying degrees, suggesting differential contributions of CSF1, TGFβ, and PIEZO1 signaling pathways to their maturation and maintenance.

      The manuscript is well written, and the data quality and representation is of a high standard. The authors have employed a sophisticated array of state-of-the-art mouse models and cutting-edge technologies to elucidate the developmental origin of synovial lining macrophages. Notably, the supporting scRNA-Seq datasets are of excellence and provide valuable insights that will likely be of significant interest to researchers in the field of immunology and joint biology. Accordingly, the experimental approach and interpretations regarding macrophage origin are well-founded and compelling. However, in the eye of the reviewer, the section addressing the underlying molecular mechanisms is a bit less convincing. This part of the study appears slightly underdeveloped, and some of the mechanistic claims lack sufficient experimental clarity. A more rigorous experimental investigation would be essential to reinforce the manuscript's conclusions, particularly concerning the data related to Tgfbr and Piezo1, where the current evidence appears insufficiently substantiated.

      We thank the reviewer for their positive and constructive evaluation of our manuscript. We agree with them (and the other reviewers) that our functional data on the involvement of TGFβ signaling and mechanical loading/mechanosensing are comparably less convincing and substantiated than our developmental data. We are very grateful for their (and the other reviewers’) suggestions to provide more support for the involvement of these factors in lining macrophage development. However, we think that carrying this out to the same high standard will require substantial time and other resources. We have therefore decided to uncouple this from the developmental data and pursue this in follow-up work. We will re-focus the current manuscript on the developmental data. We have proposed to the editors to instead include additional data on synovial fibroblast development, to complement our macrophage data and also delineate the maturation of their niche, thereby providing a conclusive developmental atlas.

      Major point:

      1. The numbers of VSIG4⁺ macrophages appear either unaffected or only minimally altered in both Csf1rMerCreMer Tgfbr2floxed and Fcgr1Cre Piezo1floxed mouse models, respectively. This raises an important question: was the gene deletion efficiency sufficient in each model? Accordingly, the authors are encouraged to include quantitative data on gene deletion efficiency for both mouse models, as this information is critical for interpreting the observed phenotypic outcomes and validating the conclusions regarding gene function. Furthermore, to better assess the impact of Tgfbr2 and Piezo1 disruption, the authors should provide more comprehensive flow cytometry analyses and histological data for these mouse models. Given the apparent homogeneity of VSIG4⁺ macrophages (as shown by the authors themselves), bulk RNA-Seq of sorted Tgfbr2- and Piezo1-deficient VSIG4⁺ macrophages (or from TGFβ-treated animals) would offer valuable insights into both the effectiveness of gene deletion and the molecular pathways governed by TGFβ and PIEZO1 in lining macrophages.

      As outlined above, we have decided to uncouple our functional data on TGFβ, Piezo1 and mechanical loading. The points raised here are all very valid, and we will implement your suggestions in our follow-up functional work focusing on signaling events regulating lining macrophage development. On the suggestion to perform bulk RNA sequencing for VSIG4+ macrophages: This is a good one in principle – although we will not be able to use this strategy where we want to assess the consequences of experimental treatments or genetic models on lining macrophage maturation, because acquisition of VSIG4 is a key maturation event that might be impaired in these conditions.

      Minor points:

      Consistent usage of Cx3cr1-GFP+ nomenclature (for instance: Fig. S1 legend "adult mouse synovial tissue, showing PDGFRα⁺ fibroblasts (yellow) and CX3CR1-GFP⁺ cells (cyan)." versus Fig. 1 legend "Automated spot detection highlights Cx3cr1-GFP⁺ macrophages)".

      We will implement these changes.

      Unclear Fig. 3 legend: "Representative immunofluorescence images of synovial tissue from Clec9aCre:Rosa26lsl-tdT mice at 3 weeks and in adulthood, showing and tdTomato (yellow) and stained for DAPI (blue), VSIG4 (cyan)" Check 'showing and tdTomato.'

      We will implement these changes.

      For greater clarity, it would have been helpful if the transcript names had been directly included within Figures 3C, S3A, and S3C.

      We will implement these changes.

      Page 24: "(Mki67CreERT2:Rosa26lsl-tdT)" Last bracket not superscript.

      We will implement these changes.

      Page 25: "we again leveraged our scRNAsequencing dataset" Missing punctuation.

      We will implement these changes.

      Page 27: Fig. 5C legend: " of synovial tissue of 1 week-old, 3 weeks-old and adult mice." Please specify and change to 'adult Csf1rΔFIRE/ΔFIRE mice'.

      We will implement these changes.

      Page 30: The outcome observed in the Acta1-rtTA:tetO-Cre:ChR2-V5fl mouse model appears to be inconclusive: "This approach resulted in an increased density of VSIG4+ and total (F4/80+) macrophages in the exposed leg of some 5 days-old pups, but others showed the opposite trend (Figure S5D)." This variability may reflect low efficiency of the model or other technical limitations (e.g. muscle contractions frequency or time point of analysis). Given this ambiguity, it is worth reconsidering whether the data are sufficiently robust to warrant inclusion. Should the authors choose to include these findings, further experimentation of appropriate depth and precision is required to allow a conclusive interpretation (either it increases the density of VSIG4+ macrophages or not). The same applies to the Yoda1-treated mice, for which additional data are needed to determine whether VSIG4⁺ macrophage density is truly affected.

      We have decided to remove the data on the optogenetic mouse model and Yoda1 treatment and follow up on them separately, implementing these suggestions, including proof-of-concept data for optogenetically induced muscle contractions.

      Significance

      General assessment: provide a summary of the strengths and limitations of the study. What are the strongest and most important aspects? What aspects of the study should be improved or could be developed? This is a well-designed study that uses cutting-edge methodologies to investigate the developmental trajectory of synovial lining macrophages under homeostatic conditions. The authors present robust experimental evidence and compelling interpretations concerning synovial macrophage origin, which are both well-substantiated and impactful. Nonetheless, from the reviewer's perspective, the section exploring the molecular mechanisms underlying macrophage differentiation is comparatively less convincing. This section appears somewhat underdeveloped, as some of the mechanistic claims lack sufficient depth and experimental rigor to fully substantiate the conclusions.

      Describe the nature and significance of the advance (e.g. conceptual, technical, clinical) for the field: In contrast to earlier studies (PMID: 31391580, 32601335), the inclusion of fate-mapping experiments adds an important dimension, offering novel insight into the ontogeny of synovial macrophages. This expanded perspective may prove particularly valuable in advancing our understanding of joint immunology, especially regarding the local origins and lineage relationships of macrophage populations.

      Furthermore, the authors present novel insights into the molecular pathways underlying the differentiation and development of synovial lining macrophages. By demonstrating previously unrecognized regulatory mechanisms, this work significantly deepens our understanding of the cellular and transcriptional programs that drive macrophage specialization within the joint microenvironment.

      Place the work in the context of the existing literature (provide references, where appropriate): This study builds upon previous work characterizing the macrophage compartment in the joint (PMID: 31391580, 32601335), yet provides a substantially more comprehensive dataset that spans multiple developmental time points and data on the origin of this specialized macrophage subset.

      State what audience might be interested in and influenced by the reported findings: Immunologists, clinicians

      Define your field of expertise with a few keywords to help the authors contextualize your point of view. Indicate if there are any parts of the paper that you do not have sufficient expertise to evaluate. This study falls well within the scope of the reviewer's expertise in innate immunity.

      Reviewer #2

      Evidence, reproducibility and clarity

      In the manuscript "The synovial lining macrophage layer develops in the first weeks of life in a CSF1- and TGFβ-dependent but monocyte-independent process", Magalhaes Pinto and colleagues carefully employ a wide range of technologies including single cell profiling, imaging and an exceptional combination of fate mapping models to characterize the ontogeny and development of lining macrophages in the joint, thus dissecting their maturation during postnatal development. Over the last decade, several landmark studies highlighted the imprinting of tissue-resident macrophages by a combination of ontogenetic and tissue-specific niche factors during development. So far, the ontogeny and the tissue niche factors governing the development and maturation of lining macrophages have not been described. Therefore, the results of this study offer insights into a small, highly adapted macrophage population with relevance in many disease settings in the joint. Furthermore, the findings nicely showcase how macrophages specialize to even very small tissue niches across development within one bigger anatomical compartment to serve dedicated functions within this niche.

      This manuscript is beautifully written and highlights many novel, highly relevant findings on lining macrophage biology and the authors employ a wide range of different technologies to carefully dissect the postnatal development of lining macrophages.

      In particular, the combination of scRNA-seq and fate mapping provides a unique link between transcriptional programs and ontogeny within the tissue niche. Furthermore, the integrative use of distinct fate mapping strategies, transgenic mouse lines, and treatment paradigms to elucidate key niche factors guiding the development and maturation of lining macrophages provides many interesting findings and data that are highly relevant to the field. I really enjoyed reading this manuscript.

      Thank you for your complimentary and constructive assessment of our manuscript, and the detailed comments below, which are very helpful. Please find point-by-point responses below.

      Major points:

      The authors show dynamic regulation of VSIG4 in lining macrophages during development; therefore, VSIG4 is maybe not an ideal choice for gating strategies to define lining macrophages or to show as a single marker in immunofluorescence (IF) stainings to demonstrate their abundance across development (even though it is clear that this is the reason why the F4/80 staining is shown next to it). To demonstrate the increase of lining macrophages during development in IF, it would be more helpful if the authors would show quantifications of all F4/80+ cells and additionally VSIG4+ as a proportion of F4/80+ cells (or VSIG4+ F4/80+ and all F4/80+ in a stacked bar plot).

      We agree with the assessment that VSIG4 is not ideal, since it is a key marker of mature lining macrophages only. We will provide these additional analyses.

      In Figure 1C, the authors nicely demonstrate that the lining macrophages get closer in their distance across development to build the epithelial-like macrophage structure along the adult lining. Is the close proximity between lining macrophages already fully "matured" at 3 weeks of age and comparable to adults? Please quantify the distance in adult linings.

      We will provide data for adult joints.

      Can the authors explain how the grouping was performed between the analyzed human fetal joints? It is not clear why the cut was chosen between the groups at 16/17 weeks of age. Maybe it would also be beneficial if the authors would consider not grouping these samples but rather show the specific quantifications for each sample individually and estimate via linear regression the expansion over time across human development. Furthermore, can the authors give additional information about the distancing of lining macrophages in the human fetal samples? It would be great to see if they follow the same dynamics as in mouse. Maybe a comparison to human juvenile/adult joints would also help to substantiate the findings in human samples (if possible).

      We will show samples ungrouped and perform linear regression analysis as suggested.

      The scRNA-seq analysis leaves several questions open and some conclusions and workflows cannot be easily followed.

      We appreciate this comment and the complexity of the data, and will implement the below recommendations, and clarify the issues raised.

      It is not clear how and especially why the signature genes to define macrophages vs. monocytes were chosen. Especially as the signature genes for monocytes would not include patrolling monocytes and the macrophage signature genes seem to be highly regulated during development, see also Apoe expression in NB vs. adult in Figure S2e. Why did the authors not take classical markers such as Itgam, Fcgr1a, Csf1r?

      Can dendritic cell signatures be excluded? Cluster 11 and 12 show indeed some DC markers, are these really macrophages?

      The authors provide several figure panels showing TOP marker genes or key marker genes for the identified clusters, however it is not clear if these are TOP DE genes or if the genes were hand chosen. Somehow, the authors give the impression that the clusters were chosen and labeled not based on DE genes, but more on existing literature that previously reported these macrophage populations. DE gene lists for all annotated cell types and macrophage clusters need to be provided within the manuscript.

      The authors claim that Clusters 1 and 4 are "developing" macrophages. How is this defined? Why are these developing cells compared to other clusters? And why are these clusters later on not considered as progenitors of Aqp1 macrophages and Vsig4 macrophages? Why are Aqp1+ macrophages not labeled as developing when they are later on in the manuscript shown as potential intermediate progenitors of lining macrophages?

      Furthermore, it is again confusing that markers are used throughout Figure 2 which are labeled as "key marker genes" for a population and then later on they are claimed to be regulated during development within this population, see for example Figure 2D and 2H.

      It is appreciated that the authors distinguished cycling clusters such as 8, 9, and 10 based on their cycling gene signature. Here it would be very exciting to see a cell cycle analysis across all clusters and time points to see when exactly the cells are expanding during development; this would also substantiate the data later shown for the Mki67-CreERT2 mouse model.

      Can the authors identify certain gene modules during development of lining macrophages (and/or their progenitors) which are associated with certain functions (e.g. GO terms, GSEA enrichment)?

      To determine the actual presence of the identified macrophage clusters from the scRNA-seq as macrophage populations in the joint, the authors should perform IF or FACS for key markers. Especially, Aqp1+ macrophages should be shown in the developing joint.

      We will provide additional data, but would also like to reference a study by collaborators currently in revision at Immunity, which characterizes the Aqp1+ population in detail. We are hoping to have a doi available during our revision process.

      The authors used a wide range of fate mapping models, which is quite unique and highly appreciated. The obtained results and the conclusions made from the models raise a couple of questions: Whereas contribution of HSC-derived/monocyte-derived macrophages to the lining compartment seems to be minor, there is still labeling across different models. Various aspects would need to be clarified.

      We will clarify these data throughout as per below suggestions.

      For example, the authors employ Ms4a3-Cre as a tracing model for GMP-derived monocytes, however all quantifications of the labeling efficiency are not normalized to the labeling in monocytes or another highly recombined cell population. This should be shown, similar to the other fate mapping models (Figure 3 F-I).

      Labelling efficacy for Ms4a3-Cre is near complete for GMP-derived monocytes (and neutrophils) with the Rosa-lsl-tdT (aka Ai14) reporter we have used (see also PMID: 31491389 and doi: 10.1101/2024.12.03.626330); but we will include normalized data as requested.

      Please show Ms4a3 expression across clusters across time points, to exclude expression in fetal-derived clusters.

      We will include this in the revised supplementary information, but there is indeed very little at birth (in line with the original report for other tissues PMID: 31491389).

      In line with the question raised above: can the authors exclude development of the Egfr1+ and Clec4n+ developing macrophages into Aqp1+ macrophages and subsequently into Vsig4 lining macrophages? The data obtained from the Ms4a3-Cre model strongly suggest correlated labeling across these clusters, which could imply a relationship. However, the authors do not discuss the role of these developing macrophages anywhere in the manuscript. They are strongly encouraged to include this, as it would be highly relevant for understanding lining macrophage development.

      This is an interesting point and we agree it deserves consideration in the revised manuscript. Indeed, our trajectory analyses do not predict differentiation of the Egfr1+ and Clec4n+ developing macrophages into Aqp1+ macrophages, and hence, ultimately lining macrophages. Conversely, Aqp1+ cells might also convert into Egfr1+ and Clec4n+ developing macrophages. We will elaborate on this more in the revised manuscript.

      The authors conclude from the pseudo bulk transcriptomic profiling of the different macrophage clusters that TdT+ and TdT- macrophages do not differ in their gene expression profile and that this is due to niche imprinting rather than origin imprinting. Even though the data supports that conclusion, the authors should verify if inkling cells early during development also show this similar gene expression profile and gene expression should be compared at the different developmental time points. Tissue niche imprinting is happening within the niche during development, most likely in a stepwise progress, and therefore there should be differences in the beginning.

      This is another important point that we will address in the revised manuscript by performing additional differential gene expression analyses at the different developmental time points, including the earliest stages, as suggested.

      The trajectorial analysis using different pseudotime pipelines is very interesting and nicely points out the potential role of Aqp1 macrophages as intermediates of Vsig4 lining macrophages. From my point of view, all trajectories seem to suggest that Egfr1 developing macrophages and Clec4n developing macrophages might differentiate into Aqp1 macrophages, however the authors are not exploring this further and the role of both developing macrophage clusters is not further discussed (see also comments above).

      We will address and discuss this in the revised manuscript.

      How was the starting point of the trajectorial analyses defined and is it the same for each pipeline used?

      We will clarify this in the revised manuscript.

      Are there potentially two trajectories? It looks like there is one in the beginning of postnatal life and a second one appearing from the monocyte-compartment later in life. If this is true, that would rather speak for a dual ontogeny of Vsig4+ macrophages, wouldn't it?

      We will discuss this in the revised manuscript.

      A heatmap (transcriptional shift) of trajectories between more clusters should be shown at least for Cluster 0,1,2, and 3. It is not sufficient to demonstrate this only between two clusters.

      We will add these analyses during revision.

      To show the similarity between Aqp1 macrophages and proliferating macrophage clusters, the authors should remove the cycling signature and compare these clusters to show that the cycling cells might be Aqp1 macrophages or earlier developing macrophage progenitors aka Clec4n or Egfr1 macrophages.

      We will address this in the revised manuscript.

      The conclusions drawn from the Mki67-CreERT2 data are a bit difficult to follow, because all progenitors (monocyte progenitors and macrophage progenitors) will proliferate at the neonatal time point, so no conclusions can be drawn as to whether the cells expand in the niche. The authors should employ Confetti mice or other models (Ubow mice) to analyze clonal expansion in the niche.

      We agree that interpretation of the Mki67-CreERT2 data is complicated by labeling of other cells, notably labeling observed in BM-derived cells. We will highlight this better in the revised manuscript. We have tried using Ubow mice to address this issue, but the recombination efficacy we obtained was too low to draw conclusions. We will address this during revision.

      All predicted cell-cell interactions between macrophages and fibroblasts should be provided in a supplementary table. Are the interactions shown in Figure 5 selected interactions or the top predicted ones? Since the authors show different numbers of interactions, these are most likely hand-picked and therefore biased.

      We will provide a full list of all predicted interactions in the revised supplementary material in addition to a list of the full differential gene expression analysis.

      The authors further aim to dissect the factors involved in the developmental niche imprinting of lining macrophages. Even though it is highly appreciated that the authors used so many experimental setups to show the reliance of lining macrophages on Csf1 and TGF-beta as well as mechanosensation, the wide range of models, the different methods used, and the selected developmental time points make it very difficult to interpret the data. The authors should carefully choose time points and methods (either FACS analysis across all models or IF across all, or both). Often, deletion efficiencies for transgenic models and proof of concept that the inhibitors and agonists work in the treatment paradigm are not provided. For example, Csf1rMer-iCre-Mer Tgfbr2fl/fl mice are used, but neither deletion efficiency nor different time points of analysis are shown; the macrophages may not be properly targeted in this setup.

      We have decided to uncouple our experimental data on Tgfb, Piezo1 and mechanosensing/mechanical loading, but are taking this into consideration for revision. In many cases, we have in fact performed flow cytometry and imaging analyses, and agree, we should be showing this consistently.

      The authors have shown the role of Csf1 and Tgfbr2 only for lining macrophages. Is this specific to this population in the joint, or are sublining macrophages affected in a similar manner?

      We will include data on sublining macrophages in the revised figure (for CSF1; Tgfb data will be uncoupled from this current manuscript).

      Can the authors confirm their results in CSF1R-FIRE mice with anti-Csf1 injections or in Csf1op/op mice?

      We will expand our discussion of the Csf1 findings, and will consider including anti-CSF1 data during revision. Phenotypes on other Csf1(r) deficient mice are published, if not with the same developmental resolution as our time course in Csf1rFIRE knockout mice and with simpler readouts. Csf1op/op mice are indeed deficient in synovial lining macrophages, from 2 days of age onwards (PMID: 8050349), and lining macrophages are also absent from 2-weeks-old and adult Csf1r-/- mice (PMID: 11756160).

      The setup in Figure S5G is very interesting to test the role of movement and mechanical load on the joint, however, there is basically no data on the model provided showing the efficiency of the induced optogenetic muscle contractions, and only one time point is shown.

      Data on mechanical loading will be uncoupled from the current manuscript and substantiated in a separate follow-up.

      The results regarding the role of Piezo1 and mechanosensation vary a lot. Could it be that analyses were done too early, or that proper weight load must actually be applied to the joint for the maturation of the macrophages? The authors should test this too.

      We will uncouple these data from the current manuscript during revision. However, this is a possibility that we have discussed. In fact, the most appropriate experimental approach to address the involvement of mechanical loading, onset of walking and specifically, weight bearing would be a loss-of-function approach (i.e. paralysis at the newborn stage), for which we unfortunately could not obtain ethics approval from the UK Home Office.

      The Rolipram experiment is shown in Figure S5G, but is not described in the result section. It only appears at some point in the discussion part. The authors should move it to results or remove it from the manuscript.

      We will incorporate these data with the revised section on developing synovial macrophage populations.

      Minor points:

      Please reference the Figure panels in numeric order throughout the text.

      We will change this where not the case.

      Figure 2a and 2b are a bit out of the storyline, it is not obvious why this is shown here and maybe it would be good to move it to the supplements. Gating strategy is also not used for scRNA-seq. Therefore, it would better fit to the later analysis of joint macrophages across different transgenic mouse models and treatment paradigms. The gating strategies are changing across different experiments throughout the figures, it would be nice to have a similar gating strategy for all experiments, see also Figure 3 where the defining markers for joint macrophages are changing between models.

      We will revise Figures 2, 3 and the related supplementary figures.

      A lot of figure panels have very small labeling that is basically unreadable. Axes at FACS plots for example. Sometimes, it is even impossible to distinguish cluster labels especially when they have similar colors.

      We will revise this, thanks for pointing it out.

      In the text on page 14, many markers are named which are specifically regulated during development in lining macrophages, but these factors are not labeled anywhere in the volcano plot. It would be good to showcase at least some of these named genes in the figure panel, e.g. Trem2.

      We will do this for revision.

      Figure 2F and Figure S2F nicely show the percentage of cells per cluster in each analyzed biological sample. Maybe the authors could additionally consider showing a stacked bar plot with the mean percentage of cells per cluster and how the clusters are distributed across time points?

      We will include this in the revised manuscript.

      Figure 3A: IF for adult lining macrophages and the quantification are missing.

      This will be included in the revised version.

      Significance

      This manuscript highlights novel, highly relevant findings on lining macrophage biology and the authors employ a wide range of different technologies to carefully dissect the postnatal development of lining macrophages. Furthermore, this study showcases in a very elegant and detailed way the adaptation of macrophage progenitors to a highly specific anatomical tissue niche.

      The manuscript is of high interest to basic scientists focussing on macrophage biology and immune cell development and clinicians and clinician scientists focussing on joint diseases such as RA.

      Therefore the manuscript is of interest to a wide community working in immunology.

      Reviewer #3

      Summary:

      Magalhaes Pinto, Malengier-Devlies, and co-authors investigated the developmental origins and maturation of synovial (lining and sublining) macrophages across embryonic, newborn, and postnatal stages in mouse. The authors used multiple transgenic reporter lines, lineage tracing, scRNA-seq, 2D confocal and 3D lightsheet imaging, and perturbations to delineate the macrophage states and ontogeny. They propose a model in which the majority of the joint lining macrophages has a fetal (EMP-derived) origin and a small proportion has a definitive HSC-derived monocyte origin, which both seed and mature within the synovial space in the postnatal period in the first 3 weeks of life. Using cell-cell communication analysis on their scRNA-seq data, they identified Fgf2, Csf1, and Tgfb as candidate signaling pathways that support (lining) macrophage development and maturation. Functional experiments indicate that the process is CSF1 and TGFb-dependent and also partly dependent on mechanosensing through Piezo1.

      The key conclusions on the composition of the synovial macrophages are convincing based on the presented results, and are carefully phrased. The study is very comprehensive, yet the description and organization of the results of the different mouse models could be altered to improve the storyline. Several refinements in data presentation, formulation, and minor validation experiments would further improve the clarity of the story, as well as summary recaps of the major findings throughout the text.

      We thank this reviewer for their detailed review. We will be implementing the requested changes wherever technically feasible.

      Major comments:

      Generally, the story could be more streamlined by introducing the reporter lines and lineage-origin logic earlier. Clearly state which reporter/CreERT2 lines and crosses are used. It was unclear in Figure 2 that cells from the cross of the Cx3cr1-GFP and Ms4a3Cre:Rosa26lsl-tdT reporter lines were used for the scRNA-seq. The principle that there are fetal-derived and bone marrow (GMP)-derived monocytes and macrophages doesn't need to be "hidden" until Figure 3. For example, the imaging of Ms4a3Cre could also be introduced before the scRNA-seq.

      We will revise the structure and order of the manuscript during revision.

      Figure 1 could benefit from a cartoon visualizing the anatomy of the knee joint. The terms "sublining" and "synovium" are now a bit unclear, as it appears that sometimes the synovium is indicated as sublining and vice versa. Additionally, a schematic developmental timeline could be added to indicate the parallels between mouse and human development (fetal and postnatal development in mouse versus gestational age in human). Also, the various waves of hematopoiesis could be indicated in this timeline, which would be particularly helpful for Figure 3 for the lineage-tracing readouts. Lastly, the authors could end the manuscript (a new Figure 6) with a general cartoon summarizing all the results presented.

      We will include illustrations as suggested.

      Figure 1 could be rearranged: first introduce the markers CX3CR1 and VSIG4 (Figure 1D) and then present the quantifications (Figure 1B/E). Where possible, co-visualize CX3CR1-GFP and VSIG4 on tissue sections to strengthen the claims on the relationship between these 2 markers. Tying the scRNA-seq insights (Figure 2) to the imaging would be elegant. Moreover, it would be informative to represent the CX3CR1+ and VSIG4+ macrophages as a percentage of F4/80+ macrophages (Figure 1B/E). Similarly, for the flow cytometry data in Figure 2, the relationship between the markers CX3CR1 and VSIG4 on macrophages could be more clearly displayed and discussed.

      Thanks for this remark. We will endeavor to show co-localization and analysis of both markers wherever possible. However, where we did not use Cx3cr1gfp mice, co-staining was limited by antibody choice.

      The 3D imaging of the joint is a nice addition to the manuscript, as it provides more context to the anatomical structure; however, while the text suggests several newborn joints were imaged, Figure 1F visualizes (again) the knee joint. Could other joints also be represented by 3D imaging? If the knee joint is the only joint available for imaging, and previous confocal imaging focused specifically on the meniscus in the knee joint, could the meniscus also be highlighted in the lightsheet imaging?

      Apologies if this was not clear from the original manuscript text, but we have only imaged the knee joint in 3D. We will clarify this during revision and consider inclusion of additional imaging data.

      Clarification is requested regarding the imaging quantification representation. The M&M section under "Statistical analysis and reproducibility" states that individual data points are displayed, and bars represent the mean. However, some of the Figure legends (e.g., Figures 1B and S1C) specify that each dot corresponds to an individual mouse, with quantification based on 2-3 sections per mouse. While this appears to be a very reasonable representation of the data, does this mean that for each dot, the mean value from the 2-3 sections per mouse was calculated and plotted?

      We will clarify this.

      It is not clear how the differential expression analysis was performed on the Vsig4+ cells. Please specify whether Cluster 0 or all Vsig4-expressing cells were used for the analysis. Not all cells in Cluster 0 express Vsig4. The authors describe the expression dynamics of Aqp1 as intriguing, but do not explain why this is interesting.

      We will revise this section.

      Figure S3E: In line with the previous comment, can the authors justify that the tdTomato+/- comparisons are not biased by scRNA-seq dropout (scRNA-seq is zero-inflated, so some tdTomato- cells could be false negatives), and provide methodological details (thresholds, ambient RNA correction, etc.) to support this?

      We will clarify this and include additional representations of the tdTomato transcript data.

      Although the sex-related differences in macrophage composition and the absence of differential expression are interesting, they distract from the manuscript's main messages. Moreover, the Discussion does not elaborate on how these observations relate to joint (disease) biology. Consider removing this section or integrating it clearly into the relevant biological context.

      We will remove this section as suggested.

      CreERT2 transgenic lines are often not 100% efficient in recombination, which also depends on whether tamoxifen or 4-OHT is used. Could the authors report the percentage of tdTomato+ cells in the joints and compare them to the recombination efficiencies in the monocytes/microglia under the same tamoxifen or 4-OHT conditions? This would help clarify how to interpret the macrophage labeling percentages.

      We will report labelling efficacies and/or show normalized data in the revised manuscript.
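      As an illustration of what "normalized data" could mean here, the following is a minimal sketch under one common convention: the percentage of tdTomato+ cells in the population of interest is divided by the percentage in a reference population labeled under the same conditions. The function name and the numbers are hypothetical and are not taken from the manuscript; the authors may normalize differently.

```python
def normalized_labeling(target_pct: float, reference_pct: float) -> float:
    """Express labeling in a target population relative to a reference population.

    target_pct:    % tdTomato+ cells among, e.g., joint macrophages (hypothetical input)
    reference_pct: % tdTomato+ cells among the reference population, e.g., blood
                   monocytes, under the same tamoxifen/4-OHT conditions
    Returns the target labeling as a percentage of the reference labeling.
    """
    if reference_pct <= 0:
        raise ValueError("reference labeling must be positive to normalize against it")
    return 100.0 * target_pct / reference_pct


# Hypothetical example: 30% of macrophages and 60% of reference cells are labeled.
example_normalized = normalized_labeling(30.0, 60.0)
```

      With these made-up values, half of the labeling seen in the reference population is recovered in the target population, which is easier to compare across lines with different recombination efficiencies than the raw percentages.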

      Could the authors draw parallels between the observations in the mouse knee joint macrophage populations and literature on other joints in mouse and the knee joint in human (for example, as described in Alivernini et al., 2020 and in the very recent Raut et al., 2025)?

      We will include a section on this in the revised manuscript.

      Minor comments:

      In general, the authors should clarify in the Results what each marker used for imaging, flow cytometry, or in the mouse reporter lines delineates. For example, mention that F4/80 is a marker for tissue-resident macrophages (correct?) in immunofluorescence, that IBA1 is a marker for macrophages on human tissue sections (Figure S1), and PDPN is GP38 (Figure S2 - align usage of marker reference across main text and figures).

      We will implement this request.

      For clarity in the microscopy representation, the single channels should be represented in a grey scale.

      We will revise image presentation.

      Figure S1B: Is CX3CR1 also restricted to the lining macrophages in human? Could a co-staining with IBA1 be performed to strengthen the species similarities?

      To our knowledge, there is no antibody available that works for imaging of human CX3CR1. Moreover, CX3CR1 is limited to the lining population only in adult joints; in fetal and newborn (mouse) joints, all macrophages express this receptor, as do fetal progenitors of macrophages. However, Alivernini and colleagues have reported that TREM2high macrophages are the human counterpart of the mouse CX3CR1+ lining population (PMID: 32601335).

      Adipocyte diameter quantification: Avoid plotting individual adipocytes from 2 mice without per-mouse visualization. Instead, report the mean adipocyte diameter per mouse and plot those means.

      We will implement this change.
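      The per-mouse summary requested above can be sketched as follows. This is a minimal illustration only: the mouse identifiers and diameters are invented, and the function name is hypothetical. Each mouse contributes one value (the mean of its adipocyte measurements), so the mouse rather than the individual cell is the unit that gets plotted.

```python
from collections import defaultdict
from statistics import mean


def per_mouse_means(measurements):
    """Collapse (mouse_id, diameter) pairs into one mean diameter per mouse."""
    by_mouse = defaultdict(list)
    for mouse_id, diameter in measurements:
        by_mouse[mouse_id].append(diameter)
    # One summary value per animal; these means are what would be plotted.
    return {mouse_id: mean(values) for mouse_id, values in by_mouse.items()}


# Invented example data: adipocyte diameters (in micrometres) from two mice.
example_measurements = [
    ("mouse_A", 40.0),
    ("mouse_A", 50.0),
    ("mouse_B", 30.0),
    ("mouse_B", 34.0),
    ("mouse_B", 38.0),
]
example_means = per_mouse_means(example_measurements)
```

      Plotting `example_means` gives two data points, one per animal, instead of five points that would overstate the number of independent observations.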

      A little typo was spotted in the "Statistical analysis and reproducibility" section: it is Dunn's, not Bunn's multiple-comparison correction.

      Thanks for spotting this.

      Figure 2A: The gating strategy for the CX3CR1-GFP cells is missing.

      We will provide this in the revised manuscript or supplementary material.

      Improve the visualization of some plots. For example, Figure 2F is hard to read because of the big dot size. The dots seem to add no information to the graph and could be removed. Additionally, for comparing the clusters across the different time points, one could project the cells from the other time points in grey in the background.

      We will revise the presentation of these data.

      Figure S2: The dotplot is more informative than the heatmap, consider removing the heatmap.

      We will do that.

      Figure 3A: If technically feasible, image and visualize both the GFP and tdTomato expression. It would be informative to see the Cx3cr1+ and Ms4a3-derived cells in the same specimen.

      We will strive to show this in the revised manuscript.

      Figure 3C: Highlight that tdTomato expression is visualized here.

      We will do that.

      Figure 3G,F: The authors should place the schematics and graphs next to each other, so the data points can be more easily compared.

      We aim to do this in the revised manuscript.

      Figure 4B: Which co-staining was performed for the immunofluorescence to quantify the % of tdTomato+ cells?

      We co-stained for F4/80 and assessed localization in the lining or sublining. This will be clarified in the revised Figure legend.

      Figure 4C: The trajectory analysis appears to have an arrow pointing from the Ccr2+ macrophages to the Ly6c+ monocytes. Please verify this directionality, as it seems to contradict the known biology.

      This will be addressed during revision.

      Figure 5 mentions that the Csf1r levels were reduced in a tissue-specific manner, but it is unclear how this tissue specificity was achieved.

      We apologize for this misunderstanding. Csf1rFIRE mice are not tissue-specific knockouts, but they are more specific than global knockout mice, since only a (myeloid-specific) enhancer is affected. We will clarify this in the relevant section.

      For the TGFb perturbations (Tgfbr2 KO and systemic TGFb depletion): did the authors validate reduced TGFb pathway activity in the macrophages, for example, reduced pSMAD2/3 levels? This would validate the effectiveness of the perturbations. This is an important point, and assessing signaling events downstream of TGFb is a very good suggestion.

      As per the above comment, we have decided to uncouple the functional data, with the exception of CSF1, from the revised version of the current manuscript, but we will take this into account when substantiating our functional data in follow-up work.

      Figure 5F could benefit from a timeline of the treatment.

      As for the previous point raised, we will be taking this into account for follow-up work on the uncoupled functional data.

      The Methods mention that Gene Ontology analysis was performed on the single-cell data, but the results are not plotted in a figure. It would be informative to include this GO/pathway analysis in the appropriate figure(s).

      We will include this in the revised (supplementary) information.

      Significance:

      This work provides a high temporal-resolution and "spatial" resolution reference map of the ontogeny and maturation of the synovial lining macrophages in the knee joint. It complements existing literature that demonstrated the presence of tissue-resident macrophages in the synovial space and lining (Culemann, et al., 2019 and others) by charting the embryonic-to-postnatal emergence of lining and sublining subsets. In particular, this mouse work identified some key signaling pathways in shaping this tissue compartment. This dataset serves as a robust, steady-state reference for joint pathology and can be implemented with human studies on disease biology of the knee joint (e.g., Alivernini et al., 2020; Raut et al., 2025). Insights into the exact developmental origins, mechanisms contributing to diverse or seemingly similar cell types, and distinct maturation processes are crucial to understanding disease biology, in which developmental processes can be hijacked/reactivated.

      These findings will interest researchers in joint disease biology (osteoarthritis and immune-mediated arthritides such as RA and psoriasis), macrophage development (tissue-resident vs monocyte-derived lineages), the bone/joint microenvironment, and joint mechanobiology.

      The reviewer's expertise is in developmental biology, mesoderm, bone biology, hematopoiesis, and monocyte/macrophage biology in disease.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      The authors set out to investigate how the population of microtubules (LSPMB) that originates from sporozoite subpellicular microtubules (SSPM) is remodelled during liver-stage development of malaria parasites. These bundles shrink over time and help form structures needed for cell division. The authors have used expansion microscopy, live-cell imaging, genetically engineered mutants, and pharmacological perturbation to study parasite development within liver cells.

      A major strength of the manuscript is the live cell imaging and expansion microscopy to study this challenging liver stage of parasite development. It gives important knowledge that PTMs of α-tubulin, such as polyglutamylation and tyrosination/detyrosination, are crucial for microtubule stability. Mutations in α-tubulin reduce the parasite's ability to move and proliferate in the liver cells. The drug oryzalin, which targets microtubules, also blocks parasite development, showing how important dynamic microtubules are at this stage.

      The major problem in the manuscript was the way it flows, as the authors keep shifting from the liver stage to the sporogony stages and then back to the liver stages. It was very confusing at times to know what the real focus of the study is, whether sporozoite development or liver stage development. The flow of the manuscript could be improved. Some of the findings reported here substantiate the previous electron microscopy.

      Overall, the study represents an important contribution towards understanding cytoskeletal remodelling during liver stage infection. The study suggests that tubulin modifications are key for the parasite's survival in the liver and could be targets for new malaria treatments. This is also the stage that has been used for vaccine development, so any knowledge of how parasites proliferate in the liver cells will be beneficial towards intervention approaches.

      We would like to express our sincere gratitude to Reviewer #1 for the positive and encouraging feedback on our manuscript. We are delighted that the reviewer found our experimental design and methodologies appropriate and that our study represents an important contribution to understanding cytoskeletal remodelling during liver stage infection, a critical phase for vaccine development. We are also grateful to the reviewer for highlighting the issue with the manuscript's flow. We acknowledge this limitation and will significantly improve the narrative structure and logical progression in the revised manuscript to ensure clarity and avoid any potential confusion. Thank you again for your thoughtful and constructive comments.

      Reviewer #2 (Public review):

      Summary:

      The authors investigated microtubule distribution and their possible post-translational modifications (PTM) in Plasmodium berghei during development of the liver stage, using either hepatocytes or HeLa cells as models. They used conventional immunofluorescence assays and expansion microscopy with various antibodies recognising tubulin and, in the second part of the work, its candidate PTMs, as well as markers of Plasmodium, in addition to live imaging with a fluorescent marker for tubulin. In the third part of the study, they generated 3 mutants deprived of either the last four residues or the last 11 residues, or where a candidate polyglutamylation site was substituted by an alanine residue.

      Strengths:

      In the first part, microtubules are monitored by a combination of two approaches (IFA and live), revealing nicely the evolution of the sporozoite subpellicular microtubules (SSPM, the sporozoite is the developmental stage present in salivary glands of the mosquitoes and that infects hepatocytes) into a different structure termed liver-stage parasite microtubule bundle (LSPMB). The LSPMB shrinks during the course of parasite development and finally disappears while hemi-spindles emerge over time. Contact points between these two structures are observed frequently in live cells and occasionally in fixed cells, suggesting the intriguing possibility that tubulin might be recycled from the LSPMB to contribute to hemi-spindle formation.

      In the second part, antibodies recognising (1) the final tyrosine found at the C-terminal tail and (2) a stretch of 3 glutamate residues in a side chain are used to monitor these candidate PTMs. Signals are positive at the SSPM; at the LSPMB, the signal remains positive for polyglutamylation but becomes negative for the final tyrosine, while a positive signal emerges at hemi-spindles at later stages of development.

      In the last part, the three mutants are fed to mosquitoes, where they show reduced development, the one lacking the alpha-tubulin tail even failing to reach the salivary glands. However, the two other mutants infect HeLa cells normally, whereas sporozoites with the C-terminal tail deletion recovered from the haemolymph did not develop in these cells.

      The first part provides convincing evidence that microtubules are extensively remodelled during the infection of hepatocytes and HeLa cells, in agreement with the spectacular Plasmodium morphogenetic changes accompanying massive and rapid proliferation. The third part brings further confirmation that the C-terminal tail of alpha-tubulin is essential for multiple stages of parasite development, in agreement with previous work (50). Since it is the region where several post-translational modifications take place in other organisms (detyrosination, polyglutamylation, glycylation), it makes sense to propose that the essential function is related to these PTMs also in Plasmodium.

      Weaknesses:

      The significance of tubulin PTM relies on two antibodies whose reactivity to Plasmodium tubulins is unclear (see below). The interpretation of the literature on detyrosination and polyglutamylation is confusing in several places, meaning that the statements about the possible role of these PTMs need to be carefully revisited.

      The authors use the term "tyrosination" but the alpha1-tubulin studied here possesses the final tyrosine when it is synthesised, so it is "tyrosinated" by default. It could potentially be removed by a tyrosine carboxypeptidase of the vasoinhibin family (VASH) as reported in other species. After removal, this tyrosine can be added again by a tubulin-tyrosine ligase (TTL) enzyme. It is therefore more appropriate to talk about detyrosination-retyrosination rather than tyrosination (this confusion is unfortunately common in the literature, see Janke & Magiera, 2020).

      The difficulty here is that there is so far no evidence that detyrosination takes place in Plasmodium. Neither VASH nor TTL could be identified in the Plasmodium genome (ref 31, something we can confirm with our unsuccessful BLAST analyses), and mass spectrometry studies of purified tubulin, albeit from blood stages, did not find evidence for detyrosination (reference 43). Western blots using an antibody against detyrosinated tubulin did not produce a positive signal, neither on purified tubulin, nor on whole parasites (43). Of course, the situation could be different in liver stages, but the question of the detyrosinating enzyme is still there. The existence of a unique Plasmodium system for detyrosination cannot be formally ruled out but given the high degree of conservation of these PTMs and their associated enzymes, it sounds difficult to imagine.

      The fact that the anti-tyrosinated antibody still produced a signal in the cell line where the final tyrosine is deleted raises issues about its specificity. A cross-reactivity with beta-tubulin is proposed, but the Plasmodium beta-tubulin does not carry a final tyrosine, further raising concerns about antibody specificity.

      The interpretation of these results should therefore be considered carefully. There also seems to be some confusion in the function of detyrosination cited from the literature. It is said in line 229 that "tyrosination has been associated with stable microtubules" (33, 34, 50, 55). References 33 and 34 actually show that tyrosinated microtubules turn over faster in neurons or in epithelial cells, respectively, while references 50 and 55 do not study de/retyrosination. The general consensus is that tyrosinated microtubules are more dynamic (see reference 24).

      The situation is a bit different for polyglutamylation since several candidate poly- or mono-glutamylases have been identified in the Plasmodium genome, and at least mono-glutamylation of beta-tubulin has been formally proven, still in bloodstream stages (ref 43). The authors propose that the residue E445 is the polyglutamylation site. To our knowledge, this has not been demonstrated for Plasmodium. This residue is indeed the favourite one in several organisms such as humans and trypanosomes (Eddé et al., Science 1990; Schneider et al., JCS, 1997), and it is tempting to propose it would be the same here. However, TTLLs bind the tubulin tails from their C-terminal end like a glove on a finger (Garnham et al., Cell, 2015), and the presence of two extra residues in Plasmodium tubulins would mean that the reactive glutamate might be in position E447 rather than E445. This is worth discussing.

      On the positive side, it is encouraging to see that signals for both anti-tyrosinated tail and poly-glutamylated side chain are going down in the various mutants, but this would need validation with a comparison for alpha-tubulin signal.

      Line 316: polyglutamylation "is commonly associated with dynamic microtubule behavior (78-80)". Actually, references 78 and 79 show the impact of this PTM on interaction with spastin, and reference 80 discusses polyglutamylation as a marker of stable microtubules in the context of cilia and flagella. The consensus is that polyglutamylated microtubules tend to be more stable (ref 24).

      Conclusion:

      The first and the third parts of this manuscript - evolution of microtubules and importance of the C-terminal tails for Plasmodium development - are convincing and well supported by data. However, the presence and role of tubulin PTM should be carefully reconsidered.

      Plasmodium tubulins are more closely related to plant tubulins and are sensitive to inhibitors that do not affect mammalian microtubules. They therefore represent promising drug targets as several well-characterised compounds used as herbicides are available. The work produced here further defines the evolution of the microtubule network in sporozoites and liver stages, which are the initial and essential first steps of the infection. Moreover, Plasmodium has multiple specificities that make it a fascinating organism to study both for cell biology and evolution. The data reported here are elegant and will attract the attention of the community working on parasites but also on the cytoskeleton at large. It will be interesting to have the feedback of other people working on tubulin PTMs to figure out the significance of this part of the work.

      We thank Reviewer #2 for the thoughtful and detailed evaluation of our manuscript. We are pleased that the reviewer found our study elegant and believes it will attract the attention of the broader scientific community, both those working on parasites and those focused on cytoskeleton biology. We also acknowledge the concerns raised regarding the specificity of the antibodies used to detect tubulin post-translational modifications (PTMs), as well as the interpretation of their signals and the current lack of identified detyrosination enzymes in the Plasmodium genome. We agree that these are important limitations, and we will address them thoroughly in the revised manuscript. This includes clarifying our interpretation of tyrosination versus detyrosination, adjusting our claims regarding polyglutamylation sites, and carefully revisiting the literature cited to ensure accurate contextualization of PTM function in microtubule stability.

      We are grateful for the reviewer’s close reading and critical feedback, which will help us substantially improve the clarity, precision, and strength of our manuscript.

      Reviewer #3 (Public review):

      Summary:

      The manuscript by Atchou et al. investigates the role of the microtubule cytoskeleton in sporozoites of Plasmodium berghei, including possible functions of microtubule post-translational modifications (tyrosination and polyglutamylation) in the development of sporozoites in the liver. They also assessed the development of sporozoites in the mosquito. Using cell culture models and in vivo infections with parasites that contain tubulin mutants deficient in certain PTMs, they show that many aspects of the life cycle progression are impaired. The main conclusion is that microtubule PTMs play a major role in the differentiation processes of the parasites.

      However, there are a number of major and minor points of criticism that relate to the interpretation of some of the data.

      We thank Reviewer #3 for the overall positive assessment of our study and for recognizing its contribution to advancing our understanding of Plasmodium biology and malaria pathogenesis. We appreciate the reviewer’s constructive feedback, particularly regarding the interpretation of some of our data. These comments have been very helpful in guiding our revisions, and we have worked to improve both the clarity of our presentation and the precision of our interpretations in the revised manuscript.

      Below, we respond in detail to each of the reviewer’s points.

      Comments:

      (1) The first paragraph of "Results" almost suggests that the presence of a subpellicular MT-array in sporozoites is a new discovery. This is not the case, see e.g. the recent publication by Ferreira et al. (Nature Communications, 2023).

      We thank the reviewer for pointing this out and fully agree that the subpellicular microtubule (SPM) array in sporozoites is well established, as documented in earlier work (e.g., Cyrklaff et al., 2007) and more recently by Ferreira et al. (Nat. Commun., 2023). Our intention was not to suggest that the existence of the SPM is a novel finding. Rather, our study builds on this existing knowledge by demonstrating that these sporozoite-derived microtubules are not disassembled upon hepatocyte entry but are repurposed into a newly described structure, the liver stage parasite microtubule bundle (LSPMB). This reorganization, its persistence into liver stage development, and its dynamic role in microtubule remodeling and nuclear division are, to our knowledge, novel observations. We will revise the manuscript to make this distinction clearer in the introduction and the results section.

      (2) Why were HeLa cells and not hepatocytes (as in Figure 3) used for measuring infection rates of the mutants in Figure 5H and 5L? As I understand, HeLa cells are not natural host cells for invading sporozoites. HeLa cells are epithelial cells derived from a cervical tumour. I am not an expert in Plasmodium biology, but is a HeLa infection an accepted surrogate model for liver stage development?

      We appreciate the opportunity to clarify our experimental model. While HeLa cells are not the natural host cells, they are a well-established and validated in vitro model for studying Plasmodium berghei liver stage development in our lab and others. In this system, the parasite completes its full development and generates infectious merozoites. Numerous studies have successfully used HeLa cells as a liver stage infection model, with key findings subsequently validated in primary hepatocytes or in vivo, confirming its utility as a representative model. We employed this cell line primarily to reduce animal usage in accordance with the 3Rs principles (Replacement, Reduction, Refinement). Importantly, to ensure the biological relevance of our discoveries in HeLa cells, we validated our key findings in primary mouse hepatocytes, as shown in Figure 3. Furthermore, we confirmed the in vivo infectivity of mutant parasite lines that produced typical salivary gland sporozoites through an in vivo infection assay, presented in Figure S4C.

      (3) The tubulin staining in Figures 1A and 1B is confusing and doesn't seem to make sense. Whereas in 1A the antibody nicely stains host and parasite tubulin, in 1B, only parasite tubulin is visible. If the same antibody and the same host cells have been used, HeLa cytoplasmic microtubules should be visible in 1B. In fact, they should be the predominant antigen. The same applies to Figure 2, where host microtubules are also not visible.

      We thank the reviewer for this careful observation regarding the α-tubulin staining in Figures 1A and 1B. The same host cell type (HeLa) and α-tubulin antibody were indeed used in both experiments. Figure 1A shows results from conventional immunofluorescence assays, where both host and parasite microtubules are clearly stained. In contrast, Figure 1B shows the outcome of ultrastructure expansion microscopy (U-ExM), where parasite microtubules appear prominently, while host microtubules are less visible.

      This effect appears to be a technical outcome of the U-ExM protocol, which can differentially preserve or reveal microtubule epitopes. We consistently observed stronger parasite signal across various cell types, including primary hepatocytes (Figure 3A,B). The lack of visible host microtubules in some U-ExM images does not reflect their absence, but rather reduced signal intensity relative to the parasite structures. This is not observed with all antibodies, e.g., host microtubules stain strongly with anti-tyrosinated α-tubulin (Figure 3B), likely reflecting their high tyrosination state.

      To overcome this limitation, we employed PS-ExM and combined PS-ExM/U-ExM approaches (as described in reference 56), which allowed simultaneous high-resolution visualization of both host and parasite microtubule networks. These combined methods are now being used in follow-up studies to investigate host–parasite microtubule interactions in more detail.

      We will clarify this point in the revised manuscript to avoid confusion.

      (4) In Figures 2A and B, the host nuclei appear to have very different sizes in the DMSO controls and in the drug-treated cells. For example, in the 20 µM (-) image (bottom right), the nuclei are much larger than in the DMSO (-) control (top left). If this is the case, expansion microscopy hasn't worked reproducibly, and therefore, quantification of fluorescence is problematic. The scalebar is the same for all panels.

      The expansion microscopy methods used in this study have been rigorously validated for both reproducibility and isotropy. However, as the reviewer rightly notes, host cell nuclei can vary in size due to several factors, including cell cycle stage, infection status, and the extent of parasite development, all of which can influence host nuclei morphology and size.

      Importantly, the quantifications relevant to our conclusions were focused specifically on parasite structures. We did not rely on host nuclear size or host fluorescence intensity as a quantitative readout in this context. While we acknowledge the observed variability in host nuclear dimensions, it does not compromise the accuracy or reproducibility of the parasite specific measurements central to our study.

      We will clarify this point in the revised figure legend and manuscript.

      (5) I don't quite follow the argument that spindles and the LSPMB are dynamic structures (e.g., lines 145, 174). That is a trivial statement for the spindle, as it is always dynamic, but beyond that, it has only been shown that the structure is sensitive to oryzalin. That says little about any "natural" dynamic behaviour. Any microtubule structure can be destroyed by a particular physical or chemical treatment, but that doesn't mean all structures are dynamic. It also depends on the definition of "dynamic" in a particular context, for example, the time scale of dynamic behaviour (changes within seconds, minutes, or hours).

      We agree that sensitivity to chemical depolymerization alone does not necessarily indicate dynamic behavior, particularly in the absence of data on turnover kinetics or temporal changes.

      Our interpretation was based on two observations: first, that the LSPMB, which derives from the highly stable sporozoite subpellicular microtubules (known to be drug-resistant), becomes susceptible to depolymerization during the liver stage; and second, that the LSPMB gradually shrinks over time during parasite development. These features suggested a transition toward a more dynamic state compared to its origin. However, we fully agree that “dynamic” is a context-dependent term and that direct evidence, such as turnover rates or structural changes on short time scales, is required to rigorously define microtubule dynamics.

      We will revise the manuscript to clarify our use of this term and explicitly acknowledge the need for further studies to characterize the timescale and mechanisms underlying LSPMB remodeling.

      (6) I am not sure what part in the story EB1 plays. The data are only shown in the Supplements and don't seem to be of particular relevance. EB1 is a ubiquitous protein associated with microtubule plus ends. The statement (line 192) that it "may play a broader role..." is unsubstantiated and cannot be based merely on the observation that it is expressed in a particular life cycle stage.

      We agree that EB1 is a ubiquitous microtubule plus-end binding protein and that its presence alone does not imply a novel function. Previous studies (e.g., Maurer et al., 2023; Yang et al., 2023; Zeeshan et al., 2023) have focused on its role during Plasmodium sexual stages, while its expression during liver and mosquito stages has not been previously documented.

      Our data extend this knowledge by showing that EB1 is also expressed during liver stage development, particularly during the highly mitotic schizont phase. While we agree that this observation alone does not prove functional involvement, it raises the possibility of a broader role for EB1 in regulating microtubule dynamics beyond sexual stages. To avoid overinterpretation, we have presented these findings in the supplementary material and will revise the manuscript to tone down speculative statements and clearly frame this as a preliminary observation that warrants further investigation.

      (7) Line 196 onwards: The antibody IN105 is better known in the field as polyE. Maybe that should be added in Materials and Methods. Also, the antibody T9028 against tyrosinated tubulin is poorly validated in the literature and rarely used. Usually, researchers in this field use the monoclonal antibody YL1/2. I am not sure why this unusual antibody was chosen in this study. In fact, has its specificity against tyrosinated α-tubulin from Plasmodium berghei ever been shown? The original antigen was human and had the sequence EGEEY. The Plasmodium sequence is YEADY and hence very different. It is stated that the LSPMB is both polyglutamylated and tyrosinated. This is unusual because polyglutamylated microtubules are usually indicative of stable microtubules, whereas tyrosinated microtubules are found on freshly polymerised and dynamic microtubules. However, a co-localisation within the same cell has not been attempted. This is, however, possible since polyE is a rabbit antibody and T9028 is a mouse antibody. I suspect that differences or gradients along the LSPMB would have been noticed. Also, in lines 207/208, it is said that tyrosination disappears after hepatocyte invasion, which is shown in Figure 3. However, in Figure 3A, quite a lot of positive signals for tyrosination are visible in the 54 and 56 hpi panels.

      First, we acknowledge that the IN105 antibody is more widely known as "polyE" in the field. We will update the Materials and Methods section accordingly to reflect this nomenclature.

      Regarding the use of the T9028 antibody against tyrosinated α-tubulin: we agree that this monoclonal antibody is less commonly used than YL1/2, and we appreciate the reviewer drawing attention to this. The original antigen for T9028 is based on the mammalian C-terminal sequence EGEEY, which differs from the Plasmodium α1-tubulin sequence (YEADY). Like many in the field, we face the challenge that most available antibodies are raised against mammalian epitopes, and specificity in Plasmodium can vary. Nonetheless, the literature (e.g., Hirst et al., 2022; Fennell et al., 2008) has demonstrated that tyrosination occurs in Plasmodium α1-tubulin, using anti-tyrosination antibodies including YL1/2.

      Following the reviewer’s excellent suggestion, we are currently repeating the key experiments using the YL1/2 antibody to compare staining patterns directly with those obtained using T9028. We will include these results in the revised manuscript.

      Concerning the potential co-localization of polyglutamylation and tyrosination on the LSPMB: we agree that this is an interesting and testable hypothesis. In the current manuscript, Figures 3A and 3B were generated from independent experiments, and thus co-localization was not assessed. However, as the reviewer correctly notes, polyE and T9028 antibodies are raised in rabbit and mouse, respectively, making co-staining feasible. We will follow up on this experimentally and, if feasible within our revision timeline, include data in the revised version or highlight this as a future direction.

      Finally, with regard to Figure 3 and the observation that tyrosination appears to persist at 54 and 56 hpi (Figure 3B): the reviewer is correct that tyrosination signal is still detectable at these time points. Our statement that tyrosination “disappears after hepatocyte invasion” was intended to refer to an overall decrease in signal intensity during early liver stage development, with a reappearance at later stages (e.g., cytomere formation). We will rephrase this section for greater clarity and ensure that figure annotations and legends unambiguously reflect the dynamics observed.

      (8) In line 229, it is stated that tyrosination "has previously been associated with stable microtubule in motility". This statement is not correct. In fact, none of the cited references that apparently support this statement show that this is the case. On the contrary, stable microtubules, such as flagellar axonemes, are almost completely detyrosinated. Therefore, tyrosination is a marker for dynamic microtubules, whereas detyrosinated microtubules are indicative of stable microtubules. This is an established fact, and it is odd that the authors claim the opposite.

      We fully agree that in canonical eukaryotic systems, tyrosinated microtubules are generally markers of dynamic microtubule populations, whereas detyrosinated microtubules are typically associated with stability, particularly in structures such as flagellar axonemes.

      Our original statement will be corrected. In our study, we observed that tyrosinated microtubules are prevalent in invasive stages (sporozoites and merozoites), while detyrosinated forms become more prominent during intracellular liver stage development. This pattern is consistent with the established link between tyrosination and dynamic microtubules.

      What is particularly intriguing in Plasmodium is the apparent cycling of tyrosination despite the absence of known tubulin tyrosine ligase (TTL) homologs in the genome. This suggests either a highly divergent enzyme or the involvement of host cell factors, a hypothesis supported by the reappearance of tyrosinated microtubules during liver stage schizogony (Figure 3B).

      We will revise the relevant text and the Discussion section to reflect these mechanistic considerations more accurately and to avoid misrepresenting established principles of microtubule biology.

      (9) Line 236 onwards: Concerning the generation of tubulin mutants, I think it is necessary to demonstrate successful replacement of the wild-type allele by the mutant allele. I am sure the authors have done this by amplification and subsequent sequencing of the genomic locus using PCR primers outside the plasmid sequences. I suggest including this information, e.g., by displaying the chromatograph trace in a supplementary figure. Or are the sequences displayed in Figure S3B already derived from sequenced genomic DNA? This is not described in the Legend or in Materials and Methods. The left PCR products obtained for Figure S3 B would be a suitable template for sequencing.

      Indeed, these data are presented in Figure 4B and the corresponding sequence data are shown in Figure S3B. We appreciate the reviewer’s suggestion, which will help improve the transparency and reproducibility of our methodology.

      (10) It is also important to be aware of the fact that glutamylation also occurs on β-tubulin. This signal will also be detected by polyE (IN105). Therefore, it is surprising that IN105 immunofluorescence is negative on the C-term Δ cells (Figure S3 D). Is there anything known about confirmed polyglutamylation sites on both α- and β-tubulins in Plasmodium, e.g., by MS? In Toxoplasma, both α- and β-tubulin have been shown to be polyglutamylated.

      Indeed, polyglutamylation is known to occur not only on α-tubulin but also on β-tubulin in many organisms, including Toxoplasma gondii, and the polyE (IN105) antibody is expected to detect polyglutamylation on both tubulin isoforms.

      The parasites shown in Figure S3D correspond to mutant lines originally generated by Spreng et al. (2019): the IntronΔ mutant (with deletion of introns in the Plasmodium α1-tubulin gene) and the C-termΔ mutant (with deletion of the final three C-terminal residues: ADY). As the reviewer correctly notes, this particular C-terminal deletion does not include the predicted polyglutamylation site (E445 or E447, depending on alignment), and thus should not abolish all polyglutamylation. However, in our experiments, the IN105 signal is substantially reduced in this mutant. This may suggest that structural alterations in the tubulin tail affect accessibility of the polyglutamylation epitope or influence the modification itself, though we cannot exclude other possibilities, including changes in antibody recognition.

      To date, polyglutamylation sites in Plasmodium tubulins have not been definitively confirmed by mass spectrometry. However, a recent MS-based study (reference 43) detected monoglutamylation on β-tubulin in blood stage parasites. Direct MS evidence for polyglutamylation of either α- or β-tubulin in Plasmodium liver stages is still lacking. We will clarify these points in the revised manuscript to avoid potential confusion and to highlight the need for future biochemical validation of PTM sites.

      (11) Figure S3 is very confusing. In the legend, certain intron deletions are mentioned. How does this relate to posttranslational tubulin modifications? The corresponding section in Results (lines 288-292) is also not very helpful in understanding this.

      The parasite lines shown in Figure S3D were originally generated by Spreng et al. (2019) and are not directly part of the main set of PTM-targeted mutants described in our study. Specifically, the IntronΔ line carries deletions in introns of the Plasmodium α1-tubulin gene, while the C-termΔ line lacks the final three C-terminal residues (ADY). These lines were included for comparative purposes to explore whether structural changes in α-tubulin could impact polyglutamylation signal, as detected by the polyE (IN105) antibody.

      We acknowledge that the figure legend and corresponding text (lines 288–292) did not adequately explain the rationale for including these control lines. We will revise both the legend and Results section to more clearly describe the origin, purpose, and relevance of these mutants to the overall study.

      (12) Figure 4E doesn't look like brightfield microscopy but like some sort of fluorescent imaging. In Figure 4C, were the control (NoΔ) cells with an integrated cassette, but no mutations, or non-transgenic cells?

      The reviewer is absolutely correct: Figure 4E shows a fluorescent image acquired using widefield microscopy and not a brightfield image. We will revise the figure legend accordingly to avoid confusion. The “BF” (brightfield) label applies only to the left panel in Figure 4C, which depicts oocysts imaged using transmitted light.

      Regarding the controls labeled "NoΔ" in Figure 4C, we confirm that these parasites contain the integrated selection cassette but do not harbor any mutations in the target gene. They serve as proper integration controls, allowing us to distinguish the effects of the point mutations or deletions introduced in the experimental lines.

      (13) It is difficult to understand why the TyΔ and the CtΔ mutants still show quite a strong signal using the anti-tyrosination antibody. If the mutants have replaced all wild-type alleles, the signal should be completely absent, unless the antibody (see my comment above concerning T9028) cross-reacts with detyrosinated microtubules. Therefore, the quantitation in Figures 5F and 5G is actually indicative of something that shouldn't be like that. The quantitation of 5F is at odds with the microscopy image in 5D. If this image is representative, the anti-Ty staining in TyΔ is as strong as in the control NoΔ.

      We agree that the persistence of anti-tyrosination signal in the TyΔ and CtΔ mutant lines is unexpected, given that all wild-type alleles were replaced. This discrepancy has led us to further investigate the specificity of the T9028 antibody, as raised in the reviewer’s earlier comment. To address this concern, we are currently repeating the key experiments using the well-established YL1/2 monoclonal antibody, which is widely accepted for detecting tyrosinated α-tubulin in other systems.

      We also acknowledge that Figure 5F shows residual tyrosination signal, and the reviewer is correct that this should not occur if the modified residues are the exclusive PTM sites. One possible explanation is that adjacent residues or even alternative tubulin isoforms may serve as substrates. While α1-tubulin is the dominant isoform in Plasmodium, low-level expression of α2-tubulin has been detected in liver stages based on transcriptomic data, and it may contribute to the observed signal.

      Regarding the apparent discrepancy between the quantification in Figure 5F and the representative image in Figure 5D, we will revise the figure legend to clarify that image selection aimed to show detectable signal, not necessarily the average phenotype. We will also reassess and, if needed, repeat the quantification with improved image sets to ensure accuracy and consistency.

      We will revise the manuscript to reflect these points and include a more nuanced interpretation of the residual staining in the mutant lines.

      (14) The statement that the failure of CtΔ mutants to generate viable sporozoites is due to the lack of microtubule PTMs (lines 295-296) is speculative. The lack of the entire C-terminal tail could have a number of consequences, such as impaired microtubule assembly or failure to recruit and bind associated proteins. This is not necessarily linked to PTMs. Also, it has been shown in yeast that, for microtubules to form properly, exquisite regulation (proteostasis) of the ratio between α- and β-tubulin is essential (Wethekam and Moore, 2023). I am not sure, but according to Materials and Methods (line 423), the gene cassettes for replacing the wild-type tubulin gene with the mutant versions contain a selectable marker gene for pyrimethamine selection. Are there qPCR data that show that expression levels of mutant α-tubulin are more or less the same as the wild-type levels?

      We agree that attributing the developmental failure of the CtΔ mutants solely to the absence of microtubule post-translational modifications (PTMs) is speculative. As the reviewer rightly points out, deletion of the entire C-terminal tail may have multiple effects, including impaired microtubule assembly, altered α/β-tubulin stoichiometry, or disruption of interactions with essential microtubule-associated proteins (MAPs). These consequences may arise independently of PTMs.

      That said, we note that PTMs, particularly polyglutamylation, can modulate MAP binding by altering the surface charge of microtubules (Genova et al., 2023; Mitchell et al., 2010). Therefore, while PTM loss may be a contributing factor, we acknowledge that the phenotype likely results from a combination of mechanisms. We will revise the relevant section of the manuscript to present a more cautious and balanced interpretation.

      Regarding the reviewer’s question on expression levels: although the replacement constructs include a pyrimethamine resistance cassette, we have not yet quantified α-tubulin transcript levels by qPCR. In the interim, the study by Spreng et al. (2019) (reference 50) on related α1-tubulin mutations provides valuable insight. They observed no difference in mRNA levels in day 12 oocysts, yet reported fainter microtubule staining and shorter sporozoites, suggesting a post-transcriptional mechanism affecting protein expression or function in later stages. Furthermore, the phenotypic spectrum across their mutant panel (Suppl. Fig. 3 D and E) implies that robust α-tubulin regulation is highly sensitive to specific sequences.

      We acknowledge this as a current limitation in our study and will address it in the revised manuscript, noting that direct measurement of transcript levels is a key area for future investigation.

      (15) In the Discussion, my impression is that two recent studies, the superb Expansion Microscopy study by Bertiaux et al. (2021) and the cryo-EM study by Ferreira et al. (2023), are not sufficiently recognised (although they are cited elsewhere in the manuscript). The latter study includes a detailed description of the microtubule cytoskeleton in sporozoites. However, the present study clearly expands the knowledge about the structure of the cytoskeleton in liver stage parasites and is one of the few studies addressing the distribution and function of microtubule post-translational modifications in Plasmodium.

      Indeed, our work builds upon the established knowledge from Bertiaux et al. (2021) and the cryo-EM study by Ferreira et al. (2023), as rightly mentioned by the reviewer. We agree that these foundational studies, combined with our findings, will significantly expand the understanding of Plasmodium biology and cytoskeleton dynamics across its life cycle and will open the door for further investigations. We are grateful for this suggestion and will ensure these key studies are appropriately acknowledged in the revised manuscript.

      (16) I somewhat disagree with the statement of a co-occurrence of polyglutamylated and tyrosinated microtubules. I think the resolution is too low to reach that conclusion. As this is a bold claim, and would be contrary to what is known from other organisms, it would require a more rigorous validation. Given the apparent problems with the anti-Ty antibody (signal in the TyΔ mutant), one should be very cautious with this claim.

      This is a very important point to clarify. As mentioned previously, the initial experiments for these modifications were performed independently. It is established that sporozoite subpellicular microtubules exhibit both tyrosination and polyglutamylation. We will revise the manuscript to temper this statement and clearly indicate that the co-occurrence of these PTMs remains a hypothesis that requires more rigorous validation. As suggested, we are now conducting additional co-staining experiments using the better validated YL1/2 antibody to re-examine and directly compare the distribution of both PTMs within the same cell. These follow-up experiments will help clarify whether both modifications occur simultaneously on the same microtubule structures in Plasmodium liver stages.

      (17) In the Discussion (lines 311 and 377), it is again claimed that tyrosinated microtubules are "a well-known marker of stable microtubules". This statement is completely incorrect, and I am surprised by this serious mistake. A few lines later, the authors say that polyglutamylation is "commonly associated with dynamic microtubule behaviour". Again, this is completely incorrect and is the opposite of what is firmly established in the literature. Polyglutamylation and detyrosination are markers of stable microtubules.

      Indeed, in canonical eukaryotic systems, tyrosinated microtubules are generally considered markers of dynamic microtubule populations, whereas detyrosinated and polyglutamylated microtubules are more commonly associated with stability.

      We acknowledge this mistake and will revise the Discussion to correct these statements accordingly. In the context of Plasmodium, our observations suggest an unusual regulation of microtubule dynamics, which may reflect parasite-specific adaptations. For example, we observed tyrosinated α-tubulin in the stable subpellicular microtubules of sporozoites, structures typically known for their exceptional stability. This atypical association implies either non-canonical roles for tyrosination or parasite-specific mechanisms for modulating microtubule properties. Additionally, the presence of both PTMs at different stages of development and on different microtubule populations suggests tightly regulated spatial and temporal modulation of microtubule function.

      We will carefully revise the relevant sections of the manuscript to remove incorrect generalizations and ensure accurate representation of the current consensus in the field, while emphasizing the possibility of Plasmodium-specific adaptations that merit further study.

      (18) In line 339, the authors interpret the residual antibody staining after the introduction of the mutant tubulin as a compensatory mechanism. There is no evidence for this. More likely explanations are firstly the quality of the anti-Ty-antibody used (see comment above), and the fact that also β-tubulin carries C-terminal polyglutamylation sites, which haven't been investigated in this study. PTMs on β-tubulin are not compensatory, but normal PTMs, at least in all other organisms where microtubule PTMs have been investigated.

      As mentioned above, we are currently repeating the key experiments with the [YL1/2] antibody, as suggested. Furthermore, we fully agree with the reviewer's point regarding polyglutamylation on β-tubulin. The C-terminal tail of β-tubulin does indeed contain polyglutamylation sites. As we noted in the manuscript (Lines 340-352), this aspect has not been investigated in the present study, and we acknowledge it as a valuable direction for future research. We will revise the text accordingly to avoid overinterpretation and to more accurately reflect the limitations of our current data.

    1. many therapists believe strongly in the unconscious and the impact of early childhood experiences on the rest of a person’s life.

      Many therapists think that things we go through as young children, along with unconscious thoughts we may not even be aware of, can shape how we think, feel, and act for the rest of our lives. How might early childhood experiences affect the way a person handles relationships or emotions as an adult?

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      In the presented study, the authors aim to explore the role of nociceptors in the fine particulate matter (FPM)-mediated asthma phenotype, using rodent models of allergic airway inflammation. This manuscript builds on previous studies and identifies transcriptomic reprogramming and an increased sensitivity of the jugular nodose complex (JNC) neurons, one of the major sensory ganglia for the airways, on exposure to FPM along with Ova during the challenge phase. The authors then use QX-314, a selectively permeable form of lidocaine, and TRPV1 knockouts to demonstrate that nociceptor blocking can reduce airway inflammation in their experimental setup. The authors further identify the presence of Gfra3, a receptor for the protein Artemin, on the JNC neurons and demonstrate their sensitivity to Artemin as a ligand. They further show that alveolar macrophages release Artemin on exposure to FPM.

      We thank the reviewer for their valuable comments, which have significantly enhanced the quality of our manuscript. A point-by-point rebuttal is provided below.

      Strength

      The study builds on results available from multiple previous work and presents important results which allow insights into the mixed phenotypes of Asthma seen clinically. In addition, by identifying the role of nociceptors, they identify potential therapeutic targets which bear high translational potential.

      Weakness

      While the results presented in the study are highly relevant, there is a need for further mechanistic dissection to allow better inferences. Currently certain results seem associative. Also, certain visualisations and experimental protocols presented in the manuscript need careful assessment and interpretation. While Asthma is a chronic disease, the presented results are particularly important to explore Asthma exacerbations in response to acute exposure to air pollutants. This is relevant in today's age of increasing air pollution and increasing global travel.

      Major

      The JNC is a major group of neurons responsible for receiving sensory inputs from the airways. However, the DRG also contains nociceptors and is known to receive afference from the upper airways. An explanation of why the study was restricted to the JNC would be important.

      We acknowledge that some afferents to the upper airways do arise from the DRG, specifically in the upper thoracic segments (T1–T5). We have added a statement in the text to note this subset of nociceptive and spinally mediated pathways. However, the preponderance of evidence indicates that the majority of airway and lung afferents (70–80%, sometimes up to 90%) originate from the jugular–nodose complex (JNC). Given this large imbalance, and because our study focuses on the mechanosensory and chemosensory functions mediated primarily by the JNC, we restricted our analysis to this main vagal pathway. By contrast, DRG innervation, though functionally important for nociception and irritation-related reflexes, accounts for a smaller yet significant (~20–30%) fraction of the total afferent pool. The referenced tracing studies[1,2] support this distribution and are cited to clarify our rationale for emphasizing the JNC in our work.

      Similarly, the role of Artemin in the study remains associative. The study results suggest that Artemin sensitizes nociceptors to produce an increased inflammatory response (Supplementary Figure 2); however, both upstream and downstream evidence for this inference needs to be dissected further. For instance, the evidence for the role of Artemin in the model comes from ex vivo experiments with alveolar macrophages, but not from the experimental model created. Blocking or activation experiments could be performed, along with investigating the change in the total number of nociceptors with Artemin exposure. Similarly, the downstream effects of the potential Artemin-mediated JNC stimulation should be explored in the context of this experimental setup. A detailed dissection of the mechanisms is important. Additionally, it is also important to discuss the hypothesis leading to the selection of Artemin as a target, which currently seems arbitrary.

      Our data show that i) OVA-FPM-exposed AMs secrete Artemin and that ii) exogenous recombinant Artemin can sensitize nociceptors, potentially heightening the inflammatory response. As suggested, we agree that more upstream and downstream evidence is needed for definitive mechanistic insight. In response, we have expanded our experiments to include intravital microscopy, which demonstrates impaired motility of alveolar macrophages and neutrophils in nociceptor-ablated mice, suggesting a bidirectional crosstalk between AMs and nociceptor neurons.

      In future studies, we will perform blocking or activation studies to clarify Artemin’s in vivo effects and confirm its role in modulating airway nociceptors. We also recognize the importance of examining whether Artemin exposure alters the phenotype of these neurons and lung innervation density. As recommended, we plan targeted interventions (e.g., Artemin-neutralizing antibodies or overexpression strategies) to delineate the mechanisms by which Artemin-mediated nociceptor stimulation influences the local inflammatory environment.

      We have expanded our discussion to clarify that Artemin is a recognized growth factor known to sensitize certain sensory neurons, including those responsive to tissue injury and inflammation. This literature-based rationale guided our hypothesis that Artemin might increase nociceptor reactivity in the lung and thereby influence alveolar macrophage function. By combining ex vivo and intravital approaches, we have begun to map these interactions but agree that further in vivo studies are necessary to confirm causality, dissect signal transduction pathways, and fully validate Artemin’s contributions to AM–nociceptor crosstalk. We have revised our manuscript accordingly to highlight these limitations.

      A deeper exploration of the inflammatory parameters could be performed. The multiplex cytokine analysis shows a reduction in certain cytokines like IL-6 and MCP (Figure 3F), which needs to be discussed. Additionally, investigating the change in proportions of the different immune cell populations is important; the analysis is currently restricted to the eosinophil and neutrophil counts in the BAL. This is also important as the study builds on work from Prof. Chang's group, which also identified the expansion of an invariant iNKT cell population by FPM, regulatory in nature. Adding data on airway hyperresponsiveness, if possible, would be a welcome addition, considering asthma as the disease context.

      We thank the reviewer for highlighting the need for a more comprehensive exploration of inflammatory parameters. To address these concerns:

      (1) Cytokine Analysis: We re-ran all statistical analyses, including the CBA and ELISA assays, and confirmed that TNFα and Artemin are the only differentially expressed cytokines across experimental groups. We have expanded the Discussion to emphasize TNFα’s role in this context.

      (2) Immune Cell Profiling in BALF: Our data show that co-exposure with FPM exacerbates the infiltration of CD45+ cells, eosinophils, neutrophils, T cells, and monocytes. Notably, CD45+ cells and neutrophils were the only populations reduced under nociceptor neuron loss-of-function conditions (QX-314-treated or TRPV1-DTA mice; Author response image 1).

      Of note, we also confirmed these data using intravital imaging and in a second line of nociceptor-ablated mice (NaV1.8DTA). We are aware of Prof. Chang’s work suggesting expansion of an invariant iNKT cell population, and we plan to examine this population in future studies.

      (3) Airway Hyperresponsiveness (AHR): We recognize that adding AHR data would strengthen the asthma-related context. Unfortunately, we are not currently equipped to perform AHR measurements, but we intend to include this in future experiments to provide a more complete assessment of airway function.

      Author response image 1.

      The authors could revisit the data presented in terms of visualization. For instance, the pooled data presented in some of the figures is probably leading to a wide variation which makes interpretation more difficult. Presenting data separately for each experimental replicate might help the reader. This is also important considering the possible variation seen between experiments (for instance, in Figure 3A and 3C and 3B and 3D, the neutrophil and eosinophil panels for the same groups seem to have an almost 2-fold difference.). Similarly, in the cytokine analysis, the authors have used a common axis for depicting all cytokine values which leads to difficulties in interpretation (Figure 3F). Analysis of the RNA seq results and the DEGs could be revisited to include pathway analysis etc (Figure 2), and the supplementary information could include detailed lists of the major target genes.

      To address this query, we have completely reformatted all graphs and included both gene lists and lists of enriched pathways for all three comparisons in Supplementary Table 1. We also confirmed our flow cytometry analysis functionally by performing intravital imaging.

      The authors should also consider citing the previous experimental setup used for some particular protocols. For instance, the use of the specified protocol for OVA in a C57 background needs to be justified, as there are various protocols reported in the literature. Additionally, doses used in some experiments seem arbitrary (The FPM and Artemin exposure in Figure 4). Depicting the dose-response curve or citing previous literature for the same would be important. Similarly, different sample sizes seen in experiments should be explained, whether they are due to mortality, failure to exhibit phenotypes, or due to technical failures. The RNA seq experiment mentions only 2 biological replicates in one of the groups which should be addressed either by increasing the sample size or by replicating the experiment. Moreover, nested comparisons in experiments performed for Figure 1 need to be performed. Neurons isolated from each mouse should be maintained and analysed separately to retain biological replicates to better represent the heterogeneity.

      We appreciate the request for clarity regarding the experimental protocols and sample sizes:

      OVA Model in C57BL/6 Mice: We adapted a previously published OVA protocol in C57BL/6 mice[3-5] (PMID: 39661516), which uses two doses of sensitization to compensate for the lower Th2 response compared to BALB/c[6]. We increased the dose of OVA (100 µg) because our initial experiments produced low eosinophil infiltration. This dosage is on the higher side, and some studies have noted local IFNγ induction in C57BL/6 mice; however, we did not detect IFNγ in our setup.

      FPM and Artemin Doses: We did not perform a full dose-response assay for FPM and Artemin but used 100 ng/mL as reported in prior literature, where TRPA1 and TRPV1 mRNA were upregulated after 18 hours of incubation[7]. This reference has been added for clarity.

      Sample Sizes and Exclusions: One control mouse was excluded from the RNA-seq experiment because a parallel PCA analysis indicated it was an outlier. This was the only exclusion in the study, and this has been indicated in the Methods section of the article.

      Nested Comparisons and Biological Replicates: We reanalyzed the relevant data with a nested one-way ANOVA and updated the figures accordingly. Measurements from neurons isolated from each mouse were first averaged per mouse to preserve biological replicates and capture potential heterogeneity, and the data were then analysed on the per-mouse averages.
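
      The per-mouse averaging described above can be sketched as follows. This is only an illustration under stated assumptions: the group names, mouse IDs, and values are hypothetical, the helper functions are not taken from the manuscript, and an ordinary one-way ANOVA on per-mouse means is used here as a simple stand-in for the nested one-way ANOVA the authors report.

```python
from statistics import mean


def per_mouse_means(cells):
    """Collapse per-neuron measurements to one mean per mouse.

    `cells` maps group name -> mouse id -> list of per-neuron values,
    so each mouse (the biological replicate) contributes one number.
    """
    return {
        group: [mean(values) for values in mice.values()]
        for group, mice in cells.items()
    }


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical per-neuron responses (arbitrary units); not data from the study.
cells = {
    "control": {"m1": [1.0, 2.0, 3.0], "m2": [2.0, 4.0], "m3": [3.0, 5.0]},
    "OVA": {"m4": [4.0, 6.0], "m5": [5.0, 7.0, 6.0], "m6": [7.0, 7.0]},
    "OVA+FPM": {"m7": [8.0, 8.0], "m8": [9.0, 9.0, 9.0], "m9": [10.0, 10.0]},
}

means = per_mouse_means(cells)
f_stat, df_b, df_w = one_way_anova_f(list(means.values()))
print(means)
print(f_stat, df_b, df_w)
```

      Averaging first means the degrees of freedom reflect the number of mice rather than the number of neurons, which avoids treating cells from the same animal as independent replicates.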

      The manuscript should be more detailed regarding the statistics employed. Currently, there is a section mentioned in the methods section, but details of corrections employed and specific stats for specific experiments should be described. There are also some minor grammatical errors and incomplete sentences in the manuscript which should be corrected. The authors should also consider a more expansive literature review in the introduction/discussion sections.

      We have updated the figure legends and methods to include more detailed information on the specific statistical tests used for each experiment. In addition, we have fixed minor grammatical errors and incomplete sentences throughout the manuscript. Finally, we have expanded our Introduction and Discussion to include additional references and a broader literature context.

      Reviewer #2 (Public review):

      The authors sought to investigate the role of nociceptor neurons in the pathogenesis of pollution-mediated neutrophilic asthma.

      We thank the reviewer for their valuable comments, which have significantly enhanced the quality of our manuscript. A point-by-point rebuttal is provided below.

      Strength

      The authors utilize TRPV1-ablated mice to confirm the effects of intranasally administered QX-314, used to block sodium currents. The authors demonstrate that artemin, which is upregulated in alveolar macrophages in response to pollution, sensitizes JNC neurons, thereby increasing their responsiveness to pollution. Ablation or inactivity of nociceptor neurons prevented the pollution-induced increase in inflammation.

      Weakness

      While neutrophilic, the model used doesn't appear to truly recapitulate a Th2/Th17 phenotype. No IL-17A is evident in the BALF within the model (Figure 3F). The relevance of the RNAseq dataset is unclear; none of the identified DEGs were evaluated in the context of mechanism. The authors overall achieved the aim of demonstrating that nociceptor neurons are important to the pathogenesis of pollution-exacerbated asthma. Their results support their conclusions overall, although there are ways the study findings can be strengthened. This work further evaluates how nociceptor neurons contribute to asthma pathogenesis, which is important to consider when proposing treatment strategies for undertreated asthma endotypes.

      Major

      Utilizing a different model, one using house dust mite or Alternaria alternata or similar that is able to induce a true Th2/Th17 type response that is also more translatable to humans for confirmation.

      We appreciate the suggestion to use additional allergen models. In a pilot study, we did observe increased Artemin in the BALF of house dust mite–treated mice, although the levels were low under our current dosing schedule (20 µg/dose daily from Day 0–4 and Day 7–9, with sacrifice on Day 10; Author response image 2). Conversely, using an Alternaria alternata model at 100 µg/dose daily from Day 0–2 (sacrificed on Day 3) did not yield a detectable increase in Artemin. We suspect these findings may reflect the specific dose and timing used. We plan to refine our protocols (e.g., longer exposures or higher doses) for HDM and/or Alternaria to better model a Th2/Th17 response and further validate our observations in a setting more translatable to human asthma.

      Author response image 2.

      Additional analysis, maybe pathway analysis, on the RNAseq dataset presented in Figure 2. It is unclear how these genes are relevant or how they affect functionality. At present it is acceptable to say the neurons are transcriptionally reprogrammed, but no protein evaluation is provided, which would get more at function. However, the authors do show some functional data in Figure 1, so maybe this could somehow be discussed in relation to Figure 2.

      We have expanded our RNA-seq analysis to include gene lists and enriched pathways for all three comparisons in Supplementary Table 1. We have also revised our discussion to align these transcriptomic changes with the functional data shown in Figure 1. While we have not yet performed protein-level validation for all identified genes, the patterns observed in our RNA-seq dataset suggest pathways potentially tied to nociceptor activation and the downstream inflammatory response. We plan to conduct targeted protein analyses in future studies to further substantiate these findings.

      Histology and localization of neutrophils/nociceptor neurons/alveolar macrophages would enhance the study findings.

      We appreciate the reviewer’s suggestion to include histological data showing the distribution of neutrophils, nociceptor neurons, and alveolar macrophages. While we have not yet performed detailed histological staining of these cell types, we have added intravital microscopy data (Figure 4) that illustrate impaired AM and neutrophil motility in nociceptor-ablated mice. We plan to include additional histological analyses in future studies to further localize these cells in the lung tissue.

      Minor:

      The first 3 figures are small and hard to read.

      We have enlarged Figures 1 and 3 in the revised manuscript to improve readability. We have also added the corresponding gene lists and enriched pathways to Supplementary Table 1 for clarity.

      The figures are mislabeled in the text. Figure 2 is discussed twice in two different contexts; the second mention is supposed to be labeled as Figure 2.

      We corrected the mislabeled figures in the text, ensuring that each figure is referenced accurately.

      Figure 4 isn't cited in the text. I think it is supposed to be referenced in the paragraph before the discussion starts and is currently labeled as Figure 1.

      We have updated the text to properly cite Figure 4 in the relevant paragraph before the Discussion begins, rather than labeling it as Figure 1.

      Notating which statistical analysis was used with each figure/subfigure would be beneficial. Also, it's important to notate if the data was analyzed for multiple comparisons.

      We have revised each figure/subfigure legend to specify the statistical tests used, including information on whether corrections for multiple comparisons were applied. This provides a clearer understanding of how each dataset was analyzed.

      Reviewer #3 (Public review):

      Asthma is a complex disease that includes endogenous epithelial, immune, and neural components that respond awkwardly to environmental stimuli. Small airborne particles with diameters in the range of 2.5 micrometers or less, so-called PM2.5, are generally thought to contribute to some forms of asthma. These forms of asthma may have increased numbers of neutrophils and/or eosinophils present in bronchoalveolar lavage fluid and are difficult to treat effectively as they tend to be poorly responsive to steroids. Here, Wang and colleagues build on a recent model that incorporated PM2.5, which was found to have a neutrophilic component. Wang altered the model to provide an extra kick via the incorporation of ovalbumin. Building on their prior expertise linking nociceptors and inflammation, they find that silencing TRPV1-expressing neurons, either pharmacologically or genetically, abrogated inflammation and the accumulation of neutrophils. By examining bronchoalveolar lavage fluid, they found not only that levels of a number of cytokines were increased, but also that artemin, a protein that supports neuronal development and function, was elevated, which did not occur in nociceptor-ablated mice. They also found that alveolar macrophages exposed to PM2.5 particles had increased artemin transcription, suggesting a further link between pollutants and immune and neural interactions.

      We thank the reviewer for their valuable comments, which have significantly enhanced the quality of our manuscript. A point-by-point rebuttal is provided below.

      Weakness

      There are substantial caveats that must be attached to the suggestions by the authors that targeting nociceptors might provide an approach to the treatment of neutrophilic airway inflammation in pollution-driven asthma in general and wildfire-associated respiratory problems in particular.

      These caveats include the uncertainty of the relevance of the conventional source of PM2.5 to pollution and asthma. According to the National Institute of Standards and Technology (NIST), the standard reference material (SRM) 2786 is a mix obtained from an air intake system in the Czech Republic. It is not clear exactly what is in the mix, and a recent bioRxiv preprint (https://www.biorxiv.org/content/10.1101/2023.08.18.553903v3.full.pdf) reveals the presence of endotoxin. Care should thus be taken in interpreting data using particulate matter. Regarding wildfires, there is data that indicates that such exposure is toxic to macrophages. What impact might that then have on the production of cytokines, and artemin, in humans?

      We recognize the potential limitations of using SRM2786 (obtained from a Czech air-intake system) as a model for real-world PM2.5 exposure. Our rationale for choosing SRM2786 is that it is commercially available and represents a broad spectrum of ambient air pollutants, in contrast to more specialized sources like diesel exhaust particles. However, we acknowledge in the discussion the presence of endotoxin in SRM2786, as suggested by recent reports, and agree that this may influence immune responses and should be considered when interpreting our data.

      Regarding wildfire-associated exposure, we are aware that certain components of wildfire smoke can be toxic to macrophages. We do not think this plays a significant role in the current study design, as the numbers of AMs, as determined by flow cytometry and intravital microscopy, are similar when comparing OVA-exposed mice to OVA-FPM-exposed animals. Thus, these results rule out significant AM toxicity by FPM.

      Ultimately, while our findings suggest that modulating nociceptor activity may reduce neutrophilic inflammation, we emphasize that additional research (including different PM2.5 sources, validation of endotoxin levels, and in vivo confirmation in human-relevant models) is necessary before drawing definitive conclusions about treating pollution-driven asthma or wildfire-induced respiratory problems.

      The Introductory paragraph implies links between wildfire events, particulate exposure, and neutrophilic asthma. I am not aware of such a link having been established, in which case the paragraph needs revision. In the paragraph that begins with 'Urban pollution', it is suggested that eosinophilic asthma is treatment responsive in comparison to the neutrophilic form. That may not be the case, and these cellular components may often occur together. In much of the manuscript, there is a mismatch between the text and the figure numbers. For example, in the Results, Figure 2 should be Figure 3 some of the time, and Figure 3 is actually Figure 4, while the reference to Figure 1F-H is Figure 4H. Please check carefully.

      (a) Introduction Paragraph and Wildfire–Neutrophilic Asthma Link

      We have added references to the Introduction to support the link between wildfire exposure, respiratory symptoms, and neutrophilic asthma [8-12].

      (b) Distinction Between Eosinophilic and Neutrophilic Asthma

      We recognize that eosinophilic and neutrophilic airway infiltrates can co-occur in the same individual and that treatment responsiveness can vary considerably. Our intention was to note that conventional asthma therapies (e.g., inhaled corticosteroids) are generally more effective for eosinophilic-driven disease than for neutrophilic phenotypes, but we agree that these inflammatory endotypes often overlap in clinical practice. We have revised the text in the “Urban pollution” section to acknowledge this complexity and to clarify that inflammatory cell populations in asthma are not always discrete.

      (c) Figure Numbering and Text–Figure Mismatch

      We sincerely apologize for the confusion caused by mismatched figure labels and references in the Results section. We have carefully reviewed and corrected all figure references throughout the manuscript to ensure accuracy.

      References

      (1) Kim, S. H. et al. Mapping of the Sensory Innervation of the Mouse Lung by Specific Vagal and Dorsal Root Ganglion Neuronal Subsets. eNeuro 9 (2022). https://doi.org/10.1523/ENEURO.0026-22.2022

      (2) McGovern, A. E. et al. Evidence for multiple sensory circuits in the brain arising from the respiratory system: an anterograde viral tract tracing study in rodents. Brain Struct Funct 220, 3683-3699 (2015). https://doi.org/10.1007/s00429-014-0883-9

      (3) Shen, C.-C., Wang, C.-C., Liao, M.-H. & Jan, T.-R. A single exposure to iron oxide nanoparticles attenuates antigen-specific antibody production and T-cell reactivity in ovalbumin-sensitized BALB/c mice. International journal of nanomedicine, 1229-1235 (2011).  

      (4) Delayre-Orthez, C., De Blay, F., Frossard, N. & Pons, F. Dose-dependent effects of endotoxins on allergen sensitization and challenge in the mouse. Clinical & Experimental Allergy 34, 1789-1795 (2004).  

      (5) Morokata, T., Ishikawa, J. & Yamada, T. Antigen dose defines T helper 1 and T helper 2 responses in the lungs of C57BL/6 and BALB/c mice independently of splenic responses. Immunology letters 72, 119-126 (2000).  

      (6) Li, L., Hua, L., He, Y. & Bao, Y. Differential effects of formaldehyde exposure on airway inflammation and bronchial hyperresponsiveness in BALB/c and C57BL/6 mice. PLoS One 12, e0179231 (2017).  

      (7) Ikeda-Miyagawa, Y. et al. Peripherally increased artemin is a key regulator of TRPA1/V1 expression in primary afferent neurons. Molecular pain 11, s12990-015-0004-7 (2015).

      (8) Baan, E. J. et al. Characterization of Asthma by Age of Onset: A Multi-Database Cohort Study. J Allergy Clin Immunol Pract 10, 1825-1834 e1828 (2022). https://doi.org/10.1016/j.jaip.2022.03.019

      (9) de Nijs, S. B., Venekamp, L. N. & Bel, E. H. Adult-onset asthma: is it really different? Eur Respir Rev 22, 44-52 (2013). https://doi.org/10.1183/09059180.00007112

      (10) Gianniou, N. et al. Acute effects of smoke exposure on airway and systemic inflammation in forest firefighters. J Asthma Allergy 11, 81-88 (2018). https://doi.org/10.2147/JAA.S136417

      (11) Noah, T. L., Worden, C. P., Rebuli, M. E. & Jaspers, I. The Effects of Wildfire Smoke on Asthma and Allergy. Curr Allergy Asthma Rep 23, 375-387 (2023). https://doi.org/10.1007/s11882-023-01090-1

      (12) Wilgus, M. L. & Merchant, M. Clearing the Air: Understanding the Impact of Wildfire Smoke on Asthma and COPD. Healthcare (Basel) 12 (2024). https://doi.org/10.3390/healthcare12030307

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      In this work, the authors apply tDCS to awake and anesthetized macaques to determine the effect of this modality on dynamic connectivity measured by fMRI. The question is to understand the extent to which tDCS can influence conscious or unconscious states. Their target was the PFC. During the conscious states, the animals were executing a fixation task. Unconsciousness was achieved by administering a constant infusion of propofol and a continuous infusion of the muscle relaxant cisatracurium. They observed the animals while awake receiving anodal or cathodal HD-tDCS applied to the PFC. During the cathodal stimulation, they found disruption of functional connectivity patterns, enhanced structure-function correlations, a decrease in Shannon entropy, and a transition towards patterns that were more commonly anatomically based. In contrast, under propofol anesthesia, anodal HD-tDCS stimulation appreciably altered the brain connectivity patterns and decreased the correlation between structure and function. The PFC stimulations altered patterns associated with consciousness as well as those associated with unconsciousness.

      Strengths: 

      The authors carefully executed a set of very challenging experiments that involved applying tDCS in awake and anesthetized non-human primates while conducting functional imaging.

      We thank the Reviewer for summarising our study and for his appreciation of the highly challenging experiments we performed.

      Weaknesses:

      The authors show that tDCS can alter functional connectivity measured by fMRI but they do not make clear what their studies teach the reader about the effects of tDCS on the brain during different states of consciousness. No important finding is stated contrary to what is stated in the abstract. It is also not clear what the work teaches us about how tDCS works nor is it clear what are the "clinical implications for disorders of consciousness." The deep anesthesia is akin to being in a state of coma. This was not discussed.  

      While the authors have executed a set of technically challenging experiments, it is not clear what they teach us about how tDCS works, normal brain neurophysiology, or brain pathological states such as disorders of consciousness.

      We thank the reviewer for his comments. We agree that we could better highlight the value and implications of our work, and we take this opportunity to improve our manuscript according to the suggestions.

      Actions in the text: We have added several new paragraphs in the Discussion section, considering these comments and other related remarks from the Reviewing Editor (see below our answer to the first comment of the Reviewing Editor: REC#1).

      Reviewer #2 (Public review): 

      General comments: 

      The authors investigated the effects of tDCS on brain dynamics in awake and anesthetized monkeys using functional MRI. They claim that cathodal tDCS disrupts the functional connectivity pattern in awake monkeys while anodal tDCS alters brain patterns in anesthetized monkeys. This study offers valuable insight into how brain states can influence the outcomes of noninvasive brain stimulation. However, there are several aspects of the methods and results sections that should be improved to clarify the findings.

      We thank the Reviewer for the summary and appreciation of our study.  

      Major comments 

      For the anesthetized monkeys, the anode location differs between subjects, with the electrode positioned to stimulate the left DLPFC in monkey R and the right DLPFC in monkey N. The authors mention that this discrepancy does not result in significant differences in the electric field due to the monkeys' small head size. However, this is incorrect, as placing the anode on the left hemisphere would result in a much lower EF in the right DLPFC than placing the anode on the right side. Running an electric field simulation would confirm this. Additionally, the small electrode size suggested by the Easy cap configuration for NHP appears sufficient to stimulate the targeted regions focally. If this interpretation is correct, the authors should provide additional evidence to support their claim, such as a computational simulation of the EF distribution.

      We thank the Reviewer for the comments. First, regarding the reviewer’s statement that placing the anode on the left hemisphere would result in a much lower EF in the right DLPFC than placing the anode on the right side, we would like to clarify that we did not use a typical 4 x 1 concentric ring high-definition setup (which consists of a small centre electrode surrounded by four return electrodes), but a two-electrode montage, with one electrode over the left or right PFC and the other one over the contralateral occipital cortex. According to EF modelling papers, a 4 x 1 high-definition setup would produce an EF that is focused and limited to the cortical area circumscribed by the ring of the return electrodes (Datta et al. 2009; Alam et al. 2016). Therefore, targeting the left or right DLPFC with a 4 x 1 setup would produce an EF confined to the targeted hemisphere of the PFC. In contrast, we expect the brain current flow generated with our 2-electrode setup to be broader, despite the small size of the electrodes,  because there is no constraint from return electrodes. Thus, with our setup, the current is expected to flow between the PFC and the occipital cortex (see also our responses to comments R3.3., R.E.C.#2.1. and R.E.C.#2.2.). 

      Second, we would like to point out that in awake experiments, in which we stimulated the right PFC of both monkeys, there was no gross evidence of left or right asymmetry in the computed functional connectivity patterns (Figure 3A, Figure 3 - figure supplement 2A; Figure 5A). These results, showing that our stimulation montages did not induce asymmetric dynamic FC changes in NHPs, support the idea that our setups did not generate EFs that were spatially focused enough to alter brain activity in one hemisphere substantially more than the other.

      Third, it is also worth noting that current evidence suggests that human brains are significantly more lateralized than those of macaques. Macaque monkeys have been found to have some degree of lateralized networks, but these are of lower complexity, and the lateralization is less pronounced and less functionally organized than in humans (Whey et al., 2014; Mantini et al., 2013). This suggests that, even if the stimulation were focal enough to stimulate the left or the right part of the PFC only, the behavioural effects would likely be similar.

      We strongly agree with the reviewer that conducting an EF simulation would be valuable to confirm our expectations and to gain a comprehensive view of the characteristics of the EFs generated with our different setups in NHPs. However, the challenge is that EF computational models have been developed for humans, and their use in NHPs is not straightforward due to significant anatomical differences. For example, macaque monkeys are distinct from humans in terms of brain size, shape and cortical organisation, skull thickness, and the presence of muscles, as well as different tissue conductivities (Lee et al. 2015; Datta et al. 2016; Mantell et al. 2023). We plan to address this in future work.

      Actions in the text: In the Materials and Methods section, we have modified the sentence: “Because of the small size of the monkey's head and because we did not use return electrodes to restrict the current flow (as is achieved with typical high-definition montages (Datta et al. 2009; Alam et al. 2016)), we expected that tDCS stimulation with the two symmetrical montages would result in nearly equivalent electric fields across the monkey’s head and produce roughly similar effects on brain activity.” 

      We also added a new sentence about EF simulation: 

      “This would need to be confirmed by running an electric field simulation. However, computational electric field models have been developed for humans, and their use in NHPs is not straightforward due to anatomical specificities. Indeed, monkeys differ from humans in terms of brain size, shape and cortical organization, skull thickness, tissue conductivities and the presence of muscles (Lee et al. 2015; Datta et al. 2016; Mantell et al. 2023). Modelling of EFs generated with the specific tDCS montages employed in this study will be performed in future work.”

      For the anesthetized monkeys, the authors applied 1 mA tDCS first, followed by 2 mA tDCS. A 20-minute stimulation duration of 1 mA tDCS is strong enough to produce after-effects that could influence the brain state during the 2 mA tDCS. This raises some concerns. Previous studies have shown that 1 mA tDCS can generate EF of over 1 V/m in the brain, and the effects of stimulation are sensitive to brain state (e.g., eye closed vs. eye open). How do the authors ensure that there are no after-effects from the 1 mA tDCS? This issue makes it challenging to directly compare the effects of 1 mA and 2 mA stimulation.

      We agree with the reviewer's comment that 1 mA tDCS may induce aftereffects, as has been observed in several human studies (e.g., Jamil et al. 2017, 2020). Although the differences between the 1 mA post-stimulation and baseline conditions were not significant in our analyses, it is still possible that the stimulation produced some effects below the threshold of significance that may contribute, albeit weakly, to the changes observed during and after 2 mA stimulation. We have, therefore, amended the paper in line with the reviewer's comments.

      Actions in the text: We have added the following text in the Results section: 

      “While several human studies have reported that 1 mA transcranial stimulation induces aftereffects (e.g., Jamil et al. 2017, 2020; Monte-Silva et al. 2010), the differences between the 1 mA post-stimulation and baseline conditions were not significant in our analyses. However, it is still possible that the 1 mA stimulation produced some effects below the threshold of significance that may contribute to the changes observed during and after the 2 mA stimulation.”

      The occurrence rate of a specific structural-functional coupling pattern among random brain regions shows significant effects of tDCS. However, these results seem counterintuitive. It is generally understood that noninvasive brain stimulation tends to modulate functional connectivity rather than structural or structural-functional connectivity. How does the occurrence rate of structural-functional coupling patterns provide a more suitable measure of the effectiveness of tDCS than functional connectivity alone? I would recommend that the authors present the results based on functional connectivity itself. If there is no change in functional connectivity, the relevance of changes in structural-functional coupling might not translate into a meaningful alteration in brain function, making it unclear how significant this finding is without corresponding functional evidence.

      First of all, we would like to make it clear that the occurrence rate of patterns as a function of their SFC is not intended to be used or seen as a ‘better’ measure of the efficacy of tDCS. Instead, it is one aspect of the effects of tDCS on whole-brain functional cortical dynamics, obtained from refined measures (phase-coherences), that specifically addresses the coupling between structure and function. This type of analysis is further motivated by its increasing use in the literature due to its suspected relationship to wakefulness (e.g., Barttfeld et al. 2015; Demertzi et al. 2019; Castro et al. 2023). Also, in our analysis, the structure is kept constant: the connectivity matrix used to correlate the functional brain states is always the same (CoCoMac82). Thus, the influence of tDCS on the structure-function side can only be explained by modulating the functional aspects, as suggested by intuition and previous results.
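
      As a minimal illustration of this point (a sketch, not the analysis pipeline used in the study), structure-function coupling can be thought of as the correlation between a functional pattern and a fixed structural matrix. The function names and the toy 4-region matrices below are hypothetical, and Pearson correlation over the upper triangle is an assumption made for the example. Because `structure` never changes, any change in the score can only come from the functional pattern.

```python
from math import sqrt


def upper_triangle(matrix):
    """Flatten the strictly upper-triangular entries of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def structure_function_correlation(pattern, structure):
    """Correlate a functional pattern with a fixed structural matrix."""
    return pearson(upper_triangle(pattern), upper_triangle(structure))


# Hypothetical structural connectivity for 4 regions (kept constant).
structure = [
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]

# Two hypothetical functional patterns: one resembling the structure,
# one with strong coupling where there is no structural link.
pattern_close = [
    [1.0, 0.8, 0.7, 0.1],
    [0.8, 1.0, 0.2, 0.0],
    [0.7, 0.2, 1.0, 0.9],
    [0.1, 0.0, 0.9, 1.0],
]
pattern_far = [
    [1.0, 0.1, 0.0, 0.9],
    [0.1, 1.0, 0.8, 0.7],
    [0.0, 0.8, 1.0, 0.2],
    [0.9, 0.7, 0.2, 1.0],
]

sfc_close = structure_function_correlation(pattern_close, structure)
sfc_far = structure_function_correlation(pattern_far, structure)
print(sfc_close, sfc_far)
```

      In this toy case the first pattern scores high and the second scores negative, although the structural matrix is identical in both calls.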

      Second, we agree with the reviewer that studying the functional changes induced by tDCS alone could be valuable. However, the metrics typically used in FC analysis are static: FC states are computed either by averaging spatial correlations over time and then analyzing them, for instance through graph-theoretical properties (or by directly computing element-wise differences), or by computing spatial correlations over a sliding time window and then analyzing the properties of the visited FC states in the same way. Because these metrics are static, they would show no difference if the states visited are essentially the same (which is expected from non-invasive neuromodulation that has not already demonstrated a strong and/or characteristic impact) while the dynamical process of visiting those states changes. In addition, in resting-state fMRI, differences in FC are hard to interpret because between-session, within-condition differences are usually found, with some degree of variance for each condition. Interpreting between-condition differences is therefore difficult when the modulation of the system’s activity is subtle. On the other hand, more subtle differences can be captured with more detailed analyses: using phase-based methods as we did, incorporating a statistical learning component with regard to the dynamics of the system (for instance, supervised learning, as we did, followed by temporal and transition-based methodology), and adding dimensions along which the analysis can be interpreted. In our case, we were interested in characterizing resting-state differences between stimulation conditions, which have nuanced and subtle interactions with the biological system.
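
      The limitation of static summaries can be shown with a toy example (a sketch under simplifying assumptions, not the method used in the study). The two hypothetical state sequences below visit the same two brain states equally often, so any measure based only on which states are visited cannot distinguish them, while a simple dynamic measure can. The function names and the sequences are invented for illustration.

```python
from collections import Counter


def occupancy(states):
    """Fraction of time points spent in each state (a static summary)."""
    counts = Counter(states)
    total = len(states)
    return {state: counts[state] / total for state in sorted(counts)}


def switching_rate(states):
    """Fraction of consecutive time points where the state changes."""
    switches = sum(1 for a, b in zip(states, states[1:]) if a != b)
    return switches / (len(states) - 1)


# Hypothetical sequences of visited states (labels 0 and 1).
slow = [0, 0, 0, 0, 1, 1, 1, 1]  # one long dwell in each state
fast = [0, 1, 0, 1, 0, 1, 0, 1]  # alternates at every time point

# The static summary is identical for both sequences...
print(occupancy(slow), occupancy(fast))
# ...whereas the dynamic summary separates them.
print(switching_rate(slow), switching_rate(fast))
```

      Both sequences have the same occupancy, but the second switches state at every step, which is the kind of difference that only a transition-based analysis can detect.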

      As such, classical measures of differences between FC states are unlikely to be refined and precise enough. To show this, we provide additional files investigating these classically used measures, such as differences in average FC matrices and changes in functional graph properties (modularity, efficiency and density) of the visited FC states. For the first measure, comparing region-to-region FCs yields very few statistically significant results. For the second, virtually no differences are observed in the properties of the functional states visited. 

      These results suggest, as expected, that the brain states visited across the different stimulation conditions are topologically quite similar, and that only very few region-specific pairwise functional connectivities are modulated by specific tDCS montages. In contrast, the dynamical process dictating how brain activity passes from one state to another is influenced in a more apparent and meaningful way, as shown by the dynamical analysis presented in the main figures: the influence depends on the montage, is somewhat consistent across the post-stimulation conditions, and can be made sense of by considering the theoretical effects of near-anodal versus near-cathodal neuromodulation.

      Actions in the text: We have added new supplementary files showing the effects of the stimulations on FC matrices and on classical functional graph properties in awake and anesthesia datasets (Supplementary Files 3 & 4).

      We have added new sentences about these new analyses on the effects of the stimulations on FC matrices and on classical functional graph properties in the Results section:

      “In addition, we performed the main analyses separately for the two monkeys, explored the inter-condition variability (Supplementary File 2), and computed classical measures of functional connectivity such as average FC matrices and functional graph properties (modularity, efficiency and density) of the visited FC states (Supplementary File 3).... In contrast, classical FC metrics did not show significant differences across stimulation conditions, highlighting the value of dynamic FC metrics to capture the neuromodulatory effects of tDCS.”

      “Analyses of the two monkeys separately showed that the changes in slope and Shannon entropy were bigger in one of the two monkeys but went in the same direction (Supplementary File 2), while classical FC metrics did not capture any statistical differences between the different stimulation conditions (Supplementary File 3).”
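
      For readers unfamiliar with the entropy measure mentioned here, the following minimal sketch shows how a Shannon entropy can be computed from the occurrence rates of brain patterns. It assumes that entropy is taken over the distribution of pattern occurrence rates and uses base-2 logarithms; both are assumptions made for illustration, and the rates are invented.

```python
from math import log2


def shannon_entropy(rates):
    """Shannon entropy (in bits) of a set of non-negative occurrence rates.

    Rates are normalised to sum to 1; zero rates contribute nothing.
    """
    total = sum(rates)
    probabilities = [r / total for r in rates if r > 0]
    return -sum(p * log2(p) for p in probabilities)


# Hypothetical occurrence rates of four brain patterns.
uniform = [0.25, 0.25, 0.25, 0.25]  # all patterns visited equally often
peaked = [0.85, 0.05, 0.05, 0.05]   # one pattern dominates

print(shannon_entropy(uniform), shannon_entropy(peaked))
```

      A repertoire in which all patterns are visited equally often reaches the maximum entropy for that number of patterns, whereas a repertoire dominated by a single pattern has lower entropy.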

      The authors recorded data from only two monkeys, which may limit the investigation of the group effects of tDCS. As the number of scans for the second monkey in each consciousness condition is lower than that in the first monkey, there is a concern that the main effects might primarily reflect the data from a single monkey. I suggest that the authors should analyze the data for each monkey individually to determine if similar trends are observed in both subjects.

      We agree that the small number of subjects is a limitation of our study. However, we have already addressed these aspects by reporting statistical analyses that consider them, using linear models of such variables, and running them through ANOVA tests. In addition, we experimentally ensured that we recorded a relatively high number of sessions over a period of several years. Regardless, we agree that our study would benefit from further investigation into this matter. We have therefore prepared complementary figures showing the main analysis performed separately for the two monkeys as proposed, as well as further investigations into the inter-condition variability outmatching the inter-individual variability, itself being also outmatched by intra-individual changes. 

      Actions in the text: We have added a supplementary file showing the main analyses performed separately for the two monkeys (Supplementary File 2) and further investigations into the inter-condition variability (Supplementary Files 3 & 4).

      We have added new sentences about these analyses performed separately for the two monkeys in the Results section:

      “In addition, we performed the main analyses separately for the two monkeys, explored the inter-condition variability (Supplementary File 2), and computed classical measures of functional connectivity such as average FC matrices and functional graph properties (modularity, efficiency and density) of the visited FC states (Supplementary File 3). The separate analyses showed that the changes in slope and Shannon entropy were substantially more pronounced in one of the two monkeys, corroborating some of the effects captured in the ANOVA tests.”

      “Analyses of the two monkeys separately showed that the changes in slope and Shannon entropy were bigger in one of the two monkeys but went in the same direction (Supplementary File 2).”

      Anodal tDCS was only applied to anesthetized monkeys, which limits the conclusion that the authors are aiming for. It raises questions about the conclusion regarding brain state dependency. To address this, it would be better to include the cathodal tDCS session for anesthetized monkeys. If cathodal tDCS changes the connectivity during anesthesia, it becomes difficult to argue that the effects of cathodal tDCS vary depending on the state of consciousness as discussed in this paper. On the other hand, if cathodal tDCS would not produce any changes, the conclusion would then focus on the relationship between the polarity of tDCS and consciousness. In that case, the authors could maintain their conclusion but might need to refine it to reflect this specific relationship more accurately. 

      We agree with the reviewer that it would have been interesting to investigate the effects of cathodal tDCS in anesthetized monkeys. However, due to the challenging nature of the experimental procedures under anesthesia, we had to limit the investigations to only one stimulation modality. We chose to deliver anodal stimulation because, from a translational point of view, we aimed to provide new information on the effects of tDCS under anesthesia as a model for disorders of consciousness. It also made much more sense to increase the cortical excitability of the prefrontal cortex in an attempt to wake up the sedated monkeys rather than doing the opposite.

      Actions in the text: We have added a new sentence in the Results section:

      “Due to the challenging nature of the experimental procedures under anesthesia, we limited the investigations to only one stimulation modality. We chose to deliver anodal stimulation to provide new information on the effects of tDCS under anesthesia as a model for disorders of consciousness and to increase the cortical excitability of the PFC in an attempt to wake up the sedated monkeys.”

      Reviewer #3 (Public review): 

      Summary: 

      This study used transcranial direct current stimulation administered using small 'high-definition' electrodes to modulate neural activity within the non-human primate prefrontal cortex during both wakefulness and anaesthesia. Functional magnetic resonance imaging (fMRI) was used to assess the neuromodulatory effects of stimulation. The authors report on the modification of brain dynamics during and following anodal and cathodal stimulation during wakefulness and following anodal stimulation at two intensities (1 mA, 2 mA) during anaesthesia. This study provides some possible support that prefrontal direct current stimulation can alter neural activity patterns across wakefulness and sedation in monkeys. However, the reported findings need to be considered carefully against several important methodological limitations. 

      Strengths: 

      A key strength of this work is the use of fMRI-based methods to track changes in brain activity with good spatial precision. Another strength is the exploration of stimulation effects across wakefulness and sedation, which has the potential to provide novel information on the impact of electrical stimulation across states of consciousness.

      We thank the Reviewer for the summary and for highlighting the strengths of our study. 

      Weaknesses: 

      The lack of a sham stimulation condition is a significant limitation, for instance, how can the authors be sure that results were not affected by drowsiness or fatigue as a result of the experimental procedure?

      We agree with the reviewer that adding control conditions could have strengthened our study. Control conditions usually consist of a sham condition or active control conditions. However, as mentioned in response to one of Reviewer 2's comments (R.2.5), we had to make choices, because the demanding nature of the experiments, especially under anesthesia, limited how many we could perform. 

      In the awake state, we acquired data with two experimental conditions; the monkeys were exposed to either anodal (F4/O1) or cathodal (O1/F4) PFC tDCS. As anodal tDCS of the PFC induced only minor changes in brain dynamics, it could be considered as an active control condition for the cathodal condition, which had striking effects on the cortical dynamics. It is also worth noting that doubts have been raised about the neurobiological inertness of certain sham protocols. Indeed, different sham protocols have been employed in the literature, some of which may produce unintended effects (Fonteneau et al. 2019). Therefore, active control conditions, such as reversing the polarity of the stimulation or targeting a different brain region, have been proposed to provide better control (Fonteneau et al. 2019). Furthermore, in the context of experiments performed under anesthesia, the relevance of a sham control condition typically used to achieve adequate blinding is questionable. 

      With regard to drowsiness and fatigue as a result of the experimental procedure, we agree with the reviewer that this is a common problem in functional imaging due to the length of the recording sessions. We assumed, as was done in previous work (Uhrig, Dehaene, and Jarraya 2014; Wang et al. 2015), that the monkeys' performance on the fixation task during acquisition would capture these periods of fatigue. Therefore, only sessions with fixation rates above 85% were included in our analysis. 

      Actions in the text: We have now specified, in the Materials and Methods section, the fact that only runs with a high fixation rate (> 85%) were included in the study: 

      “To ensure that the results were not biased by fatigue or drowsiness due to the lengthy

      In the anaesthesia condition, the authors investigated the effects of two intensities of stimulation (1 mA and 2 mA). However, a potential confound here relates to the possibility that the initial 1 mA stimulation block might have caused plasticity-related changes in neural activity that could have interfered with the following 2 mA block due to the lack of a sufficient wash-out period. Hence, I am not sure any findings from the 2 mA block can really be interpreted as completely separate from the initial 1 mA stimulation period, given that they were administered consecutively. Several previous studies have shown that same-day repeated tDCS stimulation blocks can influence the effects of neuromodulation (e.g., Bastani and Jaberzadeh, 2014, Clin Neurophysiol; Monte-Silva et al., J. Neurophysiology). 

      We agree with the reviewer’s comment that the initial 1 mA stimulation block might have induced changes in neural activity and that the 20-minute post 1 mA block would not be long enough to wash out these changes. This comment is very similar to the second comment made by Reviewer 2 (R.2.2). Although our experimental data do not support this possibility (as the differences between the 1 mA post-stimulation and baseline conditions were not significant), it is still conceivable that the stimulation produced some effects below the threshold of significance and that these might weakly contribute to the changes observed during and after the 2 mA stimulation. 

      Actions in the text: We have modified the paper according to the reviewers' comments (please see our answer and actions in the text to R.2.2.).

      The different electrode placement for the two anaesthetised monkeys (i.e., Monkey R: F3/O2 montage, Monkey N: F4/O1 montage) is problematic, as it is likely to have resulted in stimulation over different brain regions. The authors state that "Because of the small size of the monkey's head, we expected that tDCS stimulation with these two symmetrical montages would result in nearly equivalent electric fields across the monkey's head and produce roughly similar effects on brain activity"; however, I am not totally convinced of this, and it really would need E-field models to confirm. It is also more likely that there would in fact be notable differences in the brain regions stimulated as the authors used HD-tDCS electrodes, which are generally more focal.

      We thank the Reviewer for the remark, which is very similar to the first comment from Reviewer 2. Please see our answer to that comment (R.2.1).

      Actions in the text: We have modified the paper according to the reviewers' comments (please see the actions taken in response to R.2.1.).

      Given the very small sample size, I think it is also important to consider the possibility that some results might also be impacted by individual differences in response to stimulation. For instance, in the discussion (page 9, paragraph 2) the authors contrast findings observed in awake animals versus anaesthetised animals. However, different monkeys were examined for these two conditions, and there were only two monkeys in each group (monkeys J and Y for awake experiments [both male], and monkeys R and N [male and female] for the anaesthesia condition). From the human literature, it is well known that there is a considerable amount of inter-individual variability in response to stimulation (e.g., Lopez-Alonso et al., 2014, Brain Stimulation; Chew et al., 2015, Brain Stimulation), therefore I wonder if some of these differences could also possibly result from differences in responsiveness to stimulation between the different monkeys? At the end of the paragraph, the authors also state "Our findings also support the use of tDCS to promote rapid recovery from general anesthesia in humans...and suggest that a single anodal prefrontal stimulation at the end of the anesthesia protocol may be effective." However, I'm not sure if this statement is really backed-up by the results, which failed to report "any behavioural signs of awakening in the animals" (page 7)?

      We thank the Reviewer for this comment. Because working with non-human primates is expensive and labor intensive, the sample sizes in classical macaque experiments are generally small (typically 2-4 subjects per experiment). Our sample size (i.e. 2 rhesus macaques in awake experiments and 2 macaques under sedation, 11 +/- 9 scan sessions per animal, 288 and 136 runs in the awake and anesthesia state, respectively) is comparable to other previous work in non-human primates using fMRI (Milham et al. 2018; Yacoub et al. 2020; Uchimura, Kumano, and Kitazawa 2024). In addition, we would like to point out that the baseline cortical dynamics we found before stimulation, whether in the awake or sedated state, are comparable to previous studies (Barttfeld et al. 2015; Uhrig et al. 2018; Tasserie et al. 2022). This suggests our results are reproducible across datasets, despite the small sample size.

      That being said, we agree with the reviewer that inter-individual variability in response to stimulation can be considerable, as shown by a large body of literature in the field. It seems possible that the two monkeys studied in each condition responded differently to the stimulation. But even if that’s the case, our results suggest that at least in one of the two monkeys, cathodal PFC stimulation in the awake state and anodal PFC stimulation under propofol anesthesia induced striking changes in brain dynamics, which we believe is a significant contribution to the field. 

      In fact, supplementary analyses, as proposed by Reviewer 2 (cf. R.2.4), investigating how the different measures we used were affected by tDCS show that the effects are indeed more apparent and significant in monkey Y than in monkey J. Still, the effects observed in monkey J are congruent with those observed in monkey Y and at the population level (though less pronounced). We also show that this inter-individual variability is outmatched by the inter-condition variability (as indicated by our initially strong statistical results at the population level), showing that, even though responses differ between subjects, the effects observed at the population level cannot be accounted for by subject-specific differences alone.

      Lastly, the Reviewer questioned whether our results support that a single anodal prefrontal stimulation at the end of the anesthesia protocol could effectively promote rapid recovery from general anesthesia, because the stimulation did not wake the animals in our experiments. It should be emphasized that in our case, the monkeys were stimulated while they were still receiving continuous propofol perfusion. In contrast, during the recovery process from anesthesia, the delivery of the anesthetic drug is stopped. It is therefore conceivable that anodal PFC tDCS, which successfully enriched brain dynamics in sedated monkeys in our experiments, may accelerate the recovery from anesthesia when the drug is no longer administered. 

      Actions in the text: We have added a line in the Materials and Methods to compare to other studies:

      “Our sample size is comparable to previous work in NHP using fMRI (Milham et al. 2018; Yacoub et al. 2020; Uchimura, Kumano, and Kitazawa 2024).”

      Reviewing Editor Comments: 

      In some cases, authors opt to submit a revised manuscript. Should you choose to do so, please be aware that the reviewers have indicated that their appraisal is unlikely to change unless some of the suggested field modelling is incorporated into the work. This may change the evaluation of the strength of evidence, but the final wording will be subject to reviewer discretion. Details for responding to the reviews are provided at the bottom of this email.

      Reviewer #1 (Recommendations for the authors): 

      The work should discuss the implications of their experiments for using tDCS to arouse a patient from a coma. The anesthetized animal is effectively in a drug-induced coma. While they observed connectivity changes, these changes did not map nicely onto behavioral changes. 

      I would suggest that the authors spell out more clearly what they view as the clinical implications of their work in terms of new insights into how tDCS may be used to either understand and or treat disorders of consciousness.

      We thank the Reviewer for his thoughtful comments. We appreciate the opportunity to clarify and expand on the key findings and implications of our work, particularly regarding the new insights into how tDCS can be used to understand and treat disorders of consciousness. We therefore provide a broader perspective on the clinical implications of our experiments regarding coma and disorders of consciousness. We also agree with the Reviewer that the absence of behavioral changes but the presence of functional differences should be more clearly addressed. 

      Actions in the text: We have added a few lines about the relevance of anesthesia as a model for disorders of consciousness in the Introduction section:

      “Anesthesia provides a unique model for studying consciousness, which, similarly to DOC, is characterized by the disruption or even the loss of consciousness (Luppi 2024). Additionally, anesthesia mechanisms involve several subcortical nuclei that are key components of the brain's sleep and arousal circuits (Kelz and Mashour 2019).”

      In the Discussion section, we have modified and expanded a paragraph about the effects of tDCS in DOC patients and how this technique could be further used to study consciousness: From another clinical perspective, our results demonstrating that 2 mA anodal PFC tDCS decreased the structure-function correlation and modified the dynamic repertoire of brain patterns during anesthesia (Figures 6 and 7) are consistent with the beneficial effects of such stimulation in DOC patients (Thibaut et al., 2014; Angelakis et al., 2014; Thibaut et al., 2017; Zhang et al., 2017; Martens et al., 2018; Cavinato et al., 2019; Wu et al., 2019; Hermann et al., 2020; Peng et al., 2022; Thibaut et al., 2023). Although some clinical trials investigated the effects of stimulating other brain regions, such as the motor cortex (Martens et al., 2019; Straudi et al., 2019) or the parietal cortex (Huang et al., 2017; Guo et al., 2019; Zhang et al., 2022; Wan et al., 2023; Wang et al., 2020), the DLPFC appears to be the most effective target for patients with a minimally conscious state (Liu et al., 2023). In terms of neuromodulatory effects in DOC patients, DLPFC tDCS has been reported to increase global excitability (Bai et al., 2017), increase the P300 amplitude (Zhang et al., 2017; Hermann et al., 2020), improve the fronto-parietal coherence in the theta band (Bai et al., 2018), enhance the putative EEG markers of consciousness (Bai et al., 2018; Hermann et al., 2020) and reduce the incidence of slow-waves in the resting state (Mensen et al., 2020). Our findings further support the PFC as a relevant target for modulating consciousness level and align with growing evidence showing that the PFC plays a key role in conscious access networks (Mashour, Pal, and Brown 2022; Panagiotaropoulos 2024). Nevertheless, we hypothesize that other brain targets for tDCS may be of interest for consciousness restoration, potentially using multi-channel tDCS (Havlík et al., 2023). 
Among transcranial electrical stimulation techniques, tDCS has the great advantage of facilitating either excitation or inhibition of brain regions, depending on the polarity of the stimulation. Sdoia et al. (2019) exploited this advantage to investigate the causal involvement of the DLPFC in conscious access to a visual stimulus during an attentional blink paradigm. While conscious access was enhanced by anodal stimulation of the left DLPFC compared to sham stimulation, opposite effects were found with cathodal stimulation compared to sham over the same locus. Finally, this literature and our findings suggest that tDCS constitutes a non-invasive, reversible, and powerful tool for studying consciousness.”

      We have added a new paragraph about patients with cognitive-motor dissociation and dissociation between consciousness and behavioral responsiveness:

      “Changes in the state of consciousness are generally closely associated with changes in behavioural responsiveness, although some rare cases of dissociation have been described. Cognitive-motor dissociation (CMD) is a condition observed in patients with severe brain injury, characterized by behavior consistent with unresponsive wakefulness syndrome or a minimally conscious state minus (Thibaut et al., 2019). However, in these patients, specific cortical brain areas activate in response to mental imagery tasks (e.g., imagining playing tennis or returning home) in a manner indistinguishable from that of healthy controls, as shown through fMRI or EEG (Thibaut et al., 2019; Owen et al., 2006; Monti et al., 2010; Bodien et al., 2024). Thus, although CMD patients are behaviorally unresponsive, they demonstrate cognitive awareness that is not outwardly apparent. It is worth noting that both the structure-function correlation and the rate of the pattern closest to the anatomy were shown to be significantly reduced in unresponsive patients showing command following during mental imagery tasks compared to those who do not show command following (Demertzi et al., 2019). These observations would be compatible with our findings in anesthetized macaques exposed to 2 mA anodal PFC tDCS. The richness of the brain dynamics would be recovered (at least partially, in our experiments), but not the behaviour. This hypothesis also fits with a recent longitudinal fMRI study on patients recovering from coma (Crone et al., 2020). The researchers examined two groups of patients: one group consisted of individuals who were unconscious at the acute scanning session but regained consciousness and improved behavioral responsiveness a few months later, and the second group consisted of patients who were already conscious from the start and only improved behavioral responsiveness at follow-up. 
By comparing these two groups, the authors could distinguish between the recovery of consciousness and the recovery of behavioral responsiveness. They demonstrated that only initially conscious patients exhibited rich brain dynamics at baseline. In contrast, patients who were unconscious in the acute phase and later regained consciousness had poor baseline dynamics, which became more complex at follow-up. Complete recovery of both consciousness and responsiveness under general anesthesia is possible through electrical stimulation of the central thalamus (Redinbaugh et al., 2020; Tasserie et al., 2022).”

      Reviewer #2 (Recommendations for the authors): 

      Method 

      (1) The authors mentioned that they used HD-tDCS in their experiments; however, they used 1 x 1 tDCS, which is not HD-tDCS but rather single-channel tDCS.

      We thank the Reviewing Editor for pointing out this ambiguous wording. We understand that "HD-tDCS", which we used in our paper to refer to high-density 1x1 tDCS (because we used small carbon electrodes instead of the large sponge electrodes employed in conventional tDCS), may cause some confusion with high-definition tDCS, which uses compact ring electrodes and most commonly refers to a 4x1 montage (1 active central electrode over the target area and 4 return electrodes placed around the central electrode).

      Therefore, to avoid any confusion, we will use the term “tDCS” rather than “HD-tDCS” to describe the technique used in this paper and remove mentions of high-density or high-definition tDCS.

      Actions in the text: We have replaced the abbreviation “HD-tDCS” with “tDCS” throughout the paper. We have also removed the sentence about high-definition tDCS in the Introduction (“While conventional tDCS relies on the use of relatively large rectangular pad electrodes, high-density tDCS (HD-tDCS) utilizes more compact ring electrodes, allowing for increased focality, stronger electric fields, and presumably, greater neurophysiological changes (Datta et al. 2009; Dmochowski et al. 2011)”) and the two related citations in the References section.

      (2) Please provide the characteristics of electrodes, including their size, shape, and thickness.

      We thank the Reviewing Editor for this recommendation. We now provide the complete characteristics of the tDCS electrodes used in the paper.

      Actions in the text: We have added a sentence describing the characteristics of the tDCS electrodes in the Materials and Methods section:

      “We used a 1x1 electrode montage with two carbon rubber electrodes (dimensions: 1.4 cm x 1.85 cm, 0.93 cm thick) inserted into Soterix HD-tES MRI electrode holders (base diameter: 25 mm; height: 10.5 mm), which are in contact with the scalp. These electrodes (2.59 cm<sup>2</sup>) are smaller than conventional tDCS sponge electrodes (typically 25 to 35 cm<sup>2</sup>).”
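
      The electrode area quoted above follows directly from the stated dimensions. The short sketch below reproduces that arithmetic and, as an illustration only, derives a nominal current density for the 2 mA intensity mentioned elsewhere in this response. It assumes the current is spread uniformly over the rectangular electrode face, which ignores the holder and gel interface; the function names are ours and not from the manuscript.

```python
# Electrode dimensions quoted in the Methods sentence above (cm).
ELECTRODE_WIDTH_CM = 1.4
ELECTRODE_LENGTH_CM = 1.85
# Stimulation intensity reported elsewhere in the response (mA).
CURRENT_MA = 2.0


def contact_area_cm2(width_cm, length_cm):
    """Area of a rectangular electrode face in cm^2."""
    return width_cm * length_cm


def nominal_current_density(current_ma, area_cm2):
    """Nominal current density in mA/cm^2.

    Assumption: the current is uniform over the electrode face.
    """
    return current_ma / area_cm2


area = contact_area_cm2(ELECTRODE_WIDTH_CM, ELECTRODE_LENGTH_CM)
print(f"carbon electrode area: {area:.2f} cm^2")
print(f"nominal density at 2 mA: {nominal_current_density(CURRENT_MA, area):.3f} mA/cm^2")

# Same current through conventional sponge electrodes (25 to 35 cm^2).
for sponge_area in (25.0, 35.0):
    density = nominal_current_density(CURRENT_MA, sponge_area)
    print(f"sponge {sponge_area:.0f} cm^2: {density:.3f} mA/cm^2")
```

      The comparison with sponge electrodes is only meant to show the order-of-magnitude difference in nominal density; it says nothing about the electric field reaching the cortex.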

      (3) Could the authors clarify why they chose to stimulate the right DLPFC? Is there a specific rationale for this choice? Additionally, could the authors explain how they ensured that the stimulation targeted the DLPFC, given that the monkey cap might differ from human configurations? In many NHP studies, structural MRI is used to accurately determine electrode placement. Considering that a single channel F4 - O2 montage was used, even a small displacement of the frontal electrode laterally could result in the electric field not adequately covering the DLPFC. Could the authors provide structural MRI images and details of electrode positioning to help readers better understand targeting accuracy?

      We thank the Reviewing Editor for the thoughtful comments and recommendations. We appreciate the opportunity to further clarify our rationale for stimulating the right DLPFC and also the suggestion to provide structural MRI images and details of electrode positioning, which we think will improve the quality of the paper by showing targeting accuracy.

      First, we would like to clarify that our initial decision to stimulate the right PFC in most animals was driven by experimental constraints. Indeed, we had limited access to the left PFC in three of the four macaques, either due to the presence of cement (spreading asymmetrically from the centre of the head) used to fix the head post in awake animals or due to a scar in one of the two animals studied under anesthesia. 

      Second, we agree with the Reviewing Editor on the importance of showing details of electrode positioning and evidence of targeting accuracy across MRI sessions. Therefore, we now provide structural images showing the positions of anodal and cathodal electrodes in almost all acquired sessions: 10 sessions (out of 10) under anesthesia and 30 sessions in the awake state (out of 34 sessions, because we could not acquire structural images in four sessions). These images show that, in anesthesia experiments, the anodal electrode was positioned over the dorsal prefrontal cortex and the cathodal electrode was placed over the contralateral occipital cortex (at the level of the parieto–occipital junction) in both monkeys. In the awake state, the montage still targeted the prefrontal cortex and the occipital cortex, but with a slightly different placement. One of the electrodes was placed over the prefrontal cortex, closer to the premotor cortex than in anesthesia experiments, while the other one was placed over the occipital cortex (V1), slightly more posterior than in anesthesia experiments. These images therefore show that the placement was relatively accurate across sessions and reproducible between monkeys in each of the two arousal conditions.

      Actions in the text: We have added a supplementary file showing electrode positioning in 40 of the 44 acquired MRI sessions (Supplementary File 1). We have also added a new supplement figure (Figure 1 - figure supplement 1) showing electrode positioning in representative MRI sessions of the awake and anesthetized experiments in the main manuscript. 

      We added a few sentences referring to these figures in the Results section:

      “Representative structural images showing electrode placements on the head of the two awake monkeys are shown in Figure 1 - figure supplement 1A. Supplementary File 1 displays the complete set of structural images, showing that the two electrodes were accurately placed over the prefrontal cortex and the occipital cortex in a reproducible manner across awake sessions.”

      Figure 1 - figure supplement 1. Structural images displaying electrode placements on the head of monkeys. A) Awake experiments. Representative sagittal, coronal and transverse MRI sections, and the corresponding skin reconstruction images showing the position of the prefrontal and the occipital electrodes on the head of monkeys J. and Y. B) Anesthesia experiments. Representative sagittal, coronal and transverse MRI sections, and the corresponding skin reconstruction images showing the position of the prefrontal and occipital electrodes on the head of monkeys R. and N.

      Supplementary File 1 (see attached file). Structural images showing the position of the tDCS electrodes on the monkey's head across sessions. Sagittal, coronal and transverse MRI sections, and corresponding skin reconstruction images showing the position of the prefrontal and occipital electrodes on the monkey's head for each MRI session (except for 4 sessions in which no anatomical scan was acquired). The two electrodes were accurately placed over the prefrontal cortex and the occipital cortex in a reproducible manner across sessions and between the two monkeys studied in each arousal state. In anesthesia experiments, the anodal electrode was placed over the dorsal prefrontal cortex, while the cathodal electrode was positioned over the parieto-occipital junction. In awake experiments, the prefrontal electrode was positioned over the dorsal prefrontal cortex/pre-motor cortex, while the occipital electrode was placed over visual area 1. The position of the two electrodes differed slightly between the anesthetized and awake experiments due to different body positions (the prone position of the sedated monkeys prevented a more posterior position of the occipital electrode) and also due to the presence of a headpost on the head of the two monkeys in awake experiments (the monkeys we worked with in anesthesia experiments did not have a headpost).

      (4) If the authors did not analyze the data for the passive event-related auditory response, it may be helpful to remove the related sentence to avoid potential confusion for readers.

      We thank the Reviewing Editor for the comment. Although we understand the reviewer’s point of view, we decided to keep this information in the paper to inform the reader that the macaques were passively engaged in an auditory task, as this could have some influence on the brain state. In the Materials and Methods section, we already mentioned that the analysis of the cerebral responses to the auditory paradigm is not part of the paper. We have modified the sentence to make it clearer and to avoid potential confusion for readers.

      Actions in the text: We have modified the sentence referring to the passive event-related auditory response in the Materials and Methods section:

      “All fMRI data were acquired while the monkeys were engaged in a passive event-related auditory task, the local-global paradigm, which is based on local and global deviations from temporal regularities (Bekinschtein et al. 2009; Uhrig, Dehaene, and Jarraya 2014). The present paper does not address how tDCS perturbs cerebral responses to local and global deviants, which will be the subject of future work.”

      (5) Could the authors clarify what x(t) represents in the equation? Additionally, it would be better to number the equations.

      We apologize for the confusion; x(t) represents the evolution of the BOLD signals over time. We have numbered the equations as suggested.

      Actions in the text: We have added explanations of the notation and numbered the equations.

      (6) It would be much better to provide schematic illustrations to explain what the authors did for analyzing fMRI data.

      We thank the Reviewing Editor for the suggestion and now provide a new figure as suggested.  

      Actions in the text: We have added a new figure (Figure 2) graphically showing the overall analysis performed. We have added a sentence about the new Figure 2 in the Results section:  “A graphical overview of the overall analysis is shown in Figure 2.” We have renumbered Figure 2 - supplement figures accordingly.

      Figure 2. fMRI Phase Coherence analysis. A) Left) Animals were scanned before, during and after PFC tDCS stimulation in the awake state (two macaques) or under deep propofol anesthesia (two macaques). Right) Example of z-scored filtered BOLD time series for one macaque, 111 time points with a TR of 2.4 s. B) Hilbert transform of the z-scored BOLD signal of one ROI into its time-varying amplitude A(t) (red) and the real part of its phase, cos(φ) (green). In blue, we recover the original z-scored BOLD signal as A(t)cos(φ). C) Example of the phase of the Hilbert transform for each brain region at one TR. D) Symmetric matrix of cosines of the phase differences between all pairs of brain regions. E) For each dataset separately, we concatenated the vectorized upper triangles of the phase-difference matrices for all TRs, all participants, and all conditions, and used the K-means algorithm to obtain the brain patterns whose statistics were then analyzed across conditions.
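
      The steps listed in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the analytic signal is built with an FFT (the standard construction behind the Hilbert transform), the clustering is a plain Lloyd-style K-means written out by hand, and the function names, the toy sinusoidal data, and the choice of k = 6 in the demo are ours.

```python
import numpy as np


def analytic_signal(x):
    """Analytic signal along the last axis (FFT-based Hilbert transform)."""
    x = np.asarray(x, dtype=float)
    n = x.shape[-1]
    spectrum = np.fft.fft(x, axis=-1)
    weights = np.zeros(n)
    weights[0] = 1.0                       # keep the DC term once
    if n % 2 == 0:
        weights[n // 2] = 1.0              # keep the Nyquist term once
        weights[1:n // 2] = 2.0            # double the positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    # Negative frequencies keep weight 0, which yields the analytic signal.
    return np.fft.ifft(spectrum * weights, axis=-1)


def phase_coherence_vectors(bold):
    """Map (n_regions, n_timepoints) BOLD to (n_timepoints, n_pairs) vectors.

    Each row is the vectorized upper triangle of cos(phase_i - phase_j)
    at one TR (panels B to E of the caption).
    """
    bold = np.asarray(bold, dtype=float)
    phase = np.angle(analytic_signal(bold))                      # (R, T)
    coherence = np.cos(phase[:, None, :] - phase[None, :, :])    # (R, R, T)
    rows, cols = np.triu_indices(bold.shape[0], k=1)
    return coherence[rows, cols, :].T


def assign(data, centroids):
    """Index of the nearest centroid (squared Euclidean distance) per row."""
    dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def kmeans(data, k, n_iter=50, seed=0):
    """Plain Lloyd-style K-means; returns (centroids, labels)."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(data, centroids)
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:           # leave empty clusters unchanged
                centroids[j] = members.mean(axis=0)
    return centroids, assign(data, centroids)


# Toy data only: 4 "regions", 111 time points at TR = 2.4 s.
t = np.arange(111) * 2.4
toy_bold = np.vstack([np.sin(2 * np.pi * f * t) for f in (0.03, 0.05, 0.07, 0.09)])
vectors = phase_coherence_vectors(toy_bold)
patterns, labels = kmeans(vectors, k=6)
print(vectors.shape, patterns.shape, labels.shape)
```

      In a real analysis the rows of `vectors` from all sessions and conditions of one dataset would be stacked before clustering, as described in panel E.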

      Results 

      (1) In Figures 3A, 5A, and 6A showing brain connectivity, it is difficult to relate the connectivity variability among the brain regions. Instead of displaying connection lines for nodes, it would be more effective if the authors highlighted significant, strong connectivity within specific brain regions using additional methods, such as bootstrapping.

      We thank the Reviewing Editor for the comment and suggestion. The connection lines indeed represent all the synchronizations above 0.5 and all the anti-synchronizations below -0.5 between all pairs of brain regions. As the Reviewing Editor notes, one element we had not addressed is the heterogeneity in coherence between individual brain regions. We therefore provide additional supplementary figures showing, for all centroids mentioned in the main figures, the variance of each brain region's distribution of phase coherence with the rest of the brain. A high value indicates that a region has a wide range of coherence values, while a low value indicates that the coherences a region has with the rest of the brain are similar to one another. Thus, a brain with uniform color would indicate high homogeneity in coherence among brain regions, while sharp changes in color would reveal that certain regions are more subject to high variance in their coherence distributions. We expect these new figures to expose more clearly the connectivity variability among the brain regions.
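
      As an illustration of the measure described above, the sketch below computes, for each region of a brain pattern, the variance of its coherence with all other regions. Two choices here are assumptions on our part rather than details given in the response: the diagonal (self-coherence) is excluded, and the population variance is used. The toy 3 x 3 matrix is invented for the example.

```python
from statistics import pvariance


def regional_coherence_variance(pattern):
    """Variance, per region, of its phase coherence with every other region.

    pattern: square symmetric matrix (list of lists) of coherences in [-1, 1].
    Assumptions: the diagonal is excluded and the population variance is used.
    """
    n = len(pattern)
    return [
        pvariance([pattern[i][j] for j in range(n) if j != i])
        for i in range(n)
    ]


# Toy pattern: region 0 is equally coherent with both other regions,
# while regions 1 and 2 each have one positive and one negative coupling.
toy_pattern = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, -0.5],
    [0.5, -0.5, 1.0],
]
print(regional_coherence_variance(toy_pattern))
```

      A region with identical coherence to all others gets a variance of zero (homogeneous), whereas mixed positive and negative couplings raise the value.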

      Actions in the text: We have added new figures showing, for all centroids mentioned in the main figures, the variances in phase-based connectivity of the distributions of coherence (Figure 3 - figure supplement 3; Figure 5 - figure supplement 2; Figure 6 - figure supplement 3; Figure 7 - figure supplement 2). One of them is shown below for the awake-only analysis (Figure 3 - figure supplement 3).

      Figure 3 - figure supplement 3. Variance in inter-region phase coherences of brain patterns. Low values (red and light red) indicate that the distribution of synchronizations between a brain region and the rest of the brain has relatively low variance, while high values (blue and light blue) indicate relatively high variance. Both supra (top) and subdorsal (bottom) views are displayed for each brain pattern from the main figure, in the same order as before: from left (1) to right (6) as their respective SFC increases.
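
      The ordering of patterns by SFC mentioned in the caption can be illustrated as follows. This sketch assumes that the structure-function correlation of a pattern is the Pearson correlation between the upper triangle of the pattern and the upper triangle of a structural connectivity matrix; that definition is our working assumption for the example and should be checked against the Methods. All matrices are invented toy values.

```python
from math import sqrt


def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def structure_function_correlation(pattern, structural):
    """Assumed SFC: correlation between functional pattern and anatomy."""
    return pearson(upper_triangle(pattern), upper_triangle(structural))


def order_by_sfc(patterns, structural):
    """Patterns sorted from lowest to highest SFC (left to right in the figure)."""
    return sorted(patterns, key=lambda p: structure_function_correlation(p, structural))


# Toy structural matrix and two toy patterns (hypothetical values).
structural = [[0.0, 0.9, 0.1], [0.9, 0.0, 0.5], [0.1, 0.5, 0.0]]
anatomy_like = [[1.0, 0.8, 0.0], [0.8, 1.0, 0.4], [0.0, 0.4, 1.0]]
anatomy_unlike = [[1.0, -0.8, 0.0], [-0.8, 1.0, -0.4], [0.0, -0.4, 1.0]]
for p in order_by_sfc([anatomy_like, anatomy_unlike], structural):
    print(round(structure_function_correlation(p, structural), 3))
```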

      We added a few sentences about variances in phase-based connectivity of the distributions of coherence in the Results section:

      “Further investigation of the variances in inter-region phase coherences of brain patterns, presented in Figure 3 - figure supplement 3, revealed two main findings. First, all the patterns exhibited some degree of lateral symmetry. Second, except for the pattern with the highest SFC, most patterns displayed high heterogeneity in their coherence variances and striking inter-pattern differences. These observations reflect both the segmentation of distinct functional networks across patterns and a topological organization within the patterns themselves: some regions showed a broader spectrum of synchrony with the rest of the brain, while others exhibited narrower distributions of coherence variances. For instance, unlike other brain patterns, pattern 5 was characterized by a high coherence variance in the frontal premotor areas and low variance in the occipital cortex, whereas pattern 3 had a high variance in the frontal and orbitofrontal regions. In addition, we performed the main analyses separately for the two monkeys, explored the inter-condition variability (Supplementary File 2), and computed classical measures of functional connectivity such as average FC matrices and functional graph properties (modularity, efficiency and density) of the visited FC states (Supplementary File 3).”

      “The variance in inter-regional phase coherence across brain patterns showed notably that pattern 4, in contrast to most other patterns, was characterized by a high variance in frontal premotor areas and a low variance in the occipital cortex (Figure 5 - figure supplement 2).”

      “The variance in inter-region phase coherences of the brain patterns is displayed in Figure 6 - figure supplement 3 and showed a striking heterogeneity between the patterns. For example, pattern 5 had a low overall variance (except in the frontal cortex), while pattern 1 was the only pattern with a high variance in the occipital cortex.”

      “The variance in inter-region phase coherences of brain patterns is displayed in Figure 6 - figure supplement 2.”

      (2) For both conditions, only 2 to 3 out of 6 patterns showed significant effects of tDCS on the occurrence rate. Is it sufficient to claim the authors' conclusion?

      We thank the Reviewing Editor for the comment. We would like to point out that similar differences in the occurrence rates of specific brain patterns (particularly patterns at the extremities of the SFC scale) have been reported previously. Prior work in patients suffering from disorders of consciousness, in healthy humans, and in non-human primates has shown, using a similar method of analysis, that not all brain states are equally disturbed by loss of consciousness, even across different modalities of transition to unconsciousness (Luppi et al. 2021; Z. Huang et al. 2020; Demertzi et al. 2019; Castro et al. 2023; Golkowski et al. 2019; Barttfeld et al. 2015). Therefore, we believe that our conclusions are supported by the results.
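
      For readers unfamiliar with the measure discussed here, the occurrence rate of a brain pattern can be sketched as the fraction of time points (TRs) assigned to that pattern by the clustering step. The snippet below is a minimal illustration under that assumption; the label sequence is invented, and the statistical comparison between conditions is not shown.

```python
from collections import Counter


def occurrence_rates(labels, n_patterns):
    """Fraction of time points assigned to each pattern index 0..n_patterns-1."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(k, 0) / total for k in range(n_patterns)]


# Hypothetical cluster labels for eight TRs of one session, six patterns.
toy_labels = [0, 0, 5, 5, 5, 2, 0, 5]
rates = occurrence_rates(toy_labels, n_patterns=6)
print(rates)
```

      Comparing such rate vectors between stimulation conditions, pattern by pattern, is what allows only a subset of the six patterns to show a significant effect while the others remain unchanged.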

      (3) If the authors want to assert that the brain state significantly influences the effects of tDCS as discussed in the manuscript, further analysis is necessary. First, it would be great to show the difference in connectivity between two consciousness conditions during the baseline (resting state) to see how resting state connectivity or structural connectivity varies. Second, demonstrating the difference in connectivity between the awake and anesthetized conditions (e.g., awake during cathodal vs. anesthetized cathodal) to show how the connectivity among the brain regions was changed by the brain state during tDCS. This would strengthen the authors' conclusion.

      We thank the reviewer for this comment. Firstly, we’d like to clarify that the structural connectivity doesn’t change from one session to another in the same animal and changes only minimally between subjects. Secondly, we agree with the Reviewing Editor that it is informative to show the differences between the baselines and this is what we have done. The results are shown in Figures 5 and 7. Regarding the comparison of the stimulation conditions across arousal levels, the only contrast that we could make is to compare 2 mA anodal awake with 2 mA anodal anesthetized (during and post-stimulation). However, as 2 mA anodal stimulation in the awake state did not affect the connectivity much (compared to the awake baseline), the results would be very similar to the comparison of the awake baseline with 2 mA anodal anesthetized, which is shown in Figure 7. Therefore, we believe that this would add little information and considerable redundancy.

      Reviewer #3 (Recommendations for the authors): 

      Introduction, par 2: HD-tDCS does not necessarily produce stronger electric fields (E-fields) in the brain. The E-field is largely montage-dependent, and some configurations such as the 4x1 configuration can actually have weaker E-fields compared to conventional tDCS designs (i.e., with two sponge electrodes) as electrodes are often closer together resulting in more current being shunted by skull, scalp, and CSF. I would consider re-phrasing this section.

      We agree with the Reviewing Editor that high-definition tDCS does not necessarily produce stronger electric fields in the brain and apologize for the confusion caused by our use of HD-tDCS to refer to high-density tDCS. To avoid any confusion, we have removed the sentence mentioning that HD-tDCS produces stronger electric fields.

      Actions in the text: We have removed the sentence about high-definition tDCS in the Introduction (“While conventional tDCS relies on the use of relatively large rectangular pad electrodes, high-density tDCS (HD-tDCS) utilizes more compact ring electrodes, allowing for increased focality, stronger electric fields, and presumably, greater neurophysiological changes (Datta et al. 2009; Dmochowski et al. 2011)”) and the two related citations in the References section.

    1. Author response:

      General Statements:

      The formation of three-dimensional tubes is a fundamental process in the development of organs, and aberrant tube size leads to common diseases and congenital disorders, such as polycystic kidney disease, asthma, and lung hypoplasia. The apical (luminal) extracellular matrix (ECM) plays a critical role in epithelial tube morphogenesis during organ formation, but its composition and organization remain poorly understood. Using the Drosophila embryonic salivary gland as a model, we identify PAPS Synthetase (Papss), an enzyme that synthesizes the universal sulfate donor PAPS, as a critical regulator of tube lumen expansion. Additionally, we identify two zona pellucida (ZP) domain proteins, Piopio (Pio) and Dumpy (Dpy), as key apical ECM components that provide mechanical support to maintain a uniform tube diameter.

      The apical ECM has a distinct composition compared to the basal ECM, featuring a diverse array of components. Many studies of the apical ECM have focused on the role of chitin and its modification, but the composition of the non-chitinous apical ECM and its role, and how modification of the apical ECM affects organogenesis remain elusive. The main findings of this manuscript are listed below.

      (1) Through a deficiency screen targeting ECM-modifying enzymes, we identify Papss as a key enzyme regulating luminal expansion during salivary gland morphogenesis. 

      (2) Our confocal and transmission electron microscopy analyses reveal that Papss mutants exhibit a disorganized apical membrane and condensed aECM, which are at least partially linked to disruptions in Golgi structures and intracellular trafficking. Papss is also essential for cell survival and basal ECM integrity, highlighting the role of sulfation in regulating both apical and basal ECM.

      (3) Salivary gland-specific overexpression of wild-type Papss rescues all defects in Papss mutants, but the catalytically inactive mutant form does not, suggesting that defects in sulfation are the underlying cause of the phenotypes.

      (4) We identify two ZP domain proteins, Piopio (Pio) and Dumpy (Dpy), as key components of the salivary gland aECM. In the absence of Papss, Pio is progressively lost from the aECM, while the Dpy-positive aECM structure is condensed and detaches from the apical membrane, resulting in a narrowed lumen. 

      (5) Mutations in pio or dpy, or in Notopleural (Np), which encodes a matriptase that cleaves Pio, cause the salivary gland lumen to develop alternating bulges and constrictions. Additionally, loss of pio results in loss of Dpy in the salivary gland lumen, suggesting that the Dpy-containing filamentous structures of the aECM are critical for maintaining luminal diameter, with Pio playing an essential role in organizing this structure.

      (6) We further reveal that the cleavage of the ZP domain of Pio by Np is critical for the role of Pio in organizing the aECM structure.

      Overall, our findings underscore the essential role of sulfation in organizing the aECM during tubular organ formation and highlight the mechanical support provided by ZP domain proteins in maintaining tube diameter. Mammals have two isoforms of Papss, Papss1 and Papss2. Papss1 shows ubiquitous expression, with higher levels in glandular cells and salivary duct cells, suggesting a high requirement for sulfation in these cell types. Papss2 shows a more restricted expression, such as in cartilage, and mutations in Papss2 have been associated with skeletal dysplasia in humans. Our analysis of the Drosophila Papss gene, a single ortholog of human Papss1 and Papss2, reveals its multiple roles during salivary gland development. We expect that these findings will provide valuable insights into the function of these enzymes in normal development and disease in humans. Our findings on the key role of two ZP proteins, Pio and Dpy, as major components of the salivary gland aECM also provide valuable information on the organization of the non-chitinous aECM during organ formation.

      We believe that our results will be of broad interest to many cell and developmental biologists studying organogenesis and the ECM, as well as those investigating the mechanisms underlying human diseases associated with conserved mutations.

      Point-by-point description of the revisions:

      We are delighted that all three reviewers were enthusiastic about the work. Their comments and suggestions have improved the paper. The details of the changes we have made in response to each reviewer’s comments are included in italicized text below.

      Reviewer #1 (Evidence, reproducibility and clarity):

      PAPS is required for all sulfotransferase reactions in which a sulfate group is covalently attached to amino acid residues of proteins or to side chains of proteoglycans. This sulfation is crucial for properly organizing the apical extracellular matrix (aECM) and expanding the lumen in the Drosophila salivary gland. Loss of Papss potentially leads to decreased sulfation, disorganizing the aECM, and defects in lumen formation. In addition, Papss loss destabilizes the Golgi structures.

      In Papss mutants, several changes occur in the salivary gland lumen of Drosophila. The tube lumen is very thin and shows irregular apical protrusions. There is a disorganization of the apical membrane and a compaction of the apical extracellular matrix (aECM). The Golgi structures and intracellular transport are disturbed. In addition, the ZP domain proteins Piopio (Pio) and Dumpy (Dpy) lose their normal distribution in the lumen, which leads to condensation and dissociation of the Dpy-positive aECM structure from the apical membrane. This results in a thin and irregularly dilated lumen.

      (1) The authors describe various changes in the lumen in mutants, from thin lumen to irregular expansion. I would like to know the correct lumen diameter, and length, besides the total area, by which one can recognize thin and irregular.

      We have included quantification of the length and diameter of the salivary gland lumen in the stage 16 salivary glands of control, Papss mutant, and salivary gland-specific rescue embryos (Figure 1J, K). As described, Papss mutant embryos have two distinct phenotypes, one group with a thin lumen along the entire lumen and the other group with irregular lumen shapes. Therefore, we separated the two groups for quantification of lumen diameter. Additionally, we have analyzed the degree of variability for the lumen diameter to better capture the range of phenotypes observed (Figure 1K’). These quantifications enable a more precise assessment of lumen morphology, allowing readers to distinguish between thin and irregular lumen phenotypes.
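
      The response does not state which statistic was used for the "degree of variability" of lumen diameter. As one plausible illustration, the sketch below uses the coefficient of variation (standard deviation divided by mean) of diameters sampled at several positions along a single lumen. The metric choice, the function name, and the example measurements are all assumptions made for this example.

```python
from statistics import mean, pstdev


def diameter_variability(diameters_um):
    """Coefficient of variation of lumen diameters sampled along one gland.

    Assumption: variability is summarized as population SD divided by mean.
    """
    return pstdev(diameters_um) / mean(diameters_um)


# Hypothetical diameters (micrometers) at evenly spaced positions.
uniform_thin = [2.0, 2.0, 2.0, 2.0]       # thin along the entire lumen
irregular = [2.0, 6.0, 2.0, 6.0]          # alternating thin and expanded

print(diameter_variability(uniform_thin))
print(diameter_variability(irregular))
```

      A measure of this kind separates the two mutant classes described above: a uniformly thin lumen has low variability despite its small diameter, whereas an irregular lumen has high variability even when its mean diameter is close to normal.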

      (2) The rescue is about 30%, which is not as good as expected. Maybe the wrong isoform was taken. Is it possible to find out which isoform is expressed in the salivary glands, e.g., by RNA in situ Hyb? This could then be used to analyze a more focused rescue beyond the paper.

      Thank you for this point, but we do not agree that the rescue is about 30%. In Papss mutants, about 50% of the embryos show the thin lumen phenotype whereas the other 50% show irregular lumen shapes. In the rescue embryos with a WT Papss, few embryos showed thin lumen phenotypes. About 40% of the rescue embryos showed “normal, fully expanded” lumen shapes, and the remaining 60% showed either irregular (thin+expanded) or slightly overexpanded lumen. It is not uncommon that rescue with the Gal4/UAS system results in a partial rescue because it is often not easy to achieve the balance of the proper amount of the protein with the overexpression system. 

      To address the possibility that the wrong isoform was used, we performed in situ hybridization to examine the expression of different Papss splice forms in the salivary gland. We used probes that detect subsets of splice forms: A/B/C/F/G, D/H, and E/F/H, and found that all probes showed expression in the salivary gland, with varying intensities. The original probe, which detects all splice forms, showed the strongest signals in the salivary gland compared to the new probes, which detect only a subset. However, the difference in the signal intensity may be due to the longer length of the original probe (>800 bp) compared to other probes that were made with much smaller regions (~200 bp). Digoxigenin in the DIG labeling kit for mRNA detection labels the uridine nucleotide in the transcript, and the probes with weaker signals contain fewer uridines (all: 147; ABCFG, 29; D, 36; EFH, 66). We also used the Papss-PD isoform for a salivary gland-specific rescue experiment and obtained similar results to those with Papss-PE (Figure 1I-L, Figure 4D and E).

      Furthermore, we performed additional experiments to validate our findings. We performed a rescue experiment with a mutant form of Papss that has mutations in the critical residues of the catalytic domains of the enzyme, which failed to rescue any phenotypes, including the thin lumen phenotype (Figure 1H, J-L), the number and intensity of WGA puncta (Figure 3I, I’), and cell death (Figure 4D, E). These results provide strong evidence that the defects observed in Papss mutants are due to the lack of sulfation.

      (3) Crb is a transmembrane protein on the apicolateral side of the membrane. Accordingly, the apicolateral distribution can be seen in the control and the mutant. I believe there are no apparent differences here, not even in the amount of expression. However, the view of the cells (frame) shows possible differences. To be sure, a more in-depth analysis of the images is required. Confocal Z-stack images, with 3D visualization and orthogonal projections to analyze the membranes showing Crb staining together with a suitable membrane marker (e.g. SAS or Uif). This is the only way to show whether Crb is incorrectly distributed. Statistics of several papss mutants would also be desirable and not just a single representative image. When do the observed changes in Crb distribution occur in the development of the tubes, only during stage 16? Is papss only involved in the maintenance of the apical membrane? This is particularly important when considering the SJ and AJ, because the latter show no change in the mutants.

      We appreciate your suggestion to more thoroughly analyze Crb distribution. We adapted a method from a previous study (Olivares-Castiñeira and Llimargas, 2017) to quantify Crb signals in the subapical region and apical free region of salivary gland cells. Using E-Cad signals as a reference, we marked the apical cell boundaries of individual cells and calculated the intensity of Crb signals in the subapical region (along the cell membrane) and in the apical free region. We focused on the expanded region of the SG lumen in Papss mutants for quantification, as the thin lumen region was challenging to analyze. This quantification is included in Figure 2D. Statistical analysis shows that Crb signals were more dispersed in SG cells in Papss mutants compared to WT.
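
      The kind of measurement described above can be sketched as follows. This is a minimal illustration only, not the pipeline used in the study: the function names, the mask representation, and the toy pixel values are all assumptions made for the example. It compares the mean signal in the apical free region with the mean signal in the subapical (boundary) region of a single cell, so that a higher ratio corresponds to a more dispersed signal.

```python
from statistics import mean


def region_mean(image, mask):
    """Mean intensity of the pixels in `image` where `mask` is True."""
    values = [px
              for img_row, mask_row in zip(image, mask)
              for px, keep in zip(img_row, mask_row) if keep]
    if not values:
        raise ValueError("mask selects no pixels")
    return mean(values)


def apical_free_to_subapical_ratio(image, subapical_mask):
    """Ratio of mean signal in the apical free region to the subapical region.

    `subapical_mask` marks pixels along the apical cell boundary (hypothetically
    traced from a junctional reference channel); all other pixels of the cell
    are treated as the apical free region.
    """
    free_mask = [[not m for m in row] for row in subapical_mask]
    return region_mean(image, free_mask) / region_mean(image, subapical_mask)


# Toy 4x4 "cell": the outer ring is the subapical region, the centre is free.
T, F = True, False
boundary = [[T, T, T, T],
            [T, F, F, T],
            [T, F, F, T],
            [T, T, T, T]]

# Illustrative intensities only (not measured data).
sharp = [[100, 100, 100, 100],
         [100, 10, 10, 100],
         [100, 10, 10, 100],
         [100, 100, 100, 100]]
dispersed = [[60, 60, 60, 60],
             [60, 40, 40, 60],
             [60, 40, 40, 60],
             [60, 60, 60, 60]]

ratio_sharp = apical_free_to_subapical_ratio(sharp, boundary)
ratio_dispersed = apical_free_to_subapical_ratio(dispersed, boundary)
```

      In this toy case the boundary-restricted pattern gives a lower ratio than the dispersed pattern; per-cell ratios of this kind are what a statistical comparison between genotypes would operate on.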

      (4) A change in the ECM is only inferred based on the WGA localization. This is too few to make a clear statement. WGA is only an indirect marker of the cell surface and glycosylated proteins, but it does not indicate whether the ECM is altered in its composition and expression. Other important factors are missing here. In addition, only a single observation is shown, and statistics are missing.

      We understand your concern that WGA localization alone may not be sufficient to conclude changes in the ECM. However, we observed that luminal WGA signals colocalize with Dpy-YFP in the WT SG (Figure 5-figure supplement 2C), suggesting that WGA detects the aECM structure containing Dpy. The similar behavior of WGA and Dpy-YFP signals in multiple genotypes further supports this idea. In Papss mutants with a thin lumen phenotype, both WGA and Dpy-YFP signals are condensed (Figure 5E-H), and in pio mutants, both are absent from the lumen (Figure 6B, D). We analyzed WGA signals in over 25 samples of WT and Papss mutants, observing consistent phenotypes. We have included the number of samples in the text. While we acknowledge that WGA is an indirect marker, our data suggest that it is a reliable indicator of the aECM structure containing Dpy. 

      (5) Reduced WGA staining is seen in papss mutants, but this could be due to other circumstances. To be sure, a statistic with the number of dots must be shown, as well as an intensity plot on several independent samples. The images are from single confocal sections. It could be that the dots appear in a different Z-plane. Therefore, a 3D visualization of the voxels must be shown to identify and, at best, quantify the dots in the organ.

      We have quantified cytoplasmic punctate WGA signals. Using spinning disk microscopy with super-resolution technology (Olympus SpinSR10 Sora), we obtained high-resolution images of cytoplasmic punctate signals of WGA in WT, Papss mutant, and rescue SGs with the WT and mutant forms of Papss-PD. We then generated 3D reconstructed images of these signals using Imaris software (Figure 3E-H) and quantified the number and intensity of puncta. Statistical analysis of these data confirms the reduction of the number and intensity of WGA puncta in Papss mutants (Figure 3I, I’). The number of WGA puncta was restored by expressing WT Papss but not the mutant form. By using 3D visualization and quantification, we have ensured that our results are not limited to a single confocal section and account for potential variations in Z-plane localization of the dots.
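
      To illustrate why counting in 3D avoids the single-section problem, the sketch below groups bright voxels into puncta across Z-planes. It is a simplified stand-in under stated assumptions, not the algorithm used by Imaris: the function name, the fixed intensity threshold, and the face-to-face (6-connected) grouping rule are all choices made for this example.

```python
from collections import deque

# Offsets to the six face-sharing neighbours of a voxel.
NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def find_puncta(volume, threshold):
    """Return the summed intensity of each 3D punctum in `volume`.

    `volume` is indexed as volume[z][y][x]. Voxels brighter than `threshold`
    that touch face-to-face are grouped into one punctum.
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    seen = set()
    puncta = []
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                if (z, y, x) in seen or volume[z][y][x] <= threshold:
                    continue
                # Breadth-first flood fill from this unvisited bright voxel.
                total = 0
                queue = deque([(z, y, x)])
                seen.add((z, y, x))
                while queue:
                    cz, cy, cx = queue.popleft()
                    total += volume[cz][cy][cx]
                    for dz, dy, dx in NEIGHBOURS:
                        n = (cz + dz, cy + dy, cx + dx)
                        if (0 <= n[0] < nz and 0 <= n[1] < ny and 0 <= n[2] < nx
                                and n not in seen
                                and volume[n[0]][n[1]][n[2]] > threshold):
                            seen.add(n)
                            queue.append(n)
                puncta.append(total)
    return puncta


# Toy stack of two Z-planes (illustrative values, not measured data).
stack = [
    [[0, 5, 0], [0, 0, 0], [0, 0, 0]],  # z = 0
    [[0, 6, 0], [0, 0, 0], [0, 0, 7]],  # z = 1
]
puncta = find_puncta(stack, threshold=4)
```

      Here the punctum spanning both planes is counted once, and the punctum present only in the second plane is still detected, whereas a count restricted to the first plane alone would miss it.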

      (6) A colocalization analysis (statistics) should be shown for the overlap of WGA with ManII-GFP.

      Since WGA labels multiple structures, including the nuclear envelope and ECM structures, we focused on assessing the colocalization of the cytoplasmic WGA punctate signals and ManII-GFP signals. Standard colocalization analysis methods, such as Pearson’s correlation coefficient or Manders’ overlap coefficient, would be confounded by WGA signals in other tissues. Therefore, we used a fluorescent intensity line profile to examine the spatial relationship between WGA and ManII-GFP signals in WT and Papss mutants (Figure 3L, L’).
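
      A line-profile comparison of two channels can be sketched as below. This is an illustrative outline only: the function names, the nearest-pixel sampling, and the toy images are assumptions for the example and do not reproduce the software or settings used for the figure. The same line is sampled in both channels, and the positions of the intensity peaks along the line are compared.

```python
def line_profile(image, start, end, n_samples):
    """Sample `image` at `n_samples` evenly spaced points from `start` to `end`.

    `start` and `end` are (row, column) pixel coordinates; each sample takes
    the value of the nearest pixel (no interpolation).
    """
    if n_samples < 2:
        raise ValueError("n_samples must be at least 2")
    (r0, c0), (r1, c1) = start, end
    profile = []
    for i in range(n_samples):
        t = i / (n_samples - 1)
        row = round(r0 + t * (r1 - r0))
        col = round(c0 + t * (c1 - c0))
        profile.append(image[row][col])
    return profile


def peak_position(profile):
    """Index of the brightest sample along the profile."""
    return max(range(len(profile)), key=profile.__getitem__)


# Toy two-channel image (illustrative values, not measured data).
wga = [[0, 0, 0, 0, 0],
       [1, 2, 9, 2, 1],
       [0, 0, 0, 0, 0]]
manii_gfp = [[0, 0, 0, 0, 0],
             [1, 3, 8, 3, 1],
             [0, 0, 0, 0, 0]]

# Draw the same horizontal line across the middle row of both channels.
wga_profile = line_profile(wga, (1, 0), (1, 4), 5)
manii_profile = line_profile(manii_gfp, (1, 0), (1, 4), 5)
peaks_coincide = peak_position(wga_profile) == peak_position(manii_profile)
```

      Because the line is drawn through a chosen punctum, signal elsewhere in the image (for example at the nuclear envelope) does not enter the comparison, which is the advantage over a whole-image coefficient in this setting.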

      (7) I do not understand how the authors describe "statistics of secretory vesicles" as an axis in Figure 3p. The TEM images do not show labeled secretory vesicles but empty structures that could be vesicles.

      Previous studies have analyzed “filled” electron-dense secretory vesicles in TEM images of SG cells (Myat and Andrew, 2002, Cell; Fox et al., 2010, J Cell Biol; Chung and Andrew, 2014, Development). Consistent with these studies, our WT TEM images show these vesicles. In contrast, Papss mutants show a mix of filled and empty structures. For quantification, we specifically counted the filled electron-dense vesicles (now Figure 3W). A clear description of our analysis is provided in the figure legend.

      (8) The quality of the presented TEM images is too low to judge any difference between control and mutants. Therefore, the supplement must present them in better detail (higher pixel number?).

      We disagree that the quality of the presented TEM images is too low. Our TEM images have sufficient resolution to reveal details of many subcellular structures, such as mitochondrial cisternae. The pdf file of the original submission may not have been high resolution. To address this concern, we have provided several original high-quality TEM images of both WT and Papss mutants at various magnifications in Figure 2-figure supplement 2. Additionally, we have included low-magnification TEM images of WT and Papss mutants in Figure 2H and I to provide a clearer view of the overall SG lumen morphology. 

      (9) Line 266: the conclusion that apical trafficking is "significantly impaired" does not hold. This implies that Papss is essential for apical trafficking, but the analyzed ECM proteins (Pio, Dumpy) are found apically enriched in the mutants, and Dumpy is even secreted. Moreover, they analyze only one marker, Sec15, and don't provide data about the quantification of the secretion of proteins.

      We agree and have revised our statement to “defective sulfation affects Golgi structures and multiple routes of intracellular trafficking”. 

      (10) DCP-1 was used to detect apoptosis in the glands to analyze acellular regions. However, the authors compare ST16 control with ST15 mutant salivary glands, which is problematic. Further, it is not commented on how many embryos were analyzed and how often they detect the dying cells in control and mutant embryos. This part must be improved.

      Thank you for the comment. We agree and have included quantification. We used stage 16 samples from WT and Papss mutants to quantify acellular regions. Since DCP-1 signals are only present at a specific stage of apoptosis, some acellular regions do not show DCP-1 signals. Therefore, we counted acellular regions regardless of DCP-1 signals. We also quantified this in rescue embryos with WT and mutant forms of Papss, which show complete rescue with WT and no rescue with the mutant form, respectively. The graph with a statistical analysis is included (Figure 4D, E).

      (11) WGA and Dumpy show similar condensed patterns within the tube lumen. The authors show that dumpy is enriched from stage 14 onwards. How is it with WGA? Does it show the same pattern from stage 14 to 16? Papss mutants can suffer from a developmental delay in organizing the ECM or lack of internalization of luminal proteins during/after tube expansion, which is the case in the trachea.

      Dpy-YFP and WGA show overlapping signals in the SG lumen throughout morphogenesis. Dpy-YFP is enriched in the SG lumen from stage 11, not stage 14 (Figure 5-figure supplement 2). WGA is also detected in the lumen throughout SG morphogenesis, similar to Dpy. In the original supplemental figure, only a stage 16 SG image was shown for co-localization of Dpy-YFP and WGA signals in the SG lumen. We have now included images from stages 14 and 15 in Figure 5-figure supplement 2C.

      Given that luminal Pio signals are lost at stage 16 only and that Dpy signals appear as condensed structures in the lumen of Papss mutants, it suggests that the internalization of luminal proteins is not impaired in Papss mutants. Rather, these proteins are secreted but fail to organize properly. 

      (12) Line 366. Luminal morphology is characterized by bulging and constrictions. In the trachea, bulges indicate the deformation of the apical membrane and the detachment from the aECM. I can see constrictions and the collapsed tube lumen in Fig. 6C, but I don't find the bulges of the apical membrane in pio and Np mutants. Maybe showing it more clearly and with better quality will be helpful.

      Since the bulging phenotype appears to vary from sample to sample, we have revised the description of the phenotype to “constrictions” to more accurately reflect the consistent observations. We quantified the number of constrictions along the entire lumen in pio and Np mutants and included the graph in Figure 6F.

      (13) The authors state that Papss controls luminal secretion of Pio and Dumpy, as they observe reduced luminal staining of both in papss mutants. However, the mCh-Pio and Dumpy-YFP are secreted towards the lumen. Does papss overexpression change Pio and Dumpy secretion towards the lumen, and could this be another explanation for the multiple phenotypes? 

      Thank you for the comment. To clarify, we did not observe reduced luminal staining of Pio and Dpy in Papss mutants, nor did we state that Papss controls luminal secretion of Pio and Dpy. In Papss mutants, Pio luminal signals are absent specifically at stage 16 (Figure 5H), whereas strong luminal Pio signals are present until stage 15 (Figure 5G). For Dpy-YFP, the signals are not reduced but condensed in Papss mutants from stages 14-16 (Figure 5D, H). 

      It remains unclear whether the apparent loss of Pio signals is due to a loss of Pio protein in the lumen or due to epitope masking resulting from protein aggregation or condensation. As noted in our response to Comment 11, internalization of luminal proteins seems unaffected in Papss mutants; proteins like Pio and Dpy are secreted into the lumen but fail to properly organize. Therefore, we have not tested whether Papss overexpression alters the secretion of Pio or Dpy.

      In our original submission, we incorrectly stated that uniform luminal mCh-Pio signals were unchanged in Papss mutants. Upon closer examination, we found these signals are absent in the expanded luminal region in stage 16 SG (where Dpy-YFP is also absent), and weak mCh-Pio signals colocalize with the condensed Dpy-YFP signals (Figure 5C, D). We have revised the text accordingly. 

      Regulation of luminal ZP protein level is essential to modulate the tube expansion; therefore, Np releases Pio and Dumpy in a controlled manner during st15/16. Thus, the analysis of Pio and Dumpy in NP overexpression embryos will be critical to this manuscript to understand more about the control of luminal ZP matrix proteins.

      Thanks for the insightful suggestion. We overexpressed both the WT and mutant form of Np using UAS-Np.WT and UAS-Np.S990A lines (Drees et al., 2019) and analyzed mCh-Pio, Pio antibody, and Dpy-YFP signals. It is important to note that these overexpression experiments were done in the presence of the endogenous WT Np. 

      Overexpression of Np.WT led to increased levels of mCh-Pio, Pio, and Dpy-YFP signals in the lumen and at the apical membrane. In contrast, overexpression of Np.S990A resulted in a near complete loss of luminal mCh-Pio signals. Pio antibody signals remained strong at the apical membrane but were weaker in the luminal filamentous structures compared to WT.

      Due to the GFP tag present in the UAS-Np.S990A line, we could not reliably analyze Dpy-YFP signals because of overlapping fluorescent signals in the same channel. However, the filamentous Pio signals in the lumen co-localized with GFP signals, suggesting that these structures might also include Dpy-YFP, although this cannot be confirmed definitively. 

      These results suggest that overexpressed Np.S990A may act in a dominant-negative manner, competing with endogenous Np and impairing proper cleavage of Pio (and mCh-Pio). Nevertheless, some level of cleavage by endogenous Np still appears to occur, as indicated by the residual luminal filamentous Pio signals. These new findings have been incorporated into the revised manuscript and are shown in Figure 6H and 6I.

      (14) Minor:

      Fig. 5 C': mCh-Pio and Dumpy-YFP are mixed up at the top of the images.

      Thanks for catching this error.  It has been corrected.

      Sup. Fig7. A shows Pio in purple but B in green. Please indicate it correctly.

      It has been corrected.

      Reviewer #1 (Significance):

      In 2023, the functions of Pio, Dumpy, and Np in the tracheal tubes of Drosophila were published. The study here shows similar results, with the difference that the salivary glands do not possess chitin, but the two ZP proteins Pio and Dumpy take over its function. It is, therefore, a significant and exciting extension of the known function of the three proteins to another tube system. In addition, the authors identify papss as a new protein and show its essential function in forming the luminal matrix in the salivary glands. Considering the high degree of conservation of these proteins in other species, the results presented are crucial for future analyses and will have further implications for tubular development, including humans.

      Reviewer #2 (Evidence, reproducibility and clarity):

      Summary:

      There is growing appreciation for the importance of luminal (apical) ECM in tube development, but such matrices are much less well understood than basal ECMs. Here the authors provide insights into the aECM that shapes the Drosophila salivary gland (SG) tube and the importance of PAPSS-dependent sulfation in its organization and function.

      The first part of the paper focuses on careful phenotypic characterization of papss mutants, using multiple markers and TEM. This revealed reduced markers of sulfation (Alcian Blue staining) and defects in both apical and basal ECM organization, Golgi (but not ER) morphology, number and localization of other endosomal compartments, plus increased cell death. The authors focus on the fact that papss mutants have an irregular SG lumen diameter, with both narrowed regions and bulged regions. They address the pleiotropy, showing that preventing the cell death and resultant gaps in the tube did not rescue the SG luminal shape defects and discussing similarities and differences between the papss mutant phenotype and those caused by more general trafficking defects. The analysis uses a papss nonsense mutant from an EMS screen - I appreciate the rigorous approach the authors took to analyze transheterozygotes (as well as homozygotes) plus rescued animals in order to rule out effects of linked mutations.

      The 2nd part of the paper focuses on the SG aECM, showing that Dpy and Pio ZP protein fusions localize abnormally in papss mutants and that these ZP mutants (and Np protease mutants) have similar SG lumen shaping defects to the papss mutants. A key conclusion is that SG lumen defects correlate with loss of a Pio+Dpy-dependent filamentous structure in the lumen. These data suggest that ZP protein misregulation could explain this part of the papss phenotype.

      Overall, the text is very well written and clear. Figures are clearly labeled. The methods involve rigorous genetic approaches, microscopy, and quantifications/statistics and are documented appropriately. The findings are convincing, with just a few things about the fusions needing clarification.

      Minor comments

      (1) Although the Dpy and Qsm fusions are published reagents, it would still be helpful to mention whether the tags are C-terminal as suggested by the nomenclature, and whether Westerns have been performed, since (as discussed for Pio) cleavage could also affect the appearance of these fusions.

      Thanks for the comment. Dpy-YFP is a knock-in line in which YFP is inserted into the middle of the dpy locus (Lye et al., 2014; the insertion site is available on Flybase). mCh-Qsm is also a knock-in line, with mCh inserted near the N-terminus of the qsm gene using phi-mediated recombination using the qsm<sup>MI07716</sup> line (Chu and Hayashi, 2021; insertion site available on Flybase). Based on this, we have updated the nomenclature from Qsm-mCh to mCh-Qsm throughout the manuscript to accurately reflect the tag position. To our knowledge, no western blot has been performed on Dpy-YFP or mCh-Qsm lines. We have mentioned this explicitly in the Discussion.  

      (2) The Dpy-YFP reagent is a non-functional fusion and therefore may not be a wholly reliable reporter of Dpy localization. There is no antibody confirmation. As other reagents are not available to my knowledge, this issue can be addressed with text acknowledgement of possible caveats.

      Thanks for raising this important point. We have added a caveat in the Discussion noting this limitation and the need for additional tools, such as an antibody or a functional fusion protein, to confirm the localization of Dpy.

      (3) TEM was done by standard chemical fixation, which is fine for viewing intracellular organelles, but high pressure freezing probably would do a better job of preserving aECM structure, which looks fairly bad in Fig. 2G WT, without evidence of the filamentous structures seen by light microscopy. Nevertheless, the images are sufficient for showing the extreme disorganization of aECM in papss mutants.

      We agree that HPF is a better method and intend to use the HPF system in future studies. We acknowledge that chemical fixation contributes to the appearance of a gap between the apical membrane and the aECM, which we did not observe in the HPF/FS method (Chung and Andrew, 2014). Despite this, the TEM images still clearly reveal that Papss mutants show a much thinner and more electron-dense aECM compared to WT (Figure 2H, I), consistent with the condensed WGA, Dpy, and Pio signals in our confocal analyses. As the reviewer mentioned, we believe that the current TEM data are sufficient to support the conclusion of severe aECM disorganization and Golgi defects in Papss mutants.

      (4) The authors may consider citing some of the work that has been done on sulfation in nematodes, e.g. as reviewed here: https://pubmed.ncbi.nlm.nih.gov/35223994/ Sulfation has been tied to multiple aspects of nematode aECM organization, though not specifically to ZP proteins.

      Thank you for the suggestion. Pioneering studies in C. elegans have highlighted the key role of sulfation in diverse developmental processes, including neuronal organization, reproductive tissue development, and phenotypic plasticity. We have now cited several works.  

      Reviewer #2 (Significance):

      This study will be of interest to researchers studying developmental morphogenesis in general and specifically tube biology or the aECM. It should be particularly of interest to those studying sulfation or ZP proteins (which are broadly present in aECMs across organisms, including humans).

      This study adds to the literature demonstrating the importance of luminal matrix in shaping tubular organs and greatly advances understanding of the luminal matrix in the Drosophila salivary gland, an important model of tubular organ development and one that has key matrix differences (such as no chitin) compared to other highly studied Drosophila tubes like the trachea.

      The detailed description of the defects resulting from papss loss suggests that there are multiple different sulfated targets, with a subset specifically relevant to aECM biology. A limitation is that specific sulfated substrates are not identified here (e.g. are these the ZP proteins themselves or other matrix glycoproteins or lipids?); therefore it's not clear how direct or indirect the effects of papss are on ZP proteins. However, this is clearly a direction for future work and does not detract from the excellent beginning made here.

      My expertise: I am a developmental geneticist with interests in apical ECM

      Reviewer #3 (Evidence, reproducibility and clarity):

      In this work Woodward et al focus on the apical extracellular matrix (aECM) in the tubular salivary gland (SG) of Drosophila. They provide new insights into the composition of this aECM, formed by ZP proteins, in particular Pio and Dumpy. They also describe the functional requirements of PAPSS, a critical enzyme involved in sulfation, in regulating the expansion of the lumen of the SG. A detailed cellular analysis of Papss mutants indicates defects in the apical membrane, the aECM and in Golgi organization. They also find that Papss controls the proper organization of the Pio-Dpy matrix in the lumen. The work is well presented and the results are consistent.

      Main comments

      - This work provides a detailed description of the defects produced by the absence of Papss. In addition, it provides many interesting observations at the cellular and tissular level. However, this work lacks a clear connection between these observations and the role of sulfation. Thus, the mechanisms underlying the phenotypes observed are elusive. Efforts directed to strengthen this connection (ideally experimentally) would greatly increase the interest and relevance of this work.

      Thank you for this thoughtful comment. To directly test whether the phenotypes observed in Papss mutants are due to the loss of sulfation activity, we generated transgenic lines expressing catalytically inactive forms of Papss, UAS-PapssK193A, F593P, in which key residues in the APS kinase and ATP sulfurylase domains are mutated. Unlike WT UAS-Papss (both the Papss-PD and Papss-PE isoforms), the catalytically inactive UAS-Papssmut failed to rescue any of the phenotypes, including the thin lumen phenotype (Figure 1I-L), altered WGA signals (Figure 3I, I’) and the cell death phenotype (Figure 4D, E). These findings strongly support the conclusion that the enzymatic sulfation activity of Papss is essential for the developmental processes described in this study.

      - A main issue that arises from this work is the role of Papss at the cellular level. The results presented convincingly indicate defects in Golgi organization in Papss mutants. Therefore, the defects observed could stem from general defects in the secretion pathway rather than from specific defects on sulfation. This could even underlie general/catastrophic cellular defects and lead to cell death (as observed).

      This observation has different implications. Is this effect observed in SGs also observed in other cells in the embryo? If Papss has a general role in Golgi organization this would be expected, as Papss encodes the only PAPS synthetase in Drosophila.

      Can the authors test any other mutant that specifically affect Golgi organization and investigate whether this produces a similar phenotype to that of Papss?

      Thank you for the comment. To address whether the defects observed in Papss mutants stem from general disruption of the secretory pathway due to Golgi disorganization, we examined mutants of two key Golgi components: Grasp65 and GM130. 

      In Grasp65 mutants, we observed significant defects in SG lumen morphology, including highly irregular SG lumen shape and multiple constrictions (100%; n=10/10). However, the lumen was not uniformly thin as in Papss mutants. In contrast, GM130 mutants (a line that was very sick and difficult to grow) showed relatively normal salivary gland morphology in the few embryos that survived to stage 16 (n=5/5). It is possible that only embryos with mild phenotypes progressed to this stage, limiting interpretation. These data have now been included in Figure 3-figure supplement 2. Overall, while Golgi disruption can affect SG morphology, the specific phenotypes seen in Papss mutants are not fully recapitulated by Grasp65 or GM130 loss.

      - A model that conveys the different observations and that proposes a function for Papss in sulfation and Golgi organization (independent or interdependent?) would help to better present the proposed conclusions. In particular, the paper would be more informative if it proposed a mechanism or hypothesis of how sulfation affects SG lumen expansion. Is sulfation regulating a factor that in turn regulates Pio-Dpy matrix? Is it regulating Pio-Dpy directly? Is it regulating a product recognized by WGA?

      For instance, investigating Alcian blue or sulfotyrosine staining in pio, dpy mutants could help to understand whether Pio, Dpy are targets of sulfation.

      Thank you for the comment. We’re also very interested in learning whether the regulation of the Pio-Dpy matrix is a direct or indirect consequence of the loss of sulfation on these proteins. One possible scenario is that sulfation directly regulates the Pio-Dpy matrix by regulating protein stability through the formation of disulfide bonds between the conserved Cys residues responsible for ZP module polymerization. Additionally, the Dpy protein contains hundreds of EGF modules that are highly susceptible to O-glycosylation. Sulfation of the glycan groups attached to Dpy may be critical for its ability to form a filamentous structure. Without sulfation, the glycan groups on Dpy may not interact properly with the surrounding materials in the lumen, resulting in an aggregated and condensed structure. These possibilities are discussed in the Discussion.

      We have not analyzed sulfation levels in pio or dpy mutants because sulfation levels in mutants of single ZP domain proteins may not provide much information. A substantial number of proteoglycans, glycoproteins, and proteins (with up to 1% of all tyrosine residues in an organism’s proteins estimated to be sulfated) are modified by sulfation, so changes in sulfation levels in a single mutant may be subtle. In particular, the existing dpy mutant line is an insertion mutant of a transposable element; therefore, the sulfation sites would still remain in this mutant.

      - Interpretation of Papss effects on Pio and Dpy would be desired. The results presented indicate loss of Pio antibody staining but normal presence of cherry-Pio. This is difficult to interpret. How are these results of Pio antibody and cherry-Pio correlating with the results in the trachea described recently (Drees et al. 2023)?

      In our original submission, we stated that the uniform luminal mCh-Pio signals were not changed in Papss mutants, but after re-analysis, we found that these signals were actually absent from the expanded luminal region in stage 16 SG (where Dpy-YFP is also absent), and weak mCh-Pio signals colocalize with the condensed Dpy-YFP signals (Figure 5C, D). We have revised the text accordingly. 

      After cleavages by Np and furin, the Pio protein should have three fragments. The N-terminal region contains the N-terminal half of the ZP domain, and mCh-Pio signals show this fragment. The very C-terminal region should localize to the membrane as it contains the transmembrane domain. We think the middle piece, the C-terminal ZP domain, is recognized by the Pio antibody. The mCh-Pio and Pio antibody signals in the WT trachea (Drees et al., 2023) are similar to those in the SG. mCh-Pio signals are detected in the tracheal lumen as uniform signals, at the apical membrane, and in cytoplasmic puncta. Pio antibody signals are exclusively in the tracheal lumen and show more heterogeneous filamentous signals.

      In Papss mutants, the middle fragment (the C-terminal ZP domain) seems to be most affected because the Pio antibody signals are absent from the lumen. The loss of Pio antibody signals could be due to protein degradation or epitope masking caused by aECM condensation and protein misfolding. This fragment seems to be key for interacting with Dpy, since Pio antibody signals always colocalize with Dpy-YFP. The N-terminal mCh-Pio fragment does not appear to play a significant role in forming a complex with Dpy in WT (but still aggregated together in Papss mutants), and this can be tested in future studies.

      In response to Reviewer 1’s comment, we performed an additional experiment to test the role of Np in cleaving Pio to help organize the SG aECM. In this experiment, we overexpressed the WT and mutant form of Np using UAS-Np.WT and UAS-Np.S990A lines (Drees et al., 2019) and analyzed mCh-Pio, Pio antibody, and Dpy-YFP signals. Np.WT overexpression resulted in increased levels of mCh-Pio, Pio, and Dpy-YFP signals in the lumen and at the apical membrane. However, overexpression of Np.S990A resulted in the absence of luminal mCh-Pio signals. Pio antibody signals were strong at the apical membrane but rather weak in the luminal filamentous structures. Since the UAS-Np.S990A line has the GFP tag, we could not reliably analyze Dpy-YFP signals due to overlapping Np.S990A.GFP signals in the same channel. However, the luminal filamentous Pio signals co-localized with GFP signals, and we assume that these overlapping signals could be Dpy-YFP signals. 

      These results suggest that overexpressed Np.S990A may act in a dominant-negative manner, competing with endogenous Np and impairing proper cleavage of Pio (and mCh-Pio). Nevertheless, some level of cleavage by endogenous Np still appears to occur, as indicated by the residual luminal filamentous Pio signals. These new findings have been incorporated into the revised manuscript and are shown in Figure 6H and 6I. 

      A proposed model of the Pio-Dpy aECM in WT, Papss, pio, and Np mutants has now been included in Figure 7.

      -  What does the WGA staining in the lumen reveal? This staining seems to be affected differently in pio and dpy mutants: in pio mutants it disappears from the lumen (as dpy-YFP does), but in dpy mutants it seems to be maintained. How do the authors interpret these findings? How does the WGA matrix relate to sulfated products (using Alcian blue or sulfotyrosine)?

      WGA binds to sialic acid and N-acetylglucosamine (GlcNAc) residues on glycoproteins and glycolipids. GlcNAc is a key component of the glycosaminoglycan (GAG) chains that are covalently attached to the core protein of a proteoglycan, which is abundant in the ECM. We think WGA detects GlcNAc residues in the components of the aECM, including Dpy as a core component, based on the following data. 1) WGA and Dpy colocalize in the lumen, both in WT (as thin filamentous structures) and Papss mutant background (as condensed rod-like structures), and 2) are absent in pio mutants. WGA signals are still present in a highly condensed form in dpy mutants. That’s probably because the dpy mutant allele (dpyov1) has an insertion of a transposable element (blood element) into intron 11 and this insertion may have caused the Dpy protein to misfold and condense. We added the information about the dpy allele to the Results section and discussed it in the Discussion.

      Minor points:

      - The morphological phenotypic analysis of Papss mutants (homozygous and transheterozygous) is a bit confusing. The general defects are higher in Papss homozygous than in transheterozygotes over a deficiency. Maybe quantifying the defects in the heterozygote embryos in the Papss mutant collection could help to figure out whether these defects relate to Papss mutation.

      We analyzed the morphology of heterozygous Papss mutant embryos. They were all normal. The data and quantifications have now been added to Figure 1-figure supplement 3. 

      - The conclusion that the apical membrane is affected in Papss mutants is not strongly supported by the results presented with the pattern of Crb (Fig 2). Further evidence should be provided. Maybe the TEM analysis could help to support this conclusion.

      We quantified Crb levels in the sub-apical and medial regions of the cell and included this new quantification in Figure 2D. TEM images showed variation in the irregularity of the apical membrane, even in WT, and we could not draw a solid conclusion from these images.

      - It is difficult to understand why in Papss mutants the levels of WGA increase. Can the authors elaborate on this?

      We think that when Dpy (and many other aECM components) are condensed and aggregated into the thin, rod-like structure in Papss mutants, the sugar residues attached to them must also be concentrated and shown as increased WGA signals.   

      - The explanation about why Pio antibody and mcherry-Pio show different patterns is not clear. If the antibody recognizes the C-t region, shouldn't it be clearly found at the membrane rather than the lumen?

      The Pio protein is also cleaved by furin protease (Figure 5B). We think the Pio fragment recognized by the antibody should be a “C-terminal ZP domain”, which is a middle piece after furin + Np cleavages. 

      - The qsm information does not seem to provide any relevant information to the aECM, or sulfation.

      Since Qsm has been shown to bind to Dpy and remodel Dpy filaments in the muscle tendon (Chu and Hayashi, 2021), we believe that the different behavior of Qsm in the SG is still informative. As mentioned briefly in the Discussion, the cleaved Qsm fragment may localize differently, like Pio, and future work will need to test this. We have shortened the description of the Qsm localization in the manuscript and moved the details to the figure legend of Figure 5-figure supplement 3.

      Reviewer #3 (Significance):

      Previous reports already indicated a role for Papss in sulfation in SG (Zhu et al 2005). Now this work provides a more detailed description of the defects produced by the absence of Papss. In addition, it provides relevant data related to the nature and requirements of the aECM in the SG. Understanding the composition and requirements of aECM during organ formation is an important question. Therefore, this work may be relevant in the fields of cell biology and morphogenesis.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      In this manuscript, the authors use anatomical tracing and slice physiology to investigate the integration of thalamic (ATN) and retrosplenial cortical (RSC) signals in the dorsal presubiculum (PrS). This work will be of interest to the field, as the postsubiculum is thought to be a key region for integrating internal head direction representations with external landmarks. The main result is that ATN and RSC inputs drive the same L3 PrS neurons, which exhibit superlinear summation to near-coincident inputs. Moreover, this activity can induce bursting in L4 PrS neurons, which can pass the signals to LMN (perhaps gated by cholinergic input).

      Strengths:

      The slice physiology experiments are carefully done. The analyses are clear and convincing, and the figures and results are well-composed. Overall, these results will be a welcome addition to the field.

      We thank this reviewer for the positive comment on our work.

      Weaknesses:

      The conclusions about the circuit-level function of L3 PrS neurons sometimes outstrip the data, and their model of the integration of these inputs is unclear. I would recommend some revision of the introduction and discussion. I also had some minor comments about the experimental details and analysis.

      Specific major comments:

      (1) I found that the authors' claims sometimes outstrip their data, given that there were no in vivo recordings during behavior. For example, in the abstract, their results indicate "that layer 3 neurons can transmit a visually matched HD signal to medial entorhinal cortex", and in the conclusion they state "[...] cortical RSC projections that carry visual landmark information converge on layer 3 pyramidal cells of the dorsal presubiculum". However, they never measured the nature of the signals coming from ATN and RSC to L3 PrS (or signals sent to downstream regions). Their claim is somewhat reasonable with respect to ATN, where the majority of neurons encode HD, but neurons in RSC encode a vast array of spatial and non-spatial variables other than landmark information (e.g., head direction, egocentric boundaries, allocentric position, spatial context, task history to name a few), so making strong claims about the nature of the incoming signals is unwarranted.

      We agree of course that RSC does not only encode landmark information. We have clarified this point in the introduction (lines 69-70) and phrased it more carefully in the abstract (removed the word ‘landmark’ in line 17) and in the introduction (lines 82-83). In the discussion we explicitly state that ‘In our slice work we are blind to the exact nature of the signal that is carried by ATN and RSC axons’ (lines 522-523).

      (2) Related to the first point, the authors hint at, but never explain, how coincident firing of ATN and RSC inputs would help anchor HD signals to visual landmarks. Although the lesion data (Yoder et al. 2011 and 2015) support their claims, it would be helpful if the proposed circuit mechanism was stated explicitly (a schematic of their model would be helpful in understanding the logic). For example, how do neurons integrate the "right" sets of landmarks and HD signals to ensure stable anchoring? Moreover, it would be helpful to discuss alternative models of HD-to-landmark anchoring, including several studies that have proposed that the integration may (also?) occur in RSC (Page & Jeffrey, 2018; Yan, Burgess, Bicanski, 2021; Sit & Goard, 2023). Currently, much of the Discussion simply summarizes the results of the study; this space could be better used in mapping the findings to the existing literature on the overarching question of how HD signals are anchored to landmarks.

      We agree with the reviewer on the importance of the question: how do neurons integrate the “right” sets of landmarks and HD signals to ensure stable anchoring? Based on our results we provide a schematic to illustrate possible scenarios, and we include it as a supplementary figure (Figure 1, to be included in the ms as Figure 7-figure supplement 2), as well as a new paragraph in the discussion section (lines 516-531). We point out that critical information on the convergence and divergence of functionally defined inputs is still lacking, both for principal cells and interneurons.

      Interestingly, recent evidence from functional ultrasound imaging and electrical single cell recording demonstrated that visual objects may refine head direction coding, specifically in the dorsal presubiculum (Siegenthaler et al. bioRxiv 2024.10.21.619417; doi: https://doi.org/10.1101/2024.10.21.619417). The increase in firing rate for HD cells whose preferred firing direction corresponds to a visual landmark could be supported by the supralinear summation of thalamic HD signals and retrosplenial input described in our study. We include this point in the discussion (line 460-462), and hope that our work will spur further investigations.

      Reviewer #2 (Public Review):

      Richevaux et al. investigate how anterior thalamic (AD) and retrosplenial (RSC) inputs are integrated by single presubicular (PrS) layer 3 neurons. They show that these two inputs converge onto single PrS layer 3 principal cells. By performing dual-wavelength photostimulation of these two inputs in horizontal slices, the authors show that in most layer 3 cells, these inputs summate supra-linearly. They extend the experiments by focusing on putative layer 4 PrS neurons, and show that they receive neither direct anterior thalamic nor direct retrosplenial inputs; rather, they are (indirectly) driven to burst firing in response to strong activation of the PrS network.

      This is a valuable study that investigates an important question - how visual landmark information (possibly mediated by retrosplenial inputs) converges and integrates with HD information (conveyed by the AD nucleus of the thalamus) within PrS circuitry. The data indicate that near-coincident activation of retrosplenial and thalamic inputs leads to non-linear integration in target layer 3 neurons, thereby offering a potential biological basis for landmark + HD binding.

      The main limitations relate to the anatomical annotation of 'putative' PrS L4 neurons, and to the presentation of retrosplenial/thalamic input modularity. Specifically, more evidence should be provided to convincingly demonstrate that the 'putative L4 neurons' of the PrS are not distal subicular neurons (as the authors' anatomy and physiology experiments seem to indicate). The modularity of thalamic and retrosplenial inputs could be better clarified in relation to the known PrS modularity.

      We thank the reviewer for their important feedback. We discuss what defines presubicular layer 4 in horizontal slices, cite relevant literature, and provide new and higher resolution images. See below for detailed responses to the reviewer’s comments, in the section ‘recommendations to authors’.

      Reviewer #3 (Public Review):

      Summary:

      The authors sought to determine, at the level of individual presubiculum pyramidal cells, how allocentric spatial information from the retrosplenial cortex was integrated with egocentric information from the anterior thalamic nuclei. Employing a dual opsin optogenetic approach with patch clamp electrophysiology, Richevaux and colleagues found that around three-quarters of layer 3 pyramidal cells in the presubiculum receive monosynaptic input from both brain regions. While some interesting questions remain (e.g. the role of inhibitory interneurons in gating the information flow through different layers of presubiculum), this paper provides valuable insights into the microcircuitry of this brain region and the role that it may play in spatial navigation.

      Strengths:

      One of the main strengths of this manuscript was that the dual opsin approach allowed the direct comparison of different inputs within an individual neuron, helping to control for what might otherwise have been an important source of variation. The experiments were well-executed and the data was rigorously analysed. The conclusions were appropriate to the experimental questions and were well-supported by the results. These data will help to inform in vivo experiments aimed at understanding the contribution of different brain regions in spatial navigation and could be valuable for computational modelling.

      Weaknesses:

      Some attempts were made to gain mechanistic insights into how inhibitory neurotransmission may affect processing in the presubiculum (e.g. Figure 5) but these experiments were a little underpowered and the analysis carried out could have been more comprehensively undertaken, as was done for other experiments in the manuscript.

      We agree that the role of interneurons for landmark anchoring through convergence in Presubiculum requires further investigation. In our latest work on the recruitment of VIP interneurons we begin to address this point in slices (Nassar et al., 2024 Neuroscience. doi: 10.1016/j.neuroscience.2024.09.032.); more work in behaving animals will be needed.

      Reviewer #1 (Recommendations For The Authors):

      Full comments below. Beyond the (mostly minor) issues noted below, this is a very well-written paper and I look forward to seeing it in print.

      Major comments:

      (1) I found that the authors' claims sometimes outstrip their data, given that there were no in vivo recordings during behavior. For example, in the abstract, their results indicate "that layer 3 neurons can transmit a visually matched HD signal to medial entorhinal cortex", and in the conclusion they state "[...] cortical RSC projections that carry visual landmark information converge on layer 3 pyramidal cells of the dorsal presubiculum". However, they never measured the nature of the signals coming from ATN and RSC to L3 PrS (or signals sent to downstream regions). Their claim is somewhat reasonable with respect to ATN, where the majority of neurons encode HD, but neurons in RSC encode a vast array of spatial and non-spatial variables other than landmark information (e.g., head direction, egocentric boundaries, allocentric position, spatial context, task history to name a few), so making strong claims about the nature of the incoming signals is unwarranted.

      Our study was motivated by the seminal work from Yoder et al., 2011 and 2015, indicating that visual landmark information is processed in PoS and from there transmitted to the LMN.  Based on that, and in the interest of readability, we may have used an oversimplified shorthand for the type of signal carried by RSC axons. There are numerous studies indicating a role for RSC in encoding visual landmark information (Auger et al., 2012; Jacob et al., 2017; Lozano et al., 2017; Fischer et al., 2020; Keshavarzi et al., 2022; Sit and Goard, 2023); we agree of course that this is certainly not the only variable that is represented. Therefore we change the text to make this point clear:

      Abstract, line 17: removed the word ‘landmark’

      Introduction, line 69: added “...and supports an array of cognitive functions including memory, spatial and non-spatial context and navigation (Vann et al., 2009; Vedder et al., 2017). ”

      Introduction, line 82: changed “...designed to examine the convergence of visual landmark information, that is possibly integrated in the RSC, and vestibular based thalamic head direction signals”.

      Discussion, line 522-523: added “In our slice work we are blind to the exact nature of the signal that is carried by ATN and RSC axons.”

      (2) Related to the first point, the authors hint at, but never explain, how coincident firing of ATN and RSC inputs would help anchor HD signals to visual landmarks. Although the lesion data (Yoder et al., 2011 and 2015) support their claims, it would be helpful if the proposed circuit mechanism was stated explicitly (a schematic of their model would be helpful in understanding the logic). For example, how do neurons integrate the "right" sets of landmarks and HD signals to ensure stable anchoring? Moreover, it would be helpful to discuss alternative models of HD-to-landmark anchoring, including several studies that have proposed that the integration may (also?) occur in RSC (Page & Jeffrey, 2018; Yan, Burgess, Bicanski, 2021; Sit & Goard, 2023). Currently, much of the Discussion simply summarizes the results of the study; this space could be better used in mapping the findings to the existing literature on the overarching question of how HD signals are anchored to landmarks.

      We suggest a physiological mechanism for inputs to be selectively integrated and amplified, based on temporal coincidence. Of course there are still many unknowns, including the divergence of connections from a single thalamic or retrosplenial input neuron. The anatomical connectivity of inputs will be critical, as well as the subcellular arrangement of synaptic contacts. Neuromodulation and changes in the balance of excitation and inhibition will need to be factored in. While it is premature to provide a comprehensive explanation for landmark anchoring of HD signals in PrS, our results have led us to include a schematic, to illustrate our thinking (Figure 1, see below).

      Do HD tuned inputs from thalamus converge on similarly tuned HD neurons only? Is divergence greater for the retrosplenial inputs? If so, thalamic input might pre-select a range of HD neurons, and converging RSC input might narrow down the precise HD neurons that become active (Figure 1). In the future, the use of activity dependent labeling strategies might help to tie together information on the tuning of pre-synaptic neurons, and their convergence or divergence onto functionally defined postsynaptic target cells. This critical information is still lacking, for principal cells, and also for interneurons. 

      Interneurons may have a key role in HD-to-landmark anchoring. SST interneurons support stability of HD signals (Simonnet et al., 2017) and VIP interneurons flexibly disinhibit the system (Nassar et al., 2024). Could disinhibition be a necessary condition to create a window of opportunity for updating the landmark anchoring of the attractor? Single PV interneurons might receive thalamic and retrosplenial inputs non-specifically. We need to distinguish the conditions for when the excitation-inhibition balance in pyramidal cells may become tipped towards excitation, and the case of coincident, co-tuned thalamic and retrosplenial input may be such a condition. Elucidating the principles of hardwiring of inputs, as for example, selective convergence, will be necessary. Moreover, neuromodulation and oscillations may be critical for temporal coordination and precise temporal matching of HD-to-landmark signals.

      We note that matching directional with visual landmark information based on temporal coincidence as described here does not require synaptic plasticity. Algorithms for dynamic control of cognitive maps without synaptic plasticity have been proposed (Whittington et al., 2025, Neuron): information may be stored in neural attractor activity, and the idea that working memory may rely on recurrent updates of neural activity might generalize to the HD system. We include these considerations in the discussion (line 497-501; 521-531) and hope that our work will spur further experimental investigations and modeling work.

      While the focus of our work has been on PrS, we agree that RSC also treats HD and landmark signals. Possibly the RSC registers a direction to a landmark rather than comparing it with the current HD (Sit & Goard, 2023). We suggest that this integrated information then reaches PrS. In contrast to RSC, PrS is uniquely positioned to update the signal in the LMN (Yoder et al., 2011), cf. discussion (line 516-520).

      Minor comments:

      (1) Fig 1 - Supp 1: It appears there is a lot of input to PrS from higher visual regions, could this be a source of landmark signals?

      Yes, higher visual regions projecting to PrS may also be a source of landmark information, even if the visual signal is not integrated with HD at that stage (Sit & Goard 2023). The anatomical projection from the visual cortex was first described by Vogt & Miller (1983), but not studied on a functional level so far.

      (2) Fig 2F, G: Although the ATN and RSC measurements look quite similar, there are no stats included. The authors should use an explicit hypothesis test.

      We now compare the distributions of amplitudes and of latencies, using the Mann-Whitney U test. No significant difference between the two groups was found. Added in the figure legend: 2F, “Mann-Whitney U test revealed no significant difference (p = 0.95)”. 2G, “Mann-Whitney U test revealed no significant difference (p = 0.13)”.
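
For readers who wish to reproduce this kind of comparison, a two-sided Mann-Whitney U test can be sketched as below. This is not the authors' analysis code: the function name `mann_whitney_u` and the example amplitude values are hypothetical, and the p-value is computed with the normal approximation (with tie and continuity corrections) rather than from the exact distribution, which may matter for small samples.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (U statistic of sample x, approximate two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    # Pool both samples, remembering group membership (0 = x, 1 = y).
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values share the average of the ranks i+1 .. j.
        midrank = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j

    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0

    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all values identical: no evidence of a difference

    # Continuity-corrected z score, clamped at zero.
    z = max(abs(u_x - mean_u) - 0.5, 0.0) / math.sqrt(var_u)
    p_two_sided = math.erfc(z / math.sqrt(2.0))
    return u_x, p_two_sided


# Hypothetical EPSC amplitudes (pA); NOT data from the study.
atn_amplitudes = [112.0, 95.5, 130.2, 88.1, 101.7, 140.3]
rsc_amplitudes = [105.4, 99.0, 125.8, 92.6, 110.1, 133.9]
u_stat, p_value = mann_whitney_u(atn_amplitudes, rsc_amplitudes)
print(f"U = {u_stat:.1f}, p = {p_value:.3f}")
```

In practice, an established statistics package would normally be preferred over a hand-written routine, since it can also provide the exact p-value for small samples.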

      (3) Fig 2 - Supp 2A, C: Again, no statistical tests. This is particularly important for panel A, where the authors state that the latencies are similar but the populations appear to be different.

      Inputs from ATN and RSC have a similar ‘jitter’ (latency standard deviation) and ‘tau decay’. We added in the Fig 2 - Supp 2 figure legend: A, “Mann-Whitney U test revealed no significant difference (p = 0.26)”. C, “Mann-Whitney U test revealed no significant difference (p = 0.87)”.

      As a complementary measure for the reviewer, we performed the Kolmogorov-Smirnov test which confirmed that the populations’ distributions for ‘jitter’ were not significantly different, p = 0.1533.
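
The Kolmogorov-Smirnov comparison mentioned above rests on the largest vertical distance between the two empirical cumulative distribution functions. A minimal sketch of that statistic follows; the function name `ks_statistic` and the jitter values are hypothetical, and the sketch deliberately stops at the statistic D, leaving the p-value to a statistics library.

```python
import bisect


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_x(v) - F_y(v)|."""
    xs, ys = sorted(x), sorted(y)
    n1, n2 = len(xs), len(ys)
    d = 0.0
    # The empirical CDFs only change at observed values, so it is
    # sufficient to evaluate the difference at every data point.
    for v in xs + ys:
        f_x = bisect.bisect_right(xs, v) / n1
        f_y = bisect.bisect_right(ys, v) / n2
        d = max(d, abs(f_x - f_y))
    return d


# Hypothetical latency jitter values (ms); NOT data from the study.
atn_jitter = [0.21, 0.35, 0.28, 0.40, 0.19]
rsc_jitter = [0.25, 0.44, 0.31, 0.52, 0.38]
print(ks_statistic(atn_jitter, rsc_jitter))
```

Unlike the Mann-Whitney U test, which is mainly sensitive to a shift in location, this statistic responds to any difference in the shape of the two distributions.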

      (4) Fig 4E, F: The statistics reporting is confusing, why are asterisks above the plots and hashmarks to the side?

      Asterisks refer to a comparison between ‘dual’ and ‘sum’ for each of the 5 stimulations in a Sidak multiple comparison test. Hashmarks refer to comparison of the nth stimulation to the 1st one within dual stimulation events (Friedman + Dunn’s multiple comparison test). We mention the two-way ANOVA p-value in the legend (Sum v Dual, for both Amplitude and Surface).
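
To make the ‘dual’ versus ‘sum’ logic concrete, the sketch below computes, for each stimulation in a train, the ratio of the response to dual stimulation over the arithmetic sum of the two single-input responses, and shows the Sidak adjustment of p-values for a family of comparisons. The function names, amplitudes and raw p-values are hypothetical; the sketch covers only the adjustment formula, not the two-way ANOVA or the post-hoc tests themselves.

```python
def summation_ratios(dual, atn_only, rsc_only):
    """Ratio of the dual response to the arithmetic sum of single responses.

    A ratio above 1 indicates supralinear summation for that stimulation.
    """
    return [d / (a + r) for d, a, r in zip(dual, atn_only, rsc_only)]


def sidak_adjust(p_values):
    """Sidak-adjusted p-values for a family of m comparisons."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]


# Hypothetical EPSP amplitudes (mV) for 5 stimulations; NOT data from the study.
dual = [4.1, 5.6, 6.0, 5.8, 5.5]
atn_only = [2.0, 2.1, 2.0, 1.9, 1.9]
rsc_only = [1.9, 2.0, 2.0, 2.0, 1.8]

for n, ratio in enumerate(summation_ratios(dual, atn_only, rsc_only), start=1):
    print(f"stimulation {n}: dual / sum = {ratio:.2f}")

# Hypothetical raw p-values for the 5 dual-versus-sum comparisons.
raw_p = [0.40, 0.010, 0.004, 0.008, 0.020]
print([round(p, 4) for p in sidak_adjust(raw_p)])
```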

      (5) Fig 5C: I was confused by the 2*RSC manipulation. How do we know if there is amplification unless we know what the 2*RSC stim alone looks like?

      We now label the right panel in Fig 5C as “high light intensity” or “HLI”. Increasing the activation of Chrimson increases the amplitude of the summed EPSP that now exceeds the threshold for amplification of synaptic events. Amplification refers to the shape of the plateau-like prolongation of the peak, most pronounced on the second EPSP, now indicated with an arrow.  We clarify this also in the text (line 309-310).

      (6) Fig 6D (supplement 1): Typo, "though" should be "through"

      Yes, corrected (line 1015).

      (7) Fig 6G (supplement 1): Typo, I believe this refers to the dotted area in panel F, not panel A.

      Yes, corrected (line 1021).

      (8) Fig 7: The effect of muscarine was qualitatively described in the Results, but there is no quantification and it is not shown in the Figure. The results should either be reported properly or removed from the Results.

      We remove the last sentence in the Results.

      (9) Methods: The age and sex of the mice should be reported. Transgenic mouse line should be reported (along with stock number if applicable).

      We used C57BL6 mice with transgenic background (Ai14 mice, Jax n007914 reporter line) or C57BL6 wild type mice. This is now indicated in the Methods (lines 566-567).

      (10) Methods: If the viruses are only referred to with their plasmid number, then the capsid used for the viruses should be specified. For example, I believe the AAV-CAG-tomato virus used the retroAAV capsid, which is important to the experiment.

      Thank you for pointing this out. Indeed, the AAV-CAG-tdTom virus used the retroAAV capsid (line 575).

      (11) Data/code availability: I didn't see any sort of data/code availability statement, will the data and code be made publicly available?

      Data are stored on local servers at the SPPIN, Université Paris Cité, and are made available upon reasonable request. Code for intrinsic properties analysis is available on github (https://github.com/schoki0710/Intrinsic_Properties). This information is now included (line 717-720).

      (12) Very minor (and these might be a matter of opinion), but I believe "records" should be "recordings", and "viral constructions" should be "viral constructs".

      The text had benefited from proofreading by Richard Miles, who always preferred “records” to “recordings” in his writings. We choose to keep the current wording.

      Reviewer #2 (Recommendations For The Authors):

      Below are two major points that require clarification.

      (1) In the last set of experiments presented by the authors (Figs 6 onwards) they focus on 'putative L4' PrS cells. For several lines of evidence (outlined below), I am convinced that these neurons are not presubicular, but belong to the subiculum. I think this is a major point that requires substantial clarification, in order to avoid confusion in the field (see also suggestions on how to address this comment at the end of this section).

      Several lines of evidence support the interpretation that, what the authors call 'L4 PrS neurons', are distal subicular cells:

      (1.1) The anatomical location of the retrogradely-labelled cells (from mammillary bodies injections), as shown in Figs 6B, C, and Fig. 6_1B, very clearly indicates that they belong to the distal subiculum. The subicular-to-PrS boundary is a sharp anatomical boundary that follows exactly the curvature highlighted by the authors' red stainings. The authors could also use specific subicular/PrS markers to visualize this border more clearly - e.g. calbindin, Wfs-1, Zinc (though I believe this is not strictly necessary, since from the pattern of AD fibers, one can already draw very clear conclusions, see point 1.3 below).

      Our criteria to delimit the presubiculum are the following: First and foremost, we rely on the defining presence of antero-dorsal thalamic fibers that target specifically the presubiculum and not the neighbouring subiculum (Simonnet et al., 2017, Nassar et al., 2018, Simonnet and Fricker, 2018; Jiayan Liu et al., 2021). This provides the precise outline of the presubicular superficial layers 1 to 3. It may have been confusing to the reviewer that our slicing angle gives horizontal sections. In fact, horizontal sections are favourable to identify the layer structure of the PrS, based on DAPI staining and the variations in cell body size. The work by Ishihara and Fukuda (2016) illustrates in their Figure 12 that the presubicular layer 4 lies below the presubicular layer 3, and forms a continuation with the subiculum (Sub1). Their Figure 4 indicates with a dotted line the “generally accepted border between the (distal) subiculum and PreS”, and it runs from the proximal tip of superficial cells of the PrS toward the white matter, along the radial direction of the cortical tissue. We agree with this definition. Others have sliced coronally (Cembrowski et al., 2018) which renders a different visualization of the border region with the subiculum.

      Second, let me explain the procedure for positioning the patch electrode in electrophysiological experiments on horizontal presubicular slices. Louis Richevaux, the first author, who carried out the layer 4 cell recordings, took great care to stay very close (<50 µm) to the lower limit of the zone where the GFP labeled thalamic axons can be seen. He was extremely meticulous about the visualization under the microscope, using LED illumination, for targeting. The electrophysiological signature of layer 4 neurons with initial bursts (but not repeated bursting, in mice) is another criterion to confirm their identity (Huang et al., 2017). Post-hoc morphological revelation showed their apical dendrites, running toward the pia, sometimes crossing through the layer 3, sometimes going around the proximal tip, avoiding the thalamic axons (Figure 6D). For example the cell in Figure 6, suppl. 1 panel D, has an apical dendrite that runs through layer 3 and layer 1. 

      Third, retrograde labeling following stereotaxic injection into the LMN is another criterion to define PrS layer 4. This approach is helpful for visualization, and is based on the defining axonal projection of layer 4 neurons (Yoder and Taube, 2011; Huang et al., 2017). Due to the technical challenge to stereotaxically inject only into LMN, the resultant labeling may not be limited to PrS layer 4. We cannot entirely exclude some overflow of retrograde tracers (B) or retrograde virus (C) to the neighboring MMN. This would then lead to co-labeling of the subiculum. In the main Figure 6, panels B and C, we agree that for this reason the red labelled cell bodies likely include also subicular neurons, on the proximal side, in addition to L4 presubicular neurons. We now point out this caveat in the main text (line 324-326) and in the methods (line 591-592).

      (1.2) Consistent with their subicular location, neuronal morphologies of the 'putative L4 cells' are selectively constrained within the subicular boundaries, i.e. they do not cross to the neighboring PrS (maybe a minor exception in Figs. 6_1D2,3). By definition, a neuron whose morphology is contained within a structure belongs to that structure.

      From a functional point of view, for the HD system, the most important criterion for defining presubicular layer 4 neurons is their axonal projection to the LMN (Yoder and Taube 2011). From an electrophysiological standpoint, it is the capacity of layer 4 neurons to fire initial bursts (Simonnet et al., 2013; Huang et al., 2017). Anatomically, we note that the expectation that the apical dendrite should go straight up into layer 3 might not be a defining criterion in this curved and transitional periarchicortex. Presubicular layer 4 apical dendrites may cross through layer 3 and exit to the side, towards the subiculum (this is the red dendritic staining at the proximal end of the presubiculum, at the frontier with the subiculum; Figure 6C).

      (1.3) As acknowledged by the authors in the discussion (line 408): the PrS is classically defined by the innervation domain of AD fibers. As Figure 6B clearly indicates, the retrogradely-labelled cells ('putative L4') are convincingly outside the input domain of the AD; hence, they do not belong to the PrS.

      The reviewer is mistaken here: the deep layers 4 and 5/6 indeed do not lie in the zone innervated by the thalamic fibers (Simonnet et al., 2017; Nassar et al., 2018; Simonnet and Fricker, 2018) but still belong to the presubiculum. The presubicular deep layers are located below the superficial layers, next to, and in continuation of the subiculum. This is in agreement with work by Yoder and Taube 2011; Ishihara and Fukuda 2016; Boccara, … Witter, 2015; Peng et al., 2017 (Fig 2D); Yoshiko Honda et al., (Marmoset, Fig 2A) 2022; Balsamo et al., 2022 (Figure 2B).

      (1.4) Along with the above comment: in my view, the optogenetic stimulation experiments are an additional confirmation that the 'putative L4 cells' are subicular neurons, since they do not receive AD inputs at all (hence, they are outside of the PrS); they are instead only indirectly driven upon strong excitation of the PrS. This indirect activation is likely to occur via PrS-to-Subiculum 'back-projections', the existence of which is documented in the literature and also nicely shown by the authors (see Figure 1_1 and line 109).

      See above. Only superficial layers 1-3 of the presubiculum receive direct AD input.

      (1.5) The electrophysiological properties of the 'putative L4 cells' are consistent with their subicular identity, i.e. they show a sag current and they are intrinsically bursty.

      Presubicular layer 4 cells also show bursting behaviour and a sag current (Simonnet et al., 2013; Huang et al., 2017).

      From the above considerations, and the data provided by the authors, I believe that the most parsimonious explanation is that these retrogradely-labelled neurons (from mammillary body injections), referred to by the authors as 'L4 PrS cells', are indeed pyramidal neurons from the distal subiculum.

      We agree that the retrograde labeling is likely not limited to the presubicular layer 4 cells, and we now indicate this in the text (lines 324-326). However, the portion of retrogradely labeled neurons that lies directly below layer 3 should be considered part of the presubiculum.

      I believe this is a fundamental issue that deserves clarification, in order to avoid confusion/misunderstandings in the field. Given the evidence provided, I believe that it would be inaccurate to call these cells 'L4 PrS neurons'. However, I acknowledge the fact that it might be difficult to convincingly and satisfactorily address this issue within the framework of a revision. For example, it is possible that these 'putative L4 cells' might be retrogradely-labelled from the Medial Mammillary Body (a major subicular target) since it is difficult to selectively restrict the injection to the LMN, unless a suitable driver line is used (if available). The authors should also consider the possibility of removing this subset of data (referring to putative L4), and instead focus on the rest of the story (referring to L3)- which I think by itself, still provides sufficient advance.

      We agree with the reviewer that it is difficult to provide a satisfactory answer. To some extent, the reviewer’s comments target the nomenclature of the subicular region. This transitional region between the hippocampus and the entorhinal cortex has been notoriously ill defined, and the criteria are somewhat arbitrary for determining exactly where to draw the line. Based on the thalamic projection, presubicular layers 1-3 can now be precisely outlined, thanks to the use of viral labeling. But the presubicular layer 4 had been considered to be cell-free in early works, and termed ‘lamina dissecans’ (Boccara 2010), as the limit between the superficial and deep layers. Then it became of great interest to us and to the field, when the PrS layer 4 cells were first identified as LMN projecting neurons (Yoder and Taube 2011). This unique back-projection to the upstream region of the HD system is functionally very important, closing the loop of the Papez circuit (mammillary bodies - thalamus - hippocampal structures).

      We note that the reviewer does not doubt our results, but rather questions the naming conventions. We therefore maintain our data. We agree that in the future a genetically defined mouse line would help to better pin down this specific neuronal population.

      We thank the reviewer for sharing their concerns and giving us the opportunity to clarify our experimental approach to target the presubicular layer 4. We hope that these explanations will be helpful to the readers of eLife as well.

      (2) The PrS anatomy could be better clarified, especially in relation to its modular organization (see e.g. Preston-Ferrer et al., 2016; Ray et al., 2017; Balsamo et al., 2022). The authors present horizontal slices, where cortical modularity is difficult to visualize and assess (tangential sections are typically used for this purpose, as in classical work from e.g. barrel cortex). I am not asking the authors to validate their observations in tangential sections, but just to be aware that cortical modules might not be immediately (or clearly) apparent, depending on the section orientation and thickness. The authors state that AD fibers were 'not homogeneously distributed' in L3 (line 135) and refer to 'patches of higher density in deep L3' (line 136). These statements are difficult to support unless more convincing anatomy and  . I see some L3 inhomogeneity in the green channel in Fig. 1G (last two panels) and also in Fig. 1K, but this seems to be rather upper L3. I wonder how consistent the pattern is across different injections and at what dorsoventral levels this L3 modularity is observed (I think sagittal sections might be helpful). If validated, these observations could point to the existence of non-homogeneous AD innervation domains in L3 - hinting at possible heterogeneity among the L3 pyramidal cell targets. Notably, modularity in L2 and L1 is not referred to. The authors state that AD inputs 'avoid L2' (line 131) but this statement is not in line with recent work (cited above) and is also not in line with their anatomy data in Fig. 1G, where modularity is already quite apparent in L2 (i.e. there are territories avoided by the AD fibers in L2) and in L1 (see for example the last image in Fig. 1G). This is the case also for the RSC axons (Fig. 1H) where a patchy pattern is quite clear in L1 (see the last image in panel H). Higher-mag pictures might be helpful here. 
      These qualitative observations imply that AD and RSC axons probably bear a precise structural relationship relative to each other, and relative to the calbindin patch/matrix PrS organization that has been previously described. I am not asking the authors to address these aspects experimentally, since the main focus of their study is on L3, where RSC/AD inputs largely converge. Better anatomy pictures would be helpful, or at least a better integration of the authors' (qualitative) observations within the existing literature. Moreover, the authors' calbindin staining in Fig. 1K is not particularly informative. Subicular, PaS, MEC, and PrS borders should be annotated, and higher-resolution images could be provided. The authors should also check the staining: MEC appears to be blank but is known to strongly express calb1 in L2 (see 'island' by Kitamura et al., Ray et al., Science 2014; Ray et al., Frontiers 2017). As additional validation for the staining: I would expect that the empty L2 patches in Fig. 1G (last two panels) would stain positive for Calbindin, as in previous work (Balsamo et al. 2022).

      We now provide a new figure showing the pattern of AD innervation in PrS superficial layers 1 to 3, with different dorso-ventral levels and higher magnification (Figure 2). Because our work was aimed at identifying connectivity between long-range inputs and presubicular neurons, we chose to work with horizontal sections that preserve well the majority of the apical dendrites of presubicular pyramidal neurons. We feel it is enriching for the presubicular literature to show the cytoarchitecture from different angles and to show patchiness in horizontal sections. The non-homogeneous AD innervation domains (‘microdomains’) in L3 were consistently observed across different injections in different animals.

      Author response image 1.

      Thalamic fiber innervation pattern. A, ventral, and B, dorsal horizontal section of the Presubiculum containing ATN axons expressing GFP. Patches of high density of ATN axonal ramifications in L3 are indicated as “ATN microdomains”. Layers 1, 2, 3, 4, 5/6 are indicated. C, High magnification image (63x optical section; different animal).

      We also provide a supplementary figure with images of horizontal sections of calbindin staining in PrS, with a larger crop, for the reviewer to check (Figure 3, see below). We thank the reviewer for pointing out recent studies using tangential sections. Our results agree with the previous observation that AD axons are found in calbindin negative territories (cf Fig 1K). Calbindin+ labeling is visible in the PrS layer 2 as well as in some patches in the MEC (Figure 3 panel A). Calbindin staining tends to not overlap with the territories of ATN axonal ramification. We indicate the inhomogeneities of anterior thalamic innervation that form “microdomains” of high density of green labeled fibers, located in layer 1 and layer 3 (Figure 3, Panel A, middle). Panel B shows another view of a more dorsal horizontal section of the PrS, with higher magnification, with a big Calbindin+ patch near the parasubiculum.

      The “ATN+ microdomains” possess a high density of axonal ramifications from ATN, and have been previously documented in the literature. They are consistently present. Our group had shown them in the article by Nassar et al., 2018, at different dorsoventral levels (Fig 1 C (dorsal) and 1D (ventral) PrS). See also Simonnet et al., 2017, Fig 2B, for an illustration of the typical variations in densities of thalamic fibers, and supplementary Figure 1D. Also Jiayan Liu et al., 2021 (Figure 2 and Fig 5) show these characteristic microzones of dense thalamic axonal ramifications, with more or less intense signals across layers 1, 2, and 3.  While it is correct that thalamic axons can be seen to cross layer 2 to ramify in layer 1, we maintain that AD axons typically do not ramify in layer 2. We modify the text to say, “mostly” avoiding L2 (line 130).

      The reviewer is correct in pointing out that the 'patches of higher density in deep L3' are not only in the deep L3, as in the first panel in Fig 1G, but in the more dorsal sections they are also found in the upper L3. We change the text accordingly (line 135-136) and we provide the layer annotation in Figure 1G. We further agree with the reviewer that RSC axons also present a patchy innervation pattern. We add this observation in the text (line 144).

      It is as yet unclear whether anatomical microzones of dense ATN axon ramifications in L3 might fulfill the criteria of a functional modularity, as is the case for the calbindin patch/matrix PrS organization (Balsamo et al., 2022). As the reviewer points out, this will require more information on the precise structural relationship of AD and RSC axons relative to each other, as well as functional studies. Interestingly, we note a degree of variation in the amplitudes of oEPSCs from different L3 neurons (Fig. 2F, discussion line 420; 428), which might be a reflection of the local anatomo-functional micro-organization.

      Minor points:

      (1) The pattern of retrograde labelling, or at least the way it is referred to in the results (lines 104ff), seems to imply some topography of AD-to-PrS projections. Is it the case? How consistent are these patterns across experiments, and individual injections? Was there variability in injection sites along the dorso-ventral and possibly antero-posterior PrS axes, which could account for a possibly topographical AD-to-PrS input pattern? It would be nice to see a DAPI signal in Fig. 1B since the AD stands out quite clearly in DAPI (Nissl) alone.

      Yes, we find a consistent topography for the AD-to-PrS projection, for similar injection sites in the presubiculum. The coordinates for retrograde labeling were, as indicated, -4.06 (AP), 2.00 (ML) and -2.15 mm (DV), so we cannot report on possible variations for different injection sites.

      (2) Fig. 2_2KM: this figure seems to show the only difference the authors found between AD and RS input properties. The authors could consider moving these data into main Fig. 2 (or exchanging them with some of the panels in F-O, which instead show no difference between AD and RSC). Asterisks/stats significance is not visible in M.

      For space reasons we leave the panels of Fig. 2_2KM in the supplementary section. We increased the size of the asterisk in M.

      (3) The data in Fig. 1_1 are quite interesting, since some of the PrS projection targets are 'non-canonical'. Maybe the authors could consider showing some injection sites, and some fluorescence images, in addition to the schematics. Maybe the authors could acknowledge that some of these projection targets are 'putative' unless independently verified by e.g. retrograde labeling. Unspecific white matter labelling and/or spillover is always a potential concern.

      We now include the image of the injection site for data in Fig. 1_1 as a supplementary Fig. 1_2. The Figure 1_1 shows the retrogradely labeled upstream areas of Presubiculum.

      Author response image 2.

      Retrobeads were injected in the right Presubiculum.

      (4) The authors speculate that the near-coincident summation of RS + AD inputs in L3 cells could be a potential mechanism for the binding of visual + HD information in PrS. However, landmarks are learned, and learning typically implies long-term plasticity. As the authors acknowledge in the discussion (lines 493ff) GluR1 is not expressed in PrS cells. What alternative mechanisms could the authors envision? How could the landmark-update process occur in PrS, if it is not locally stored? RSC could also be involved (Jakob et al) as acknowledged in the introduction - the authors should keep this possibility open also in the discussion.

      A similar point has been raised by Reviewer 1, please check our answer to their point 2. Briefly, our results indicate that HD-to-landmark updating is a multi-step process. RSC may be one of the places where landmarks are learned. The subsequent temporal mapping of HD to landmark signals in PrS might be plasticity-free, as matching directional with visual landmark information based on temporal coincidence does not necessarily require synaptic plasticity. It seems likely that there is no local storage and no change in synaptic weights in PrS. The landmark-anchored HD signals reach LMN via L4 neurons, sculpting network dynamics across the Papez circuit. One possibility is that the trace of a landmark that matches HD may be stored as patterns of neural activity that could guide navigation (cf. El-Gaby et al., 2024, Nature). Clearly, more work is needed to understand how the HD attractor is updated on a mechanistic level. Recent work in prefrontal cortex mentions “activity slots” and delineates algorithms for dynamic control of cognitive maps without synaptic plasticity (Whittington et al., 2025, Neuron): information may be stored in neural attractor activity, and the idea that working memory may rely on recurrent updates of neural activity might generalize to the HD system. We include these considerations in the discussion (line 499-503; 523-533) and also point to alternative models (line 518-522) including modeling work in the retrosplenial cortex.

      (5) The authors state that (lines 210ff) their cluster analysis 'provided no evidence for subpopulations of layer 3 cells (but see Balsamo et al., 2022)' implying an inconsistency; however, Balsamo et al also showed that the (in vivo) ephys properties of the two HD cell 'types' are virtually identical, which is in line with the 'homogeneity' of L3 ephys properties (in slice) in the authors' data. Regarding the possible heterogeneity of L3 cells: the authors report inhomogeneous AD innervation domains in L3 (see also main comment 2) and differences in input summation (some L3 cells integrate linearly, some supra-linearly; lines 272) which by itself might already imply some heterogeneity. I would therefore suggest rewording the statements to clarify what the lack of heterogeneity refers to.

      We agree. In line 212 we now state “cluster analysis (Figure 2D) provided no evidence for subpopulations of layer 3 cells in terms of intrinsic electrophysiological properties (see also Balsamo et al., 2022).”

      (6) n=6 co-recorded pairs are mentioned at line 348, but n=9 at line 366. Are these numbers referring to the same dataset? Please correct or clarify.

      Line 349 refers to a set of 6 co-recorded pairs (n=12 neurons) in double injected mice with Chronos injected in ATN and Chrimson in RSC (cf. Fig. 7E). The 9 pairs mentioned in line 367 refer to another type of experiment where we stimulated layer 3 neurons by depolarizing them to induce action potential firing while recording neighboring layer 4 neurons to assess connectivity. Line 367  now reads: “In n = 9 paired recordings, we did not detect functional synapses between layer 3 and layer 4 neurons.”

      Reviewer #3 (Recommendations For The Authors):

      Questions for the authors/points for addressing:

      I found that the slice electrophysiology experiments were not reported with sufficient detail. For example, in Figure 2, I am assuming that the voltage clamp experiments were carried out using the Cs-based recording solution, while the current clamp experiments were carried out using the K-Gluc intracellular solution. However, this is not explicitly stated and it is possible that all of these experiments were performed using the K-Gluc solution, which would give slightly odd EPSCs due to incomplete space/voltage clamp. Furthermore, the method states that gabazine was used to block GABA(A) receptor-mediated currents, but not when this occurred. Was GABAergic neurotransmission blocked for all measurements of EPSC magnitude/dynamics? If so, why not block GABA(B) receptors? If not blocking GABAergic transmission for measuring EPSCs, why not? This should be stated explicitly either way.

      The addition of drugs or difference of solution is indicated in the figure legend and/or in the figure itself, as well as in the methods. We now state explicitly: “In a subset of experiments, the following drugs were used to modulate the responses to optogenetic stimulations; the presence of these drugs is indicated in the figure and figure legend, whenever applicable.” (line 632). A Cs-based internal solution and gabazine were used in Figure 5, this is now indicated in the Methods section (line 626). All other experiments were performed using K-Gluc as an internal solution and ACSF.

      Methods: The experiments involving animals are incompletely reported. For example, were both sexes used? The methods state "Experiments were performed on wild‐type and transgenic C57Bl6 mice" - what transgenic mice were used and why is this not reported in detail (strain, etc)? I would refer the authors to the ARRIVE guidelines for reporting in vivo experiments in a reproducible manner (https://arriveguidelines.org/).

      We now added this information in the methods section, subsection “Animals” (line 566-567). Animals of both sexes were used. The only transgenic mouse line used was the Ai14 reporter line (no phenotype), depending on the availability in our animal facility.

      For experiments comparing ATN and RSC inputs onto the same neuron (e.g. Figure 2 supplement 2 G - J), are the authors certain that the observed differences (e.g. rise time and paired-pulse facilitation on the ATN input) are due to differences in the synapses and not a result of different responses of the opsins? Refer to https://pubmed.ncbi.nlm.nih.gov/31822522/ from Jess Cardin's lab. This could easily be tested by switching which opsin is injected into which nucleus (a fair amount of extra work) or comparing the Chrimson synaptic responses with those evoked using Chronos on the same projection, as used in Figure 2 (quite easy as authors should already have the data).

      We actually did switch the opsins across the two injection sites. In Figure 2 - supplement 2G-J, the values linked by a dashed line result from recordings in the switched configuration with respect to the original configuration (in full lines, Chronos injected in RSC and Chrimson in ATN). The values from switched configuration followed the trend of the main configuration and were not statistically different (Mann-Whitney U test).

      Statistical reporting: While the number of cells is generally reported for experiments, the number of slices and animals is not. While slice ephys often treat cells as individual biological replicates, this is not entirely appropriate as it could be argued that multiple cells from a single animal are not independent samples (some sort of mixed effects model that accounts for animals as a random effect would be better). For the experiments in the manuscript, I don't think this is necessary, but it would certainly reassure the reader to report how many animals/slices each dataset came from. At a bare minimum, one would want any dataset to be taken from at least 3 animals from 2 different litters, regardless of how many cells are in there.

      Our slice electrophysiology experiments include data from 38 successfully injected animals: 14 animals injected in ATN, 20 animals injected in RSC, and 4 double injected animals. Typically, we recorded 1 to 3 cells per slice. We now include this information in the text or in the figure legends (line 159, 160, 297, 767, 826, 831, 832, 839, 845, 901, 941).
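
      The reporting point above (cells nested within slices and animals) can be illustrated with a minimal sketch. It is not the authors' analysis code: the animal counts are taken from the response, whereas the recording values, the animal identifiers and the helper `per_animal_means` are hypothetical and only show how cell-level measurements could be summarized per animal before counting replicates.

```python
from collections import defaultdict
from statistics import mean

# Animal counts as stated in the response (ATN-injected, RSC-injected, double-injected).
animals = {"ATN": 14, "RSC": 20, "double": 4}
total_animals = sum(animals.values())

# HYPOTHETICAL recordings: (animal id, response amplitude). Values are illustrative only.
recordings = [
    ("m1", 100.0), ("m1", 120.0),
    ("m2", 80.0),
    ("m3", 150.0), ("m3", 130.0), ("m3", 140.0),
]

def per_animal_means(records):
    """Group cell-level values by animal and return one mean value per animal."""
    by_animal = defaultdict(list)
    for animal_id, value in records:
        by_animal[animal_id].append(value)
    return {animal_id: mean(values) for animal_id, values in by_animal.items()}

animal_means = per_animal_means(recordings)
n_cells = len(recordings)       # number of recorded cells
n_animals = len(animal_means)   # number of independent animals

print(f"{total_animals} animals in total; example dataset: "
      f"{n_cells} cells from {n_animals} animals")
```

      Reporting both `n_cells` and `n_animals` for each dataset lets the reader judge how independent the samples are, without requiring a full mixed-effects model.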

      For the optogenetic experiments looking at the summation of EPSPs (e.g. figure 4), I have two questions: why were EPSPs measured and not EPSCs? The latter would be expected to give a better readout of AMPA receptor-mediated synaptic currents. And secondly, why was 20 Hz stimulation used for these experiments? One might expect theta stimulation to be a more physiologically-relevant frequency of stimulation for comparing ATN and RSC inputs to single neurons, given the relevance with spatial navigation and that the paper's conclusions were based around the head direction system. Similarly, gamma stimulation may also have been informative. Did the authors try different frequencies of stimulation?

      Question 1. The current clamp configuration allows us to measure EPSP amplification/prolongation by NMDA or persistent Na currents (cf. Fricker and Miles 2000), which might contribute to supralinearity.

      Question 2. In a previous study from our group about the AD to PrS connection (Nassar et al., 2018), no significant difference was observed in the dynamics of EPSCs between stimulations at 10 Hz versus 30 Hz. Therefore we chose 20 Hz. This value is in the range of HD cell firing (Taube 1995, 1998: peak firing rates of 18 to 24 spikes/sec in RSC and 41 spikes/sec in AD, although mean firing rates might be lower; Blair and Sharp 1995). In hindsight, we agree that it would have been useful to include 8 Hz or 40 Hz stimulations.

      The GABA(A) antagonist experiments in Figure 5 are interesting but I have concerns about the statistical power of these experiments - n of 3 is absolutely borderline for being able to draw meaningful conclusions, especially if this small sample of cells came from just 1 or 2 animals. The number of animals used should be stated and/or caution should be applied when considering the potential mechanisms of supralinear summation of EPSPs. It looks like the slight delay in RSC input EPSP relative to ATN that was in earlier figures is not present here - could this be the loss of feedforward inhibition?

      The current clamp experiments in the presence of QX314 and a Cs gluconate based internal solution were preceded by initial experiments using puff applications of glutamate to the recorded neurons (not shown). Results from those experiments had pointed towards a role for TTX resistant sodium currents and for NMDA receptor activation as a factor favoring the amplification and prolongation of glutamate induced events. They inspired the design of the dual wavelength stimulation experiments shown in Figure 5, and oriented our discussion of the results. We agree of course that more work is required to dissect the role of disinhibition for EPSP amplification. This is however beyond the present study.

      Concerning the EPSP onset delays following RSC input stimulation:  In this set of experiments, we compensated for the notoriously longer delay to EPSP onset, following RSC axon stimulation, by shifting the photostimulation (red) of RSC fibers to -2 ms, relative to the onset of photostimulation of ATN fibers (blue). This experimental trick led to an improved  alignment of the onset of the postsynaptic response, as shown in the figure below for the reviewer.

      Author response image 3.

      In these experiments, the onset of RSC photostimulation was advanced by 2 ms (to -2 ms relative to the onset of ATN photostimulation), in an attempt to better align the EPSP onset to the one evoked by ATN stimulation.

      We insert in the results a sentence to indicate that experiments illustrated in Figure 5 were performed in only a small sample of 3 cells that came from 2 mice (line 297), so caution should be applied. In the discussion we now phrase this more carefully: “From a small sample of cells it appears that EPSP amplification may be facilitated by a reduction in synaptic inhibition (n = 3; Figure 5)” (line 487).

      Figure 7: I appreciate the difficulties in making dual recordings from older animals, but no conclusion about the RSC input can legitimately be made with n=1.

      Agreed. We want to avoid any overinterpretation, and point out in the results section that the RSC stimulation data is from a single cell pair. The sentence now reads: “... layer 4 neurons occurred after firing in the layer 3 neuron, following ATN afferent stimuli, in 4 out of 5 cell pairs. We also observed this sequence when RSC input was activated, in one tested pair.” (line 347-349)

      Minor points:

      Line 104: 'within the two subnuclei that form the anterior thalamus' - the ATN actually has three subdivisions (AD, AV, AM) so this should state 'two of the three nuclei that form the anterior thalamus...'

      Corrected, line 103

      Line 125: should read "figure 1F" and not "figure 2F".

      Corrected, line 124

      Line 277-280: Why were two different posthoc tests used on the same data in Figures 3E & F?

      We used Sidak’s multicomparison test to compare each event Sum vs. Dual (two different configurations at each time point - asterisks) and Friedman’s and Dunn’s to compare the nth EPSP amplitude to the first one for Dual events (same configuration between time points - hashmarks). We give two-way ANOVA results in the legend.
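
      For readers less familiar with the Sum vs. Dual comparison, the idea can be sketched as follows. This is not the authors' analysis: the function names, the example amplitudes and the 10% tolerance are assumptions chosen for illustration. The sketch compares the EPSP measured during dual stimulation ("Dual") with the arithmetic sum of the EPSPs evoked by each input alone ("Sum").

```python
def summation_ratio(dual_mv, atn_mv, rsc_mv):
    """Ratio of the measured dual-input EPSP to the arithmetic sum of single-input EPSPs."""
    expected_sum = atn_mv + rsc_mv
    if expected_sum <= 0:
        raise ValueError("single-input EPSP amplitudes must sum to a positive value")
    return dual_mv / expected_sum

def classify_summation(ratio, tolerance=0.1):
    """Label the summation; the tolerance band is an arbitrary illustrative choice."""
    if ratio > 1 + tolerance:
        return "supralinear"
    if ratio < 1 - tolerance:
        return "sublinear"
    return "linear"

# HYPOTHETICAL amplitudes in mV, for illustration only.
example_linear = classify_summation(summation_ratio(4.0, 2.0, 2.0))
example_supra = classify_summation(summation_ratio(6.0, 2.0, 2.0))
print(example_linear, example_supra)
```

      In practice, whether a deviation from 1 is meaningful depends on the statistical tests described in the response, not on a fixed threshold.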

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      In this study, the authors identify an insect salivary protein that participates in the initial viral infection of the plant host. They found that a salivary protein, LssaCA, promotes RSV infection by interacting with OsTLP, which can degrade callose in plants. Furthermore, RSV NP binds to LssaCA in salivary glands to form a complex, which then binds to OsTLP to promote degradation of callose.

      The story focuses on a tripartite virus-insect vector-plant interaction and is interesting. However, the study is too simple and poorly conducted. The conclusions are also overstated, as the findings are not solid.

      We thank the reviewer for their constructive feedback. We have conducted additional experiments to strengthen our results and conclusions as detailed below:

      (1) The comparison between vector inoculation and microinjection involves multiple confounding factors that could affect the experimental results, including salivary components, RSV inoculation titers, and the precision of viral deposition. The differential outcomes could be attributed to these various factors rather than definitively demonstrating the necessity of salivary factors. Therefore, we have removed this comparison from the revised manuscript and instead focused on elucidating the specific mechanisms by which LssaCA facilitates viral infection.

      (2) We conducted new experiments to assess the function of LssaCA enzymatic activity in mediating RSV infection. Additional experiments revealed that OsTLP enzymatic activity is highly pH-dependent, with increased activity as pH decreases from 7.5 to 5.0 (Fig. 3H). However, the LssaCA-OsTLP interaction at pH 7.4 significantly enhanced OsTLP enzymatic activity without requiring pH changes. These results demonstrate that LssaCA-OsTLP protein interactions are crucial for mediating RSV infection. In contrast to pH-dependent mechanisms, our study demonstrated that LssaCA's biological function in mediating RSV infection is at least partially, if not completely, independent of its enzymatic activity. We have added these new results to the revised manuscript (Lines 220-227). We have also added a comprehensive discussion comparing the aphid CA mechanism described by Guo et al. (2023 doi.org/10.1073/pnas.2222040120) with our findings in the revised manuscript (Lines 350-371).

      (3) We have repeated the majority of the callose deposition experiments, providing clearer images (Figures 5-6). In addition to aniline blue staining, we quantified callose concentrations using a plant callose ELISA kit to provide more precise measurements (Figure 5A, I, 6A, C and S8A). We utilized RT-qPCR to measure callose synthase expression in both feeding and non-feeding areas, confirming that callose synthesis was induced specifically in feeding regions, leading to localized callose deposition (Figures 5D-G and S8B-E). For sieve plate visualization, we examined longitudinal sections, which revealed callose deposition in sieve plates during SBPH feeding and RSV infection (Figure S7).

      (4) We generated OsTLP mutant rice seedlings (ostlp) and used this mutant to directly demonstrate that LssaCA mediates callose degradation in planta through enhancement of OsTLP enzymatic activity (Lines 288-302 and Figure 6).

      (5) We produced LssaCA recombinant proteins in sf9 cells to ensure full enzymatic activity and constructed a comprehensive CA mutant protein, in which all seven residues constituting the enzymatic active center were mutated (LssaCA<sup>H111D</sup>, LssaCA<sup>N139H</sup>, LssaCA<sup>H141D</sup>, LssaCA<sup>H143D</sup>, LssaCA<sup>E153H</sup>, LssaCA<sup>H166D</sup>, LssaCA<sup>T253E</sup>) (Fig. S1B). This LssaCA mutant protein demonstrated complete loss of enzymatic activity (Fig. 1C).

      Major comments:

      (1) The key problem is how long LssaCA functions in the rice plant. The authors declared that LssaCA had no effect on initial viral infection, but rather on infection after viral inoculation. It is unreasonable to conclude that LssaCA promoted viral infection based on data in which insects inoculated the plants for just 2 days, yet the viral titer was increased at 14 days post-feeding. How could salivary proteins, which reached the phloem 12-14 days earlier, induce enough TLP to degrade callose and promote virus infection? This is difficult to believe.

      We appreciate your insightful comment and acknowledge that our initial description may have been unclear. We agree that salivary proteins would not be present in plant tissues for two weeks post-feeding or post-injection. Our intention was to clarify that when salivary proteins enhance RSV infection, this initial enhancement leads to sustained high viral loads. We measured viral burden at 14 days post-feeding or post-injection because this is the common measurement time point when viral titers are sufficiently high for reliable detection by qRT-PCR or western blotting. We have clarified this rationale in the revised manuscript (Lines 155-157).

      To determine the actual persistence of LssaCA in plant tissues, we conducted additional experiments where insects were allowed to feed on a defined area of rice seedlings for two days. We then monitored LssaCA protein levels at 1 and 3 days after removing the insects. Western blotting analysis revealed that LssaCA protein levels decreased post-feeding but remained detectable at 3 days post-feeding. These results are presented in Figure 2H and described in detail in Lines 184-193.

      (2) Lines 110-116 and Fig. 1, the results of viruliferous insect feeding and microinjection with purified virus could not conclude the saliva factor necessary of RSV infection, because these two tests are not in parallel and comparable. Microinjection with salivary proteins combined with purified virus is comparable with microinjection with purified virus.

      We thank the reviewer for this insightful comment. We agree that “the results of viruliferous insect feeding and microinjection with the purified virus could not conclude the saliva factor necessary of RSV infection”. However, due to the technical difficulty in collecting sufficient quantities of salivary proteins to conduct the microinjection experiment, we have removed these results from the revised manuscript.

      (3) The second problem is how many days after viruliferous insect feeding and microinjection with purified virus the authors measured viral titers. In the Methods section, the authors stated that viral titers were detected at 7-14 days post microinjection. Please state the exact days.

      We thank the reviewer for this insightful comment. We typically measured RSV infection levels at both 7 and 14 days post-microinjection. However, since the midrib microinjection experiments have been removed from the revised manuscript, this methodology has also been removed accordingly.

      (4) The last problem is how the authors made sure that the viral titers in the salivary glands of insects were equal between the two experiments that produced different phenotypes in rice plants. If they were not, different viral titers in the salivary glands would of course cause different phenotypes in rice plants.

      We thank the reviewer for this comment. When we examined the effects of LssaCA deficiency on RSV infection of rice plants, we also compared the viral titers in the insect saliva and salivary glands. The results indicated that the virus titers in both tissues were not changed by LssaCA deficiency, suggesting that the amounts of virus inoculated into rice phloem by insects of the different treatments were comparable. Please refer to Figures 2D-G and Lines 161-173 of the revised manuscript.

      (5) The callose deposition in phloem can be induced by insect feeding. In Fig. 5H, why was the callose deposition increased in the whole vascular bundle, rather than specifically in the phloem? Could the transgenic rice plant directionally express the protein in the phloem? In Fig. 5, why was callose deposition detected at 24 h after insect feeding? In Fig. 6A, why was callose deposition decreased in the phloem, but not in all the cells of the TLP OE plant? Also in Fig. 6A and B, data on the expression of callose synthase genes are required.

      We thank the reviewer for these insightful comments.

      (1) Figure 5. Callose deposition increased in multiple cell types within the vascular bundle, including sieve tubes, parenchyma cells, and companion cells. While callose deposition was detected in other parts of the vascular bundle, no significant differences were observed between treatments in these regions, indicating that in response to RSV infection and other treatments, altered callose deposition mainly occurred in phloem cells. Please refer to the revised Figures 5B, 5J, 6B, and 6D.

      (2) Transgenic plant expression. The OsTLP-overexpressing transgenic rice plants express TLP proteins in various cells under the control of the CaMV 35S promoter, rather than being directionally expressed in the phloem. However, since TLP proteins are secreted, they are potentially transported and concentrated in the phloem where they can degrade callose.

      (3) Figure 5. The 24-hour time point for callose deposition detection was selected based on established protocols from previous studies. According to Hao et al. (Plant Physiology 2008), callose deposition increased during the first 3 days of planthopper infestation and decreased after 4 days. Additionally, Ellinger and Voigt (Ann Bot 2014) demonstrated that callose visualization typically begins 18-24 hours after treatment, making 24 hours an optimal detection time point.

      (4) Figure 6, Phloem-specific changes. Similar to Figure 5, while callose deposition was detected in other parts of the vascular bundle, significant differences between treatments were mainly observed in phloem cells, indicating that RSV infection specifically affects callose deposition in phloem tissue.

      (5) Callose synthase gene expression. We performed RT-qPCR analysis to measure the expression levels of callose synthase genes. The results indicated that OsTLP overexpression did not significantly alter the mRNA levels of these genes, regardless of RSV infection status in SBPH.

      Reviewer #2 (Public Review):

      There is increasing evidence that viruses manipulate vectors and hosts to facilitate transmission. For arthropods, saliva plays an essential role for successful feeding on a host and consequently for arthropod-borne viruses that are transmitted during arthropod feeding on new hosts. This is so because saliva constitutes the interaction interface between arthropod and host and contains many enzymes and effectors that allow feeding on a compatible host by neutralizing host defenses. Therefore, it is not surprising that viruses change saliva composition or use saliva proteins to provoke altered vector-host interactions that are favorable for virus transmission. However, detailed mechanistic analyses are scarce. Here, Zhao and coworkers study transmission of rice stripe virus (RSV) by the planthopper Laodelphax striatellus. RSV infects plants as well as the vector, accumulates in salivary glands and is injected together with saliva into a new host during vector feeding.

      The authors present evidence that a saliva-contained enzyme - carbonic anhydrase (CA) - might facilitate virus infection of rice by interfering with callose deposition, a plant defense response. In vitro pull-down experiments, yeast two-hybrid assays and binding affinity assays convincingly show an interaction between CA and a plant thaumatin-like protein (TLP) that degrades callose. Similar experiments show that CA and TLP interact with the RSV nuclear capsid protein NT to form a complex. Formation of the CA-TLP complex increases TLP activity by roughly 30% and integration of NT increases TLP activity further. This correlates with lower callose content in RSV-infected plants and higher virus titer. Further, silencing CA in vectors decreases virus titers in infected plants.

      (1) Interestingly, aphid CA was found to play a role in plant infection with two non-persistent non-circulative viruses, turnip mosaic virus and cucumber mosaic virus (Guo et al. 2023 doi.org/10.1073/pnas.2222040120), but the proposed mode of action is entirely different.

      We appreciate the reviewer’s insightful comment and have carefully examined the cited publication. The study by Guo et al. (2023) elucidates a distinct mechanism for aphid-mediated transmission of non-persistent, non-circulative viruses (turnip mosaic virus and cucumber mosaic virus). In their model, aphid-secreted CA-II in the plant cell apoplast leads to H<sup>+</sup> accumulation and localized acidification. This triggers enhanced vesicle trafficking as a plant defense response, inadvertently facilitating virus translocation from the endomembrane system to the apoplast.

      In contrast to these pH-dependent mechanisms, our study demonstrated that LssaCA’s biological function in mediating RSV infection is, if not completely, at least partially independent of its enzymatic activity. We performed additional experiments to reveal that OsTLP enzymatic activity is highly pH-dependent and exhibits increased enzymatic activity as pH decreases from 7.5 to 5.0 (Fig. 3H); however, the LssaCA-OsTLP interaction occurring at pH 7.4 significantly enhanced OsTLP enzymatic activity without any change in buffer pH (Fig. 3G). These results demonstrate the crucial importance of LssaCA-OsTLP protein interactions, rather than enzymatic activity alone, in mediating RSV infection.

      We have incorporated these new experimental results and added a comprehensive discussion comparing the aphid CA mechanism described by Guo et al. (2023) with our findings in the revised manuscript. Please refer to Figures 3G-H, Lines 220-227 and 350-371 for detailed information.

      (2) While this is an interesting work, there are, in my opinion, some weak points. The microinjection experiments result in much lower virus accumulation in rice than infection by vector inoculation, so their interpretation is difficult.

      We acknowledge the reviewer's concern regarding the lower virus accumulation observed in microinjection experiments compared to vector-mediated inoculation, and we have removed these experiments from the revised manuscript. To address the core question raised by these experiments, we have conducted new experiments demonstrating that LssaCA-OsTLP protein-protein interactions, rather than enzymatic activity alone, are crucial in mediating RSV infection. Additionally, we have incorporated a comprehensive discussion examining carbonic anhydrase activity, pH homeostasis, and viral infection. Please refer to the detailed experimental results and discussion in the sections mentioned in our previous response (Figures 3G-H, Lines 220-227 and 350-371).

      (3) Also, the effect of injected recombinant CA protein might fade over time because of degradation or dilution.

      We appreciate the reviewer’s insightful comment. This is indeed a valid concern that could affect the interpretation of microinjection results. To address the temporal dynamics of CA protein presence in planta, we conducted time-course experiments to monitor the retention of naturally SBPH-secreted CA proteins in rice plants. Our analysis at 1 and 3 days post-feeding (dpf) revealed that CA protein levels decreased progressively following SBPH feeding, but could still be detected at 3 dpf (Fig. 2H). Please refer to Figure 2H and Lines 184-193 for detailed information.

      (4) The authors claim that enzymatic activity of CA is not required for its proviral activity. However, this is difficult to assess because all CA mutants used for the corresponding experiments possess residual activity.

      We appreciate the reviewer’s insightful comment. We constructed a comprehensive CA mutant protein in which all seven residues constituting the enzymatic active center were mutated (LssaCA<sup>H111D</sup>, LssaCA<sup>N139H</sup>, LssaCA<sup>H141D</sup>, LssaCA<sup>H143D</sup>, LssaCA<sup>E153H</sup>, LssaCA<sup>H166D</sup>, LssaCA<sup>T253E</sup>) (Fig. S1B). This LssaCA mutant protein demonstrated complete loss of enzymatic activity (Fig. 1C). However, since we have removed the recombinant CA protein microinjection experiments from the revised manuscript, we lack sufficient direct evidence to definitively demonstrate that CA enzymatic activity is dispensable for its proviral function. To address the core question raised by these experiments, we have conducted new experiments that provide direct evidence for the importance of LssaCA-OsTLP protein-protein interactions in mediating RSV infection. Additionally, we have incorporated a comprehensive discussion examining carbonic anhydrase activity, pH homeostasis, and viral infection. Please refer to the detailed experimental results and discussion in the sections mentioned in our previous response (Figures 3G-H, Lines 220-227 and 350-371).

      (5) It remains also unclear whether viral infection deregulates CA expression in planthoppers and TLP expression in plants. However, increased CA and TLP levels could alone contribute to reduced callose deposition.

      We compared LssaCA mRNA levels in RSV-free and RSV-infected L. striatellus salivary glands, which indicated that RSV infection does not significantly affect LssaCA expression (Figure 1J). By allowing RSV-free and RSV-infected L. striatellus to feed on rice seedlings, we showed that RSV infection does not affect TLP expression in plants (Figure 5H).

      Reviewer #1: (Recommendations For The Authors):

      Other comments:

      (1) Most data proving viral infection and LssaCA expression were derived from qPCR assays. Western blot data are strongly required to prove the change at the protein level.

      We agree that western blot data are required to prove the change at the protein level. In the revised manuscript, we have added western-blotting results (Figures 1F, 1I, 2C, 2J, and S6).

      (2) Line 145, data that LssaCA was significantly downregulated should be shown.

      Thank you. The data have been added to the revised manuscript. Please refer to Line 165 and Figure 2D.

      (3) Lines 159-161, how did the authors ensure that the dose of recombinant LssaCA was close to the level released during insect feeding, and not excessive? How did the authors exclude the possibility that the upregulated RSV titer was caused by excessive recombinant LssaCA?

      We appreciate this important concern regarding dosage controls. Because microinjection of recombinant proteins typically yields viral infection levels significantly lower than those achieved through natural insect feeding, higher protein concentrations are often required to achieve high viral infection levels. In this experiment, we compared RSV infection levels following microinjection of BSA+RSV versus LssaCA+RSV, with the expectation that any observed upregulation in RSV titer would be specifically attributable to recombinant LssaCA rather than excessive protein dosing. However, given the low RSV infection levels observed with viral microinjection, we have removed the corresponding results from the revised manuscript.

      (4) Lines 124-125, recombinantly expressed LssaCA protein should be underlined, but not the LssaCA protein itself.

      We have clearly distinguished recombinantly expressed LssaCA from endogenous LssaCA protein throughout the manuscript, ensuring that all references to recombinant proteins are properly labeled as such.

      (5) LssaCA expression in salivary glands of viruliferous and nonviruliferous insects is required. LssaCA accumulation in rice plant exposed to viruliferous and nonviruliferous insects is also required.

      We have measured LssaCA mRNA levels in salivary glands of viruliferous and nonviruliferous insects (Figure 1J), and protein levels in rice plant exposed to viruliferous and nonviruliferous insects (Figure 1I).

      (6) Fig. 4G, the enzymatic activities of OsTLP were too low compared with that in Fig. 4E and Fig. 7E. Why did the enzymatic activities of the same protein show so obvious difference?

      We apologize for the error in Fig. 4G. The original data presented relative fold changes between OsTLP+BSA and OsTLP+LssaCA treatment, with OsTLP+BSA normalized to 1.0 and OsTLP+LssaCA values expressed as fold changes relative to this baseline. However, the Y-axis was incorrectly labeled as “β-1,3-glucanase (units mg<sup>-1</sup>)”, which suggested absolute enzymatic activity values. We have now corrected the figure (revised Figure 3G) to display the actual absolute enzymatic activity values with the appropriate Y-axis label “β-1,3-glucanase (units mg<sup>-1</sup>)”.
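      The distinction between the two quantities can be sketched as follows. This is only an illustration: the function names and all numbers are hypothetical and are not taken from the manuscript. Absolute specific activity carries units (units mg<sup>-1</sup>), whereas a fold change relative to the OsTLP+BSA baseline is dimensionless and sets the baseline to 1.0.

```python
def specific_activity(units: float, protein_mg: float) -> float:
    """Absolute specific activity, in enzyme units per mg of protein."""
    if protein_mg <= 0:
        raise ValueError("protein_mg must be positive")
    return units / protein_mg


def fold_change(treatment: float, baseline: float) -> float:
    """Dimensionless ratio of a treatment value to its baseline value."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return treatment / baseline


# Hypothetical measurements, chosen only for illustration.
baseline = specific_activity(units=2.0, protein_mg=0.5)  # e.g. OsTLP+BSA
treated = specific_activity(units=2.6, protein_mg=0.5)   # e.g. OsTLP+LssaCA

# The baseline divided by itself is 1.0 by construction, so fold-change
# values must not be plotted on an axis labeled "units per mg".
print("absolute (units/mg):", baseline, treated)
print("relative (fold):", fold_change(baseline, baseline),
      fold_change(treated, baseline))
```

      Plotting the second pair of values on an axis labeled in units mg<sup>-1</sup> would make the activity appear far lower than in panels reporting absolute values, which is the kind of apparent discrepancy the reviewer noticed.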

      (7) Fig. 7E, was the LssaCA + NP and LssaCA + GST quantified?

      Yes, all proteins were quantified, and enzymatic activity values were calculated and expressed as units per milligram of protein (units mg<sup>-1</sup>).

      Minor comments:

      (1) The keywords: In fact, the LssaCA functioned during initial viral infection in plant, but not viral horizontal transmission.

      We appreciate the reviewer’s insightful comment. We have revised the manuscript title to “Rice stripe virus utilizes an Laodelphax striatellus salivary carbonic anhydrase to facilitate plant infection by direct molecular interaction” and changed the keyword from “viral horizontal transmission” to “viral infection of plant”.

      (2) Fig. 2A, how about testes? Was this data derived from female insects? Fig. 2C, is the saliva collected from nonviruliferous insects? Fig. 2E, what is the control?

      We appreciate the reviewer’s insightful comments.

      (1) Fig. 2A: The data represent the mean and SD calculated from three independent experiments, with 5 tissue samples per experiment. Since 3<sup>rd</sup> instar nymphs were used for feeding experiments in this study, we also used 3<sup>rd</sup> instar RSV-free nymphs to measure gene expression in guts, salivary glands and fat bodies. R-body represents the remaining body after removing these tissues. Female insects were used to measure gene expression in ovaries, and gene expression in testes has also been added. We have added this necessary information to the revised manuscript (please refer to new Figure 1F and Lines 402-403).

      (2) Fig. 2C: Yes, saliva was collected from nonviruliferous insects.

      (3) Fig. 2E: The control consisted of 100 mM PBS, as described in the experimental section (Lines 643-644): “A blank control consisted of 2 mL of 100 mM PBS (pH 7.0) mixed with 1 mL of 3 mM p-NPA.” In the revised manuscript, we recombinantly expressed LssaCA and its mutant proteins in both Sf9 cells and E. coli. Therefore, we have used the mutant proteins as controls to demonstrate specific enzymatic activity. Please refer to Figure 1C, Lines 115-122 and 621-635 for detailed information.

      (3) Some figure labeling appeared unprofessional. For example, "a-RSV", "loading" in Fig. 1, "W-saliva", "G-saliva" in Fig. 2, and so on, the related explanations were absent.

      We appreciate the reviewer’s insightful comments. We have thoroughly reviewed all figures to ensure professional labeling. Specifically, we have:

      (1) Used proper protein names to label western blots and clearly explained the antibodies used for protein detection.

      (2) Provided comprehensive explanations for all abbreviations used in figures within the corresponding figure legends.

      (3) Ensured consistent and clear labeling throughout all figures.

      Please refer to the revised Figures 1-3 for these corrections.

      (4) Lines 83-84, please cite references on callose preventing viral movement. I do not think the present references were relevant.

      We have added a more relevant reference (Yue et al., 2022, Line 82), which revealed that palmitoylated γb promotes virus cell-to-cell movement by interacting with NbREM1 to inhibit callose deposition at plasmodesmata.

      (5) The background of transgenic plants of OsTLP OE should be characterized. And the overexpression of OsTLP should be shown. Which generation of OsTLP OE did authors use?

      The genetic background of the OsTLP OE transgenic plants and the generation used are described in the “Materials and methods” section (Lines 782-786) and mentioned in the main text (Line 214). T<sup>2</sup> lines were selected for further analysis (Line 789).

      (6) Fig. 5A, the blank, which derived from plants without exposure to insect, was absent.

      We appreciate the reviewer’s insightful comments. We have added the non-fed control in the revised Figure 5A-C.

      (7) Fig. 7A, the nonviruliferous insects were required to serve as a control.

      Immunofluorescence localization of RSV and LssaCA in uninfected L. striatellus salivary glands has been added to the revised manuscript (Figure S2).

      (8) The manuscript needs English language editing.

      The manuscript has undergone comprehensive English language editing to improve clarity, grammar, and overall readability.

      Reviewer #2 (Recommendations For The Authors):

      (1) The first experiment compares vector inoculation vs microinjection of RSV in tissue. I am not sure that your claim (saliva factors are necessary for inoculation) holds, because the vector injects RSV directly into the phloem, whereas microinjection is less precise and you cannot control where exactly the virus is deposited. However, virus deposited in tissues other than the phloem might not replicate, and indeed you observe, compared to natural vector inoculation, highly reduced virus titers.

      We appreciate the reviewer’s insightful comments. We agree that the comparison between vector inoculation and microinjection involves multiple confounding factors that could affect the experimental results, including salivary components, RSV inoculation titers, and the precision of viral deposition. As the reviewer correctly points out, the differential outcomes could be attributed to these various factors rather than definitively demonstrating the necessity of salivary factors. Therefore, we have removed this comparison from the revised manuscript and instead focused on elucidating the specific mechanisms by which LssaCA facilitates viral infection.

      (2) Next the authors show that a carbonic anhydrase (CA) that they previously detected in saliva is functional and secreted into rice. I assume this is done with non-infected insects, but I did not find the information. Silencing the CA reduces virus titers in inoculated plants at 14 dpi, but not in infected planthoppers. At 1 dpi, there is no difference in RSV titer in plants inoculated with CA silenced planthoppers or control hoppers. To see a direct effect of CA in virus infection, purified virus is injected together with a control protein or recombinant CA into plants. At 14 dpi, there is about double as much virus in the CA-injected plants, but compared to authentic SBPH inoculation, titers are 20,000 times lower. Actually, I believe it is not very likely that the recombinant CA is active or present so long after initial injection.

      We appreciate the reviewer’s insightful comments.

      (1) Our previous study identified the CA proteins from RSV-free insects. We have added this information to the revised manuscript (Line 110).

      (2) We acknowledge the reviewer's concern regarding the lower virus accumulation observed in microinjection experiments compared to vector-mediated inoculation. We have removed these experiments from the revised manuscript and instead focused on elucidating the specific mechanisms by which LssaCA facilitates viral infection.

      (3) We did not intend to suggest that LssaCA proteins persisted for 14 days post-injection. We measured viral titers at 14 days post-feeding or post-injection because this is the common measurement time point when viral titers are sufficiently high for reliable detection by RT-qPCR or western blotting. We have clarified this rationale in the revised manuscript (Lines 155-157). To determine the actual persistence of LssaCA in plant tissues, we monitored LssaCA protein levels at 1 and 3 dpf. Western blotting analysis revealed that LssaCA protein levels decreased post-feeding but remained detectable at 3 dpf. These results are presented in Figure 2H and described in detail in Lines 184-193.

      (3) Then the authors want to know whether CA activity is required for its proviral action and single amino acid mutants covering the putative active CA site are created. The recombinant mutant proteins have 30-70 % reduced activity, but none of them has zero activity. When microinjected together with RSV into plants, RSV replication is similar to that after injection with wild type CA. Since no knock-out mutant with zero activity is used, it is difficult to judge whether CA activity is unimportant for viral replication, as the authors claim.

      We appreciate the reviewer’s insightful comment. We constructed a comprehensive CA mutant protein in which all seven residues constituting the enzymatic active center were mutated (LssaCA<sup>H111D</sup>, LssaCA<sup>N139H</sup>, LssaCA<sup>H141D</sup>, LssaCA<sup>H143D</sup>, LssaCA<sup>E153H</sup>, LssaCA<sup>H166D</sup>, LssaCA<sup>T253E</sup>) (Fig. S1B). This LssaCA mutant protein demonstrated complete loss of enzymatic activity (Fig. 1C). However, since we have removed the recombinant CA protein microinjection experiments from the revised manuscript, we lack sufficient direct evidence to definitively demonstrate that CA enzymatic activity is dispensable for its proviral function. To address the core question raised by these experiments, we have conducted new experiments that provide direct evidence for the importance of LssaCA-OsTLP protein-protein interactions in mediating RSV infection. Additionally, we have incorporated a comprehensive discussion examining carbonic anhydrase activity, pH homeostasis, and viral infection. Please refer to the detailed experimental results and discussion in the sections mentioned in our previous response (Figures 3G-H, Lines 220-227 and 350-371).

      (4) Next a yeast two hybrid assay reveals interaction with a thaumatin-like rice protein (TLP). It would be nice to know whether you detected other interacting proteins as well. The interaction is confirmed by pulldown and binding affinity assay using recombinant proteins. The kD is in favor of a rather weak interaction between the two proteins.

      We have added a list of rice proteins that potentially interact with LssaCA (Table S1) and have measured interactions with additional proteins (unpublished data). Despite the relatively weak binding affinity, the functional significance of the LssaCA-OsTLP interaction in enhancing TLP enzymatic activity is substantial.

      (5) Then the glucanase activity of TLP is measured using recombinant TLP-MBP or in vivo expressed TLP. It is not clear to me which TLP is used in Fig. 4G (plant-expressed or bacteria-expressed). If it is plant-expressed TLP, why is its basic activity 10 times lower than in Fig. 4F?

      The original Fig. 4G is now Fig. 3G in the revised manuscript. An E. coli-expressed TLP protein was used. We apologize for the error in our original Fig. 4G. The original data presented relative fold changes between OsTLP+BSA and OsTLP+LssaCA treatment, with OsTLP+BSA normalized to 1.0 and OsTLP+LssaCA values expressed as fold changes relative to this baseline. However, the Y-axis was incorrectly labeled as “β-1,3-glucanase (units mg<sup>-1</sup>)”, which suggested absolute enzymatic activity values. We have now corrected the figure to display the actual absolute enzymatic activity values with the appropriate Y-axis label “β-1,3-glucanase (units mg<sup>-1</sup>)”.

      (6) There is also a discrepancy in the construction of the transgenic rice plants: did you use TLP without signal peptide or full length TLP? If you used TLP without signal peptide, you should explain why, because the wild type TLP contains a signal peptide.

      We cloned the full-length OsTLP gene including the signal peptide sequence (Line 782 in the revised manuscript).

      (7) The authors find that CA increases glucanase activity of TLP. Next the authors test callose deposition by aniline blue staining. Feeding activity of RSV-infected planthoppers induces more callose deposition than does feeding by uninfected insects. In the image (Fig. 5A) I see blue stain all over the cell walls of xylem and phloem cells. Is this what the authors expect? I would have expected rather a patchy pattern of callose deposition on cell walls. Concerning sieve plates, I cannot discern any in the image; they are easier to visualize in longitudinal sections than in transversal section as presented here.

      We appreciate the reviewer’s insightful comment.

      (1) Callose deposition pattern: While callose deposition was detected in other parts of the vascular bundle, significant differences between treatments were mainly observed in phloem cells, indicating that phloem-specific callose deposition is the primary response to RSV infection and SBPH feeding (Figures 5B and 5J).

      (2) Sieve plate visualization: We have examined longitudinal sections to visualize sieve plates, which revealed callose deposition in sieve plates during SBPH feeding and RSV infection (Figure S7).

      (3) Quantitative analysis: In addition to aniline blue staining, we quantified callose concentrations using a plant callose ELISA kit to provide more precise measurements (Figure 5A, 5I and S8A).

      (4) Gene expression analysis: We utilized RT-qPCR to measure callose synthase expression in both feeding and non-feeding areas, confirming that callose synthesis was induced specifically in feeding regions, leading to localized callose deposition (Figures 5D-H).

      These experimental results collectively demonstrate that RSV infection induces enhanced callose synthesis and deposition, with this response occurring primarily in phloem cells, including sieve plates, within feeding sites and their immediate vicinity.

      (8) I do not quite understand how you quantified callose deposition (arbitrary areas?) with ImageJ. Please indicate in detail the analysis method.

      We have added more detailed information for the methods to quantify callose deposition (Lines 673-678).

      (9) More callose content is also observed by a callose ELISA assay of tissue extracts and supported by increased expression of glucanase synthase genes. Did you look whether expression of TLP is changed by feeding activity and RSV infection? Silencing CA in planthoppers increases callose deposition, which is in line with the observation that CA increases TLP activity.

      We measured OsTLP expression following feeding by RSV-free or RSV-infected SBPH and found that gene expression was not significantly affected by either insect feeding or RSV infection. These results have been added to the revised manuscript (Lines 275-277 and Figure 5H).

      (10) Next, callose is measured after feeding of RSV-infected insects on wild type or TLP-overexpressing rice. Less callose deposition (after 2 days) and more virus (after 14 days) is observed in TLP overexpressors. I am missing a control in this experiment, that is feeding of uninfected insects on wild type or TLP overexpressing rice, where I would expect intermediate callose levels.

      We appreciate the reviewer’s insightful comment and fully agree with the prediction. In the revised manuscript, we have constructed ostlp mutant plants and conducted additional experiments to further clarify how callose deposition is regulated by insect feeding, RSV infection, LssaCA levels, and OsTLP expression. Specifically: 

      (1) Both SBPH feeding and RSV infection induce callose deposition, with RSV-infected insect feeding resulting in significantly higher callose levels compared to RSV-free insect feeding (Fig. 5A-C).

      (2) LssaCA enhances OsTLP enzymatic activity, thereby promoting callose degradation (Fig. 5I-K).

      (3) OsTLP-overexpressing (OE) plants exhibit lower callose levels than wild-type (WT) plants, while ostlp mutant plants show higher callose levels than WT (Fig. 6A-B).

      (4) In ostlp knockout plants, LssaCA no longer affects callose levels, indicating that OsTLP is required for LssaCA-mediated regulation of callose (Fig. 6C-D).

      These additional data address the reviewer’s concern and support the conclusion that OsTLP plays a central role in modulating callose levels in response to RSV infection and insect feeding.

      (11) Next the authors test for interaction between virions and CA. Immunofluorescence shows that RSV and CA colocalize in salivary glands; in my opinion, there is partial and not complete colocalization (Fig. 7A).

      We agree with the reviewer’s observation. CA is primarily produced in the small lobules of the principal salivary glands, while RSV infects nearly all parts of the salivary glands. In regions where RSV and CA colocalize within the principal glands, the CA signal appears sharper than that of RSV, likely due to the relatively higher abundance of CA compared to RSV in these areas. This may explain the partial, rather than complete, colocalization observed in our original Figure 7A. In the revised manuscript, please refer to Figure 1A.

      (12) Pulldown experiments with recombinant RSV NP capsid protein and CA confirm interaction, binding affinity assays indicate rather weak interaction between CA and NP. Likewise in pull-down experiments, interaction between NP, CA and TLP is shown. Finally, in vitro activity assays show that activity of preformed TLP-CA complexes can be increased by adding NP; activity of TLP alone is not shown.

      We performed two independent experiments to confirm the influence on TLP enzymatic activity by LssaCA or by the LssaCA-RSV NP complex. In the first experiment, we compared the enhancement of TLP activity by LssaCA using TLP alone as a control (Figure 3G). In the second experiment examining the LssaCA-RSV NP complex effect on TLP activity, we used the LssaCA-TLP combination as the baseline control rather than TLP alone (Figure 4B), since we had already established the LssaCA enhancement effect in the previous experiment.

      (13) For all microscopic acquisitions, you should indicate the exact acquisition conditions, especially excitation and emission filter settings, kind of camera used and objectives. Use of inadequate filters or of a black & white camera could for example be the reason why you observe a homogeneous cell wall label in the aniline blue staining assays. Counterstaining cell walls with propidium iodide might help distinguish between cell wall and callose label.

      Thank you for your insightful suggestions. We have added the detailed information to the revised manuscript (Lines 656-659 and 673-678).

      (14) You should provide information whether CA is deregulated in infected planthoppers, as this could also modify its mode of action.

      We have compared LssaCA mRNA levels in RSV-free and RSV-infected L. striatellus salivary glands. The results indicated that RSV infection does not significantly affect LssaCA expression (Figure 1J).

      (15) You should show purity of the proteins used for affinity binding measurements.

      We have included SDS-PAGE results of purified proteins in the revised manuscript (Figure S3).

      (16) L 39: Not all arboviruses are inoculated into the phloem.

      Thank you. We have revised this description (Lines 40, 73, 95 and 97).

      (17) L 76: Watery saliva is also injected in epidermis and mesophyll cells.

      Thank you. We have revised this description (Line 73).

      (18) L 79: What do you mean by "avirulent gene"?

      Thank you for your valuable comments. We have revised this description to “certain salivary effectors may be recognized by plant resistance proteins to induce effector-triggered immunity”. Please refer to Lines 76-77 for details.

      (19) L 128: Please add delivery method.

      Thank you. We have added the delivery methods (Line 134).

      (20) L 195: Please explain "MST".

      Explained (Line 124). Thank you.

      (21) L 203: Please add the plant species overexpressing TLP.

      Added (Line 214). Thank you.

      (22) L 213: Callose deposition has also a role against phloem-feeding insects.

      We appreciate the reviewer’s insightful comment. We have added this information to the revised manuscript (Line 252).

      (23) L 626: What is a "mutein"?

      "mutein" is an abbreviation for mutant proteins. Since the recombinant protein microinjection experiments have been removed from the revised manuscript, the term “mutein” has also been removed. For all other instances, we now use the full term “mutant proteins”.

      (24) Fig. 1E: what is "loading"? You should rather show here and elsewhere (or add to supplement) complete protein gels and Western blot membranes and not only bands of interest.

      Thank you for your valuable suggestion. Although Figure 1E has been removed from the revised manuscript, we have carefully reviewed all figures to ensure that the term “loading” has been replaced with the specific protein names where appropriate.

      (25) Fig. 2C: Please indicate which is the blot and which is the silver stained gel and add mass markers in kDa to the silver stained gel.

      Thank you for your suggestion. We have revised the figure to include labeled silver-stained gels with indicated molecular weight markers (Figure 1H in the revised manuscript).

    1. Reviewer #2 (Public review):

      In this paper, Hamid et al present 40 genomes from the Faroe Islands. They use these data (a pilot study for an anticipated larger-scale sequencing effort) to discuss the population genetic diversity and history of the sample, and the Faroes population. I think this is an overall solid paper; it is overall well-polished and well-written. It is somewhat descriptive (as might be expected for an explorative pilot study), but does make good use of the data.

      The data processing and annotation follows a state-of-the-art protocol, and at least I could not find any evidence in the results that would pinpoint towards bioinformatic issues having substantially biased some of the results, and at least preliminary results lead to the identification of some candidate disease alleles, showing that small, isolated cohorts can be an efficient way to find populations with locally common, but globally rare disease alleles.

      I also enjoyed the population structure analysis in the context of ancient samples, which gives some context to the genetic ancestry of Faroese, although it would have been nice if that could have been quantified, and it is unfortunate that the sampling scheme effectively precludes within-Faroes analyses.

      I am unfortunately quite critical of the selection analysis, both on a statistical level and, more importantly, I do not believe it measures what the authors think it does.

      Major comments:

      (1) Admixture timing/genomic scaling/localization:

      As the authors lay out, the Faroes were likely colonized in the last 1,000-1,500 years, i.e., 40-60 generations ago. That means most genomic processes that have happened on the Faroese should have signatures that are on the order of ~1-2cM, whereas more local patterns likely indicate genetic history predating the colonization of the islands. Yet, the paper seems to be oblivious to this (to me) fascinating and somewhat unique premise. Maybe this thought is wrong, but I think the authors miss a chance here to explain why the reader should care beyond the fact that the small populations might have high-frequency risk alleles and the Faroes are intrinsically interesting, but more importantly, it also makes me think it leads to some misinterpretations in the selection analysis.

      (2) ROH:

      Would the sampling scheme impact ROH? How would it deal with individuals with known parental coancestry? As an example of what I mean by my previous comment, 1MB is short enough in that I would expect most/many 1MB ROH-tracts to come from pedigree loops predating the colonization of the Faroes. (i.e., I am actually quite surprised that there isn't much more long ROH, which makes me wonder if that would be impacted by the sampling scheme).

      (3) Selection scan:

      We are talking about a bottlenecked population that is recently admixed (Faroese), compared to a population (GBR) putatively more closely related to one of its sources. My guess would be that selection in such a scenario would be possibly very hard to detect, and even then, selection signals might not differentiate selection in Faroese vs. GBR, but rather selection/allele frequency differences between different source populations. I think it would be good to spell out why XP-EHH/iHS measures selection at the correct time scale, and how/if these statistics are expected to behave differently in an admixed population.

      (4) Similarly, for the discussion of LCT, I am not convinced that the haplotypes depicted here are on the right scale to reflect processes happening on the Faroes. Given the admixture/population history, it at the very least should be discussed in the context of whether the 13910 allele frequency on the Faroes is at odds with what would be expected based on the admixture sources.

      (5) I am lacking information to evaluate the procedure for turning the outliers into p-values. Both iHS and XP-EHH are ratio statistics, meaning they might be heavy-tailed if one is not careful, and the central limit theorem may not apply. It would be much easier (and probably sufficient for the points being made here) to reframe this analysis in terms of empirical outliers.

      (6) Oldest individual predating gene flow: It seems impossible to make any statements based on a single individual. Why is it implausible that this person (or their parents), e.g., moved to the Faroes within their lifetime and died there?

    2. Author response:

      We thank the reviewers for their thoughtful comments and constructive suggestions. We describe how we will address each point below and are grateful for the guidance on areas where our work could be clarified or expanded. In particular, we note the following:

      Selection scan summary statistics: In our revised manuscript, we will include summary statistics from the selection scans. We believe this addition will enhance transparency and provide additional context for readers.

      Reporting of outliers: As highlighted by the editor, the reviewers expressed differing views on the most appropriate way to report outliers. To provide a comprehensive and balanced presentation, we will report both the empirical selection statistics and the corresponding converted p-values. This dual approach will allow readers to fully interpret the results under both perspectives.

      Methodological considerations: We have carefully considered the reviewers' methodological suggestions and will incorporate them into our revisions where possible. These changes strengthen the rigor and clarity of the analyses.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The paper reports an analysis of whole-genome sequence data from 40 Faroese. The authors investigate aspects of demographic history and natural selection in this population. The key findings are that the Faroese (as expected) have a small population size and are broadly of Northwest European ancestry. Accordingly, selection signatures are largely shared with other Northwest European populations, although the authors identify signals that may be specific to the Faroes. Finally, they identify a few predicted deleterious coding variants that may be enriched in the Faroes.

      Strengths:

      The data are appropriately quality-controlled and appear to be of high quality. Some aspects of the Faroese population history are characterized, in particular, by the relatively (compared to other European populations) high proportion of long runs of homozygosity, which may be relevant for disease mapping of recessive variants. The selection analysis is presented reasonably, although as the authors point out, many aspects, for example differences in iHS, can reflect differences in demographic history or population-specific drift and thus can't reliably be interpreted in terms of differences in the strength of selection.

      Weaknesses:

      The main limitations of the paper are as follows:

      (1) The data are not available. I appreciate that (even de-identified) genotype data cannot be shared; however, that does substantially reduce the value of the paper. Minimally, I think the authors should share summary statistics for the selection scans, in line with the standard of the field.

      We agree with the reviewer that sharing the selection scan results is important, so in the next revision of this manuscript we will make the selection scan summary statistics publicly available, and clearly lay out the guidelines and research questions for which the data can be accessed.

      (2) The insight into the population history of the Faroes is limited, relative to what is already known (i.e., they were settled around 1200 years ago, by people with a mixture of Scandinavian and British ancestry, have a small effective population size, and any admixture since then comes from substantially similar populations). It's obvious, for example, that the Faroese population has a smaller bottleneck than, say, GBR.

      More sophisticated analyses (for example, ARG-based methods, or IBD or rare variant sharing) would be able to reveal more detailed and fine-scale information about the history of the populations that is not already known. PCA, ADMIXTURE, and HaploNet analysis are broad summaries, but the interesting questions here would be more specific to the Faroes, for example, what are the proportions of Scandinavian vs Celtic ancestry? What is the date and extent of sex bias (as suggested by the uniparental data) in this admixture? I think that it is a bit of a missed opportunity not to address these questions.

      We clarify that we did quantify the proportions of various ancestry components as estimated by HaploNet in main text Figure 5 and supplemental figures S5 and S6. In our revisions, we will include the average global ancestry of the various components in the Main Text so that this result is more clear.

      We agree that more fine-scale demographic analyses would be informative. We have begun working on an estimation of the admixture date, for example, but have encountered problems with using different standard date estimation software, which give very inconsistent and unstable results. We suspect this might be due to the strong bottleneck experienced in the history of the Faroe Islands breaking one or more of the assumptions of these methods. We will continue working on this problem in coming months, possibly using simulations to assess where the problem might be. We recognize that our relatively small sample size places limits on the fine-scale demographic analyses that can be performed. We are addressing this in ongoing work by generating a larger cohort, which we hope will enable more detailed inference in the future.

      (3) I don't really understand the rationale for looking at HLA-B allele frequencies. The authors write that "ankylosing spondylitis (AS) may be at a higher prevalence in the Faroe Islands (unpublished data), however, this has not been confirmed by follow-up epidemiological studies". So there's no evidence (certainly no published evidence) that AS is more prevalent, and hence nothing to explain with the HLA allele frequencies?

      We agree that no published studies have confirmed a higher prevalence of ankylosing spondylitis (AS) in the Faroe Islands. Our recruitment data suggest that AS might be more common than in other European populations, but we understand that this is only based on limited, unpublished observations and what we are hearing from the community. We emphasized in our original manuscript that this is based on observational evidence from the FarGen project. However, as this reviewer pointed out, we can be more clear that this prevalence has not been formally studied.

      In our next revision we will clarify in the text that our recruitment data suggest a higher prevalence of AS may be possible, but more formal epidemiological studies are needed to confirm this observation. The reason we study HLA-B allele frequencies is to see if the genetic background of the Faroese population could help explain this possible difference, since HLA-B27 is already known to play a strong role in AS.

      Reviewer #2 (Public review):

      In this paper, Hamid et al present 40 genomes from the Faroe Islands. They use these data (a pilot study for an anticipated larger-scale sequencing effort) to discuss the population genetic diversity and history of the sample, and the Faroes population. I think this is an overall solid paper; it is overall well-polished and well-written. It is somewhat descriptive (as might be expected for an explorative pilot study), but does make good use of the data.

      The data processing and annotation follows a state-of-the-art protocol, and at least I could not find any evidence in the results that would pinpoint towards bioinformatic issues having substantially biased some of the results, and at least preliminary results lead to the identification of some candidate disease alleles, showing that small, isolated cohorts can be an efficient way to find populations with locally common, but globally rare disease alleles.

      I also enjoyed the population structure analysis in the context of ancient samples, which gives some context to the genetic ancestry of Faroese, although it would have been nice if that could have been quantified, and it is unfortunate that the sampling scheme effectively precludes within-Faroes analyses.

      We note that although the ancestry proportions are not specified in the main text, we did quantify ancestry proportions in the modern Faroese individuals and other ancient samples, and we visualized these proportions in Figure 5 and Supplementary Figures S5 and S6. As stated in our response to Reviewer #1, in our revisions, we will more clearly state the average global ancestry of the various components in the Main Text.

      I am unfortunately quite critical of the selection analysis, both on a statistical level and, more importantly, I do not believe it measures what the authors think it does.

      Major comments:

      (1) Admixture timing/genomic scaling/localization:

      As the authors lay out, the Faroes were likely colonized in the last 1,000-1,500 years, i.e., 40-60 generations ago. That means most genomic processes that have happened on the Faroese should have signatures that are on the order of ~1-2cM, whereas more local patterns likely indicate genetic history predating the colonization of the islands. Yet, the paper seems to be oblivious to this (to me) fascinating and somewhat unique premise. Maybe this thought is wrong, but I think the authors miss a chance here to explain why the reader should care beyond the fact that the small populations might have high-frequency risk alleles and the Faroes are intrinsically interesting, but more importantly, it also makes me think it leads to some misinterpretations in the selection analysis

      See response to point #3

      (2) ROH:

      Would the sampling scheme impact ROH? How would it deal with individuals with known parental coancestry? As an example of what I mean by my previous comment, 1MB is short enough in that I would expect most/many 1MB ROH-tracts to come from pedigree loops predating the colonization of the Faroes. (i.e, I am actually quite surprised that there isn't much more long ROH, which makes me wonder if that would be impacted by the sampling scheme).

      The sampling scheme was designed to choose 40 Faroese individuals that were representative of the different regions and were minimally related. There were no pairs of third-degree relatives or closer (pi-hat > 0.125) in either the Faroese cohort or the reference populations. It is possible that this sampling scheme would reduce the amount of longer ROHs in the population, but we should still be able to see overall patterns of ROH reflective of bottlenecks in the past tens of generations. Additionally, based on this reviewer's earlier comment, 1 Mb ROHs would still be relevant to demographic events in the last 40-60 generations given that on average 1 cM corresponds to 1 Mb in humans, though we recognize that is not an exact conversion.

      That said, the “sum total amount of the genome contained in long ROH” as we described in the manuscript includes all ROHs greater than 1Mb. Although we group all ROHs longer than 1Mb into one category in the current manuscript, we can look more specifically at the distribution of the longer ROH in future revisions and add discussion into what this might tell us about the timing of bottlenecks. 

      For now, we share a plot of the distribution in ROH lengths across all individuals for each cohort. As this plot shows, there certainly are ROHs longer than 1Mb in the Faroese cohort, and on average there is a higher proportion of long ROH particularly in the 5-15 Mb range in the Faroese cohort relative to the other cohorts.

      Author response image 1.
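      The length scales discussed in this exchange can be checked with a short calculation. The sketch below assumes the standard approximation that a segment transmitted through m meioses has an expected genetic length of about 100/m cM, that an ROH tracing to a common ancestor g generations back involves roughly 2g meioses, and the rough 1 cM per Mb conversion quoted in the response. The function names are illustrative and do not come from the manuscript.

```python
def expected_segment_cm(meioses: int) -> float:
    """Expected genetic length (cM) of a segment spanning `meioses` meioses.

    Assumes crossovers follow a Poisson process at 1 per Morgan per meiosis,
    so segment lengths are exponential with mean 100 / meioses cM.
    """
    if meioses <= 0:
        raise ValueError("meioses must be positive")
    return 100.0 / meioses


def roh_length_cm(generations: int) -> float:
    # Both haplotypes trace back to the shared ancestor: about 2 * g meioses.
    return expected_segment_cm(2 * generations)


def admixture_tract_cm(generations: int) -> float:
    # An ancestry tract has been recombining for about g meioses.
    return expected_segment_cm(generations)


if __name__ == "__main__":
    cm_per_mb = 1.0  # rough human average quoted in the response (assumption)
    for g in (40, 60):
        roh = roh_length_cm(g)
        tract = admixture_tract_cm(g)
        print(f"g={g}: ROH ~{roh:.2f} cM (~{roh / cm_per_mb:.2f} Mb), "
              f"admixture tract ~{tract:.2f} cM")
```

      Under these assumptions, 40-60 generations correspond to expected ROH lengths of about 0.8-1.25 cM and admixture tracts of about 1.7-2.5 cM, which is the ~1-2 cM scale the reviewer refers to.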

      (3) Selection scan:

      We are talking about a bottlenecked population that is recently admixed (Faroese), compared to a population (GBR) putatively more closely related to one of its sources. My guess would be that selection in such a scenario would be possibly very hard to detect, and even then, selection signals might not differentiate selection in Faroese vs. GBR, but rather selection/allele frequency differences between different source populations. I think it would be good to spell out why XP-EHH/iHS measures selection at the correct time scale, and how/if these statistics are expected to behave differently in an admixed population.

      The reviewer brings up good points about the utility of classical selection statistics in populations that are admixed or bottlenecked, and whether the timescale at which these statistics detect selection is relevant for understanding the selective history of the Faroese population. We break down these concerns separately.

      (1) Bottlenecks: Recent bottlenecks result in higher LD within a population. However, demographic events such as bottlenecks affect global genomic patterns while positive selection is expected to affect local genomic patterns. For this reason, iHS and XP-EHH statistics are standardized against the genome-wide background, to account for population-specific demographic history.

      (2) Admixture: The term “admixture” has different interpretations depending on the line of inquiry and the populations being studied. Across various time and geographic scales, all human populations are admixed to some degree, as gene flow between groups is a common fixture throughout our history. For example, even the modern British population has “admixed” ancestry from North / West European sources as well, dating to at least as recently as the Medieval & Viking periods (Gretzinger et al. 2022, Leslie et al. 2015), yet we do not commonly consider it an “admixed” population, and we are not typically concerned about applying haplotype-based statistics in this population. This is due to the low divergence between the source populations. In the case of the Faroe Islands, we believe admixture likely occurred on a similar timescale. We see low variance in ancestry proportions estimated by HaploNet, both from the historical Faroese individuals (250BP) and the modern samples. This indicates admixture predating the settlement of the Faroe Islands, where recombination has had time to break up long ancestry tracts and the global ancestry proportions have reached an equilibrium. That is, these ancestry patterns suggest that the modern Faroese are most likely descended from already admixed founders. We mention this as a likely possibility in the main text: “This could have occurred either via a mixture of the original “West Europe” ancestry with individuals of predominantly “North Europe” ancestry, or a by replacement with individuals that were already of mixed ancestry at the time of arrival in the islands (the latter are not uncommon in Viking Age mainland Europe).” And, as with the case of the British population, the closely-related ancestral sources for the Faroese founders were likely not so diverged as to have differences in allele frequencies and long-range haplotypes that would disrupt signals of selection from iHS or XP-EHH.

      (3) Time scale: It is certainly possible, and in fact likely, that iHS measures selection older than the settlement of the Faroe Islands. In our manuscript, we calculated iHS in both the Faroese and the closely related British cohort, and we highlight in the Main Text that the top signals, with the exception of LCT, are shared between the two cohorts, indicative of selection that began prior to the population split. iHS is a commonly calculated statistic, and it is often calculated in a single population without comparing to others, so we feel it is important to show our result demonstrating these shared selection signals. In future revisions, we will emphasize in the main text that we are not claiming to have identified selection that occurred in the Faroese population post-settlement with the iHS statistic. As far as XP-EHH, it is a statistic designed to identify differentiated variants that are fixed or approaching fixation in one population but not others. The time-scale of selection that XP-EHH can detect would therefore be dependent on the populations used for comparison. As XP-EHH has the best power to identify alleles that are fixed or approaching fixation in one population but not others, it is less likely to detect older selection events / incomplete sweeps from the source populations.

      In our next revision, we will more clearly state limitations of these statistics under various population histories, and clarify the time-scale at which we are detecting selection for iHS vs XP-EHH.
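      The standardization against the genome-wide background mentioned under point (1) can be sketched as follows. This is a minimal illustration that assumes the common practice of z-scoring unstandardized iHS within derived-allele-frequency bins; the exact procedure used in the manuscript is not described in this exchange, and the function name and bin count are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_by_frequency_bin(freqs, raw_scores, n_bins=20):
    """Z-score raw scores within derived-allele-frequency bins.

    Returns standardized scores in input order. Members of bins with fewer
    than two values, or with zero variance, are returned as None.
    """
    if len(freqs) != len(raw_scores):
        raise ValueError("freqs and raw_scores must have the same length")

    # Group variant indices by frequency bin; a frequency of 1.0 joins the last bin.
    bins = defaultdict(list)
    for i, f in enumerate(freqs):
        bins[min(int(f * n_bins), n_bins - 1)].append(i)

    standardized = [None] * len(raw_scores)
    for members in bins.values():
        values = [raw_scores[i] for i in members]
        if len(values) < 2:
            continue
        mu, sd = mean(values), pstdev(values)
        if sd == 0:
            continue
        for i in members:
            standardized[i] = (raw_scores[i] - mu) / sd
    return standardized
```

      Because each score is compared only with variants of similar frequency drawn from the same genome, a genome-wide shift in haplotype length (for example, from a bottleneck) is absorbed into the bin mean and variance, while a locally extreme region still stands out.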

      (4) Similarly, for the discussion of LCT, I am not convinced that the haplotypes depicted here are on the right scale to reflect processes happening on the Faroes. Given the admixture/population history, it at the very least should be discussed in the context of whether the 13910 allele frequency on the Faroes is at odds with what would be expected based on the admixture sources.

      We agree that more investigation into the LCT allele frequency in the other ancient samples may provide some insight into the selection history, particularly in light of ancient admixture. Please note, we did look at the allele frequency of the LCT allele rs4988235 and stated in the main text that it was present at high frequencies in the historical (250BP) Faroese samples. The frequency of this allele in the imputed historical Faroese samples is 82% while the allele is present at ~74% frequency in modern samples. We did not report the exact percentage in the main text because the sample size of the historical samples (11 individuals) is small and coverage of ancient samples is low, leading to potential errors in imputation. However, we can try to calculate the LCT allele frequency in other ancient samples, and assuming that we have good proxies for the sources at the time of admixture, we may calculate the expected allele frequency in the admixed ancestors of the Faroese founders in the next revision.
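      The expected-frequency calculation proposed here, and the sampling uncertainty in the historical estimate, can be sketched as below. The ancestry proportions and source frequencies in the example are placeholders, not values from the manuscript; only the 82% frequency and the 11 historical individuals come from the response. The standard error assumes simple binomial sampling of 2n chromosomes and ignores imputation error.

```python
import math


def expected_admixed_frequency(proportions, source_freqs):
    """Ancestry-weighted mean allele frequency across source populations."""
    if len(proportions) != len(source_freqs):
        raise ValueError("one frequency is needed per ancestry proportion")
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("ancestry proportions must sum to 1")
    return sum(a * p for a, p in zip(proportions, source_freqs))


def binomial_se(freq, n_individuals):
    """Standard error of an allele frequency from n diploid individuals."""
    n_chromosomes = 2 * n_individuals
    return math.sqrt(freq * (1.0 - freq) / n_chromosomes)


if __name__ == "__main__":
    # Hypothetical two-source example: equal ancestry, frequencies 0.6 and 0.8.
    print(expected_admixed_frequency([0.5, 0.5], [0.6, 0.8]))
    # Historical Faroese sample from the response: 82% in 11 individuals.
    print(round(binomial_se(0.82, 11), 3))
```

      With 11 individuals the sampling standard error alone is about 8 percentage points, so the 82% historical and ~74% modern estimates lie within roughly one standard error of each other before imputation error is considered.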

      (5) I am lacking information to evaluate the procedure for turning the outliers into p-values. Both iHS and XP-EHH are ratio statistics, meaning they might be heavy-tailed if one is not careful, and the central limit theorem may not apply. It would be much easier (and probably sufficient for the points being made here) to reframe this analysis in terms of empirical outliers.

      Given that there are disagreements on the best approach to reporting selection scan results from the reviewers, in our revision, we can additionally supply both the standardized iHS / XP-EHH values in the supplementary information as well as these values transformed to p-values. As the p-values are derived from the empirical distribution, the “significant” p-values are also empirical outliers from the empirical distribution, so the conclusions of the manuscript do not change. We found that the p-value approach and controlling for FDR is more conservative, with fewer signals reaching “significance” than are considered empirical outliers based on common approaches such as IQR or arbitrary percentile cutoffs.
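      The relationship between empirical outliers and converted p-values can be illustrated with a short sketch. It assumes two-sided, rank-based empirical p-values and Benjamini-Hochberg FDR control; the manuscript's exact conversion is not specified in this exchange, so both functions are illustrative rather than a description of the authors' pipeline.

```python
from bisect import bisect_left


def empirical_p_values(scores):
    """Two-sided empirical p-value: fraction of |scores| at least as extreme."""
    abs_sorted = sorted(abs(s) for s in scores)
    n = len(abs_sorted)
    # n - bisect_left(...) counts the values >= |s|, including s itself.
    return [(n - bisect_left(abs_sorted, abs(s))) / n for s in scores]


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking hypotheses rejected at FDR alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k_max = rank  # largest rank passing the step-up criterion
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected
```

      Under this construction the p-value is a monotone transform of the absolute score, so any p-value threshold selects the same variants as some cutoff on the standardized statistic; the two reporting styles differ only in where the cutoff is placed.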

      (6) Oldest individual predating gene flow: It seems impossible to make any statements based on a single individual. Why is it implausible that this person (or their parents), e.g., moved to the Faroes within their lifetime and died there?

      We agree with the reviewer that this is a plausible explanation, and in future revisions we will update the main text to acknowledge this possibility.

    1. Abstract

      Rice (Oryza sativa) is one of the most important staple food crops worldwide, and its wild relatives serve as an important gene pool in its breeding. Compared with cultivated rice species, African wild rice (Oryza longistaminata) has several advantageous traits, such as resistance to biotic stresses, clonal propagation via rhizomes, and increased biomass production. However, previous O. longistaminata genome assemblies have been hampered by gaps and incompleteness, restricting detailed investigations into their genomes. To streamline breeding endeavors and facilitate functional genomics studies, we generated a 343-Mb telomere-to-telomere (T2T) genome assembly for this species, covering all telomeres and centromeres across the 12 chromosomes. This newly assembled genome has markedly improved over previous versions. Comparative analysis revealed a high degree of synteny with previously published genomes. A large number of structural variations were identified between O. longistaminata and O. sativa. A total of 2,466 segmentally duplicated genes were identified and enriched in cellular amino acid metabolic processes. We detected a slight expansion of some subfamilies of resistance genes and transcription factors. This newly assembled T2T genome of O. longistaminata provides a valuable resource for the exploration and exploitation of beneficial alleles present in wild relative species of cultivated rice.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf074), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Francois Sabot

      The manuscript from Guang et al. deals with a T2T assembly for the wild perennial African rice Oryza longistaminata. Using the most up-to-date technologies and approaches, the authors provide a high-quality assembly for this wild species, rendering it a valuable resource for understanding rice evolution. While the assembly results are of high quality, the interpretation of some biological results, in particular about the NBS-LRR, is quite weird, in my opinion, and needs to be refined. That's why I think the manuscript should be published, but after major corrections.

      In detail:

      - Introduction: I am not sure that highlighting the exceptional biomass of longistaminata is a good idea, as this plant has a very high silicon content, which makes its biomass complex to use.
      - Methods: We do not have access to most of the command options and command lines. Please provide them, at least as a text file in the supplementary data. In addition, some of the references for tools are missing. Finally, please provide the accession number of the assembled plant.
      - Assembly itself: O. longistaminata is an outcrossing, heterozygous organism. Did you obtain the two haplotypes?
      - Comparison with the previous longistaminata genome: is the inversion in the middle of Chr6 specific, or due to an error in the previous assembly?
      - Table 1: what do you mean by "Total size of assembled genomes (bp) 331,045,917"? What is the residual percentage of N?
      - Figure 1 and others: please show the legend in another way; here we may mix it up with the main text. In addition, check the legends for spelling and the size of the figures (e.g., 3b) for legibility.
      - SyRI/MUMmer analysis: did you set the minimum size at 1 kb? What was the order of query vs. reference? Can we have a BED file with the positions?
      - SD: is there a statistical link between chromosome size and number of SD? It could explain why the first four have more SD. In general, the data are missing statistics.
      - GO in SD: any statistical validation?
      - Genome comparison: please provide the accession numbers of the genomes you used for comparison.
      - NBS-LRR: the longistaminata genome has 215 genes, against 116 to 289 for other Oryza, so I cannot see any contraction or expansion. In addition, the text here is weird, starting by speaking of contraction and then going to expansion.
      - TF analysis: the African assemblies are quite bad, I think, which explains the discrepancy. For glaberrima, did you check the one from Tranchant-Dubreuil et al., 2023?
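      The statistical link between chromosome size and SD count asked about above could be tested with a rank correlation. The sketch below is one possible approach, not the authors' analysis: it implements Spearman's coefficient with the standard library only, and any chromosome lengths or SD counts passed to it would have to come from the assembly itself.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

      Calling `spearman(chromosome_lengths, sd_counts)` on the 12 chromosomes would give a coefficient near 1 if larger chromosomes systematically carry more SDs; with only 12 points, a permutation test would be a reasonable way to attach a p-value.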

    1. Abstract

      Background: The central bearded dragon (Pogona vitticeps) is widely distributed in central eastern Australia and adapts readily to captivity. Among other attributes, it is distinctive because it undergoes sex reversal from ZZ genotypic males to phenotypic females at high incubation temperatures. Here, we report an annotated telomere to telomere phased assembly of the genome of a female ZW central bearded dragon.

      Results: Genome assembly length is 1.75 Gbp with a scaffold N50 of 266.2 Mbp, N90 of 28.1 Mbp, 26 gaps and 42.2% GC content. Most (99.6%) of the reference assembly is scaffolded into 6 macrochromosomes and 10 microchromosomes, including the Z and W microchromosomes, corresponding to the karyotype. The genome assembly exceeds the standard recommended by the Earth Biogenome Project (6CQ40): 0.003% collapsed sequence, 0.03% false expansions, 99.8% k-mer completeness, 97.9% complete single copy BUSCO genes and an average of 93.5% of transcriptome data mappable back to the genome assembly. The mitochondrial genome (16,731 bp) and the model rDNA repeat unit (length 9.5 Kbp) were assembled. Male vertebrate sex genes Amh and Amhr2 were discovered as copies in the small non-recombining region of the Z chromosome, absent from the W chromosome. This, coupled with the prior discovery of differential Z and W transcriptional isoform composition arising from pseudoautosomal sex gene Nr5a1, suggests that complex interactions between these genes, their autosomal copies and their resultant transcription factors and intermediaries, determine sex in the bearded dragon.

      Conclusion: This high-quality assembly will serve as a resource to enable and accelerate research into the unusual reproductive attributes of this species and for comparative studies across the Agamidae and reptiles more generally.

      Species Taxonomy: Eukaryota; Animalia; Chordata; Reptilia; Squamata; Iguania; Agamidae; Amphibolurinae; Pogona; Pogona vitticeps

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf085), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Heiner Kuhl

      Patel et al. present a genome assembly of the bearded dragon Pogona vitticeps, a lizard species that is widely distributed as a pet and known for its interesting sex determination, which may switch from genetic sex determination (ZW) to temperature-dependent sex reversal. The methods chosen to assemble the genome are very state-of-the-art, including HiFi and ONT long reads, Hi-C and suitable bioinformatic tools.

      I have to admit that I have recently been reviewing a similar manuscript for GigaScience (https://www.biorxiv.org/content/10.1101/2024.09.05.611321v1), where a female ZZ P. vitticeps had been sequenced/assembled from long read data of a different nanopore technology and analysis of the ZW-chromosome was done by short read coverage analysis. One of my major comments was that this approach lacked a true assembly of the W-chromosome. Thus, I am happy to see that the assembly of the W-specific region has been achieved here and the sequencing technologies used might even improve the assembly quality over the ZZ assembly in terms of phasing, consensus accuracy etc. The two manuscripts are highly complementary and I think they should be published, if possible, in the very same issue of GigaScience. Surely both groups have invested a lot of effort. (Reading L. 685, I have just realized that this seems to be the intention of the journal and I very much support this idea.)

      Still there are some minor points that need improvement for the current manuscript:

      Why do you leave the Z and W split into PAR, Z- and W-specific scaffolds rather than assembling the full-length chromosomes (L. 676)? Would the Hi-C data not support that?

      Mitochondrial assembly: from ONT only (L. 307). Please do a consensus correction with Illumina data, or at least show that the MT assembly has a high consensus accuracy (Q40-Q50).

      Genome annotation: show BUSCO scores for annotated proteins (do they fit to BUSCO performed on the whole genome?). If possible, compare to results of the NCBI RefSeq annotation (is it already available?). In this regard please explain the relatively low mapping rates (L. 647) of RNAseq to the annotated sequences.

      Could you provide some expression data for the Z-specific Amh and AmhR2? Are they differentially expressed in testis/ovary (after correction for copy number)?

      Table 1: could you show results for the two different ONT library types (ligation vs. ultralong kit)? It seems the overall yield was low (5 cells -> 100 Gb); any speculation why?

      I think the assembly statistics (Table 2) should also contain contig N50 length as an additional value to show the high continuity of the assembly.
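
      Because this comment and the abstract both rely on N50-type statistics, a minimal sketch of how such a value is usually computed may be useful. The function name `nx` and the toy lengths are illustrative assumptions; they are not taken from the manuscript or from any particular assembly tool.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx length: walking through sequences from longest to
    shortest, the length at which the running total first reaches
    `fraction` of the total assembly length (0.5 gives N50)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]


# Hypothetical contig lengths in arbitrary units, not real assembly data.
toy_contigs = [80, 70, 50, 40, 30, 20]
n50 = nx(toy_contigs, 0.5)
n90 = nx(toy_contigs, 0.9)
```

      The same function applies to contigs or scaffolds; only the input list changes, which is why reporting both values shows how much continuity comes from scaffolding.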

      L. 488: "48.36 (1 error in 146kb)", I think something is wrong here. Q48.36 would be 1 error in 68.5 kb. I suggest re-checking these values and incorporating them in Table 2. The high consensus accuracy is one selling point compared to the competitor's assembly.
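
      The arithmetic behind this comment can be checked with the standard Phred relation, Q = -10 log10(p), where p is the per-base error probability. The sketch below is only a worked check of the two figures quoted above; the function names are illustrative.

```python
import math


def bases_per_error(q):
    """Expected number of bases per error for a Phred-scaled quality q."""
    return 10 ** (q / 10)


def error_rate_to_phred(p):
    """Phred-scaled quality for a per-base error probability p."""
    return -10 * math.log10(p)


# Q48.36 corresponds to roughly one error in 68.5 kb.
bases = bases_per_error(48.36)

# Conversely, one error in 146 kb corresponds to roughly Q51.6.
qv_for_146kb = error_rate_to_phred(1 / 146_000)
```

      In other words, the two numbers quoted at L. 488 cannot both be right: either the QV or the error interval needs correcting.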

      L. 490: "Individual haplotypes were 85.5% complete…". Explain why you are confident that the haplotypes are more complete than the Merqury results suggest (just one sentence).

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      We would like to thank the reviewers for their overall positive evaluations of our manuscript and for their invaluable suggestions that will allow us to reinforce our conclusions. We acknowledge that there is some work to be done and are ready to address most of the reviewers' comments as detailed in our replies below.

      Reviewer #1

      1. The finding that mmDicer is proviral in bat cells relies exclusively on the observation that the depletion of Dicer in M. myotis cells leads to a reduced accumulation of SFV and SINV at the RNA and protein levels (figure 2). Heterologous expression of mmDicer in HEK 293T NoDice doesn't lead to an increased permissivity to viral infections (figure 1) and the accumulation of Dicer foci is only observed in M. myotis cells but not when mmDicer is expressed in HEK 293 NoDice cells (figure 6). Given that the key finding of this manuscript relies on these knockdown experiments, the authors should ensure that the impact on viral infections is due to the specific silencing of mmDicer and not caused by off-target effects of their siRNA-mediated approach. The authors designed a siRNA pool to efficiently knock down mmDicer. They should validate their findings by using individual Dicer siRNAs and verify whether the decreased SFV/SINV accumulation is observed with at least two individual siRNAs targeting Dicer. It would also strengthen their findings if they could show a complementation experiment in which a mmDicer (designed to not be affected by the siRNA-mediated silencing) is introduced exogenously in Dicer-depleted cells and show that it rescues the observed decrease in viral accumulation to demonstrate that the proviral role is strictly dependent on mmDicer. Alternatively, the authors could consider a CRISPR/Cas9 genome editing approach to knock out Dicer in bat cells to test whether this proviral effect is confirmed.

      Reply: We agree with this reviewer that it is important to provide evidence for the specificity of the knock-down and to rule out any off-target effect of the siRNAs. This is the reason for using the siTool technology, which relies on the use of a pool of 30 siRNAs that are transfected at a final concentration of 3 nM. This means that each individual siRNA in the pool is at a concentration of 0.1 nM, so the possibility of off-target effects is largely avoided and the efficiency of silencing is boosted by the cooperative activity of many siRNAs (see https://www.sitoolsbiotech.com/documents/sipools/siPOOLBrochure2019_Web.pdf for more details). That said, we agree that it would be better to confirm that the observed effect can be recapitulated using a single siRNA and that a complementation experiment would definitely strengthen our findings. For this reason, we will test two individual siRNAs targeting the 3' UTR of mmDicer, which will allow us to complement the knock-down by transfecting a cDNA construct. Regarding the CRISPR/Cas9 genome editing approach, we will give it a try, but Dicer is notoriously difficult to knock out, so we cannot be sure that this will be successful.

      Figure 2: the authors knocked down Dicer in M. myotis nasal epithelial cells and carried out infections with SINV-GFP and SFV. The authors conclude that Dicer is proviral as its depletion causes a decrease in SINV-GFP and SFV accumulation. While this conclusion is supported by the decreased levels of viral RNA and protein upon Dicer depletion (figure 2D, 2E, 2G), the effect on the viral titers is non-significant for both viruses (Figure 2C and 2F) based on the statistical analysis. This reviewer appreciates that the titers are lower upon Dicer knockdown, which supports the authors' findings at the viral RNA and protein levels. However, as these results are central to the core message of the manuscript, the authors should provide evidence that the proviral effect on viral titers is statistically significant, perhaps by providing additional repeats, and/or comment on this observation.

      Reply: Indeed, we agree that even though Dicer knockdown lowers the viral titer, it would be better to show a statistically significant effect. We will repeat the experiment to increase the number of replicates and the power of the statistical test.

      a) In figures 4 and 5, the authors nicely show that mmDicer accumulates in cytoplasmic foci in M. myotis cells upon infection with SFV and SINV and that these foci co-localise with double-stranded RNA. The authors used a commercial polyclonal antibody against Dicer (A301-937A, Bethyl according to the Material and Methods section), which is specific to human Dicer, to carry out their immunostaining in bat cells. The authors should provide evidence that this antibody indeed recognises/cross-reacts with mmDicer as well and that the staining shown is indeed specific to mmDicer localisation, especially because the heterologous expression of an HA-tagged version of mmDicer in HEK 293T NoDice cells did not show this accumulation of cytoplasmic foci. The authors should verify the specificity of their mmDicer immunostaining by performing the same labelling in bat cells in which Dicer is knocked down (or knocked out) by individual and validated siRNAs against mmDicer. A decreased signal of bat Dicer staining using the anti-human Dicer antibody would indicate specificity.

      Reply: the reviewer is correct, and it is important to provide evidence that the protein detected by the anti-human Dicer antibody in bat cells is indeed Dicer. We will perform the suggested experiment and do an immunostaining using the Dicer antibody in bat cells upon Dicer knockdown.

      b) Another complementary approach would be to test their Dicer staining between HEK NoDice cells (no Dicer present) versus NoDice complemented with either mmDicer or human Dicer constructs, which would then indicate how much the anti-human Dicer antibody recognises bat Dicer.

      Reply: this complementary approach should yield an even cleaner result than the previous one, as there will be no expression of Dicer at all in the HEK NoDice cells. Therefore, we should be able to measure the increase of signal in the IF upon expression of either human or bat Dicer. We will perform this experiment together with the other one suggested above. In addition, since the constructs are tagged, we might be able to do a double-staining and verify the colocalization of the two signals.

      c) In addition, the authors should overexpress HA-tagged mmDicer in M. myotis nasal epithelial cells and test whether HA-mmDicer accumulates into foci upon infection using an anti-HA immunostaining. This would confirm that this accumulation into foci is indeed specific to mmDicer and would also reinforce the authors' finding that host factors within bat cells are important for foci formation, since mmDicer expression in HEK 293T NoDice cells didn't show this phenotype upon infection (figure 6). OPTIONAL: it would be interesting to overexpress HA-tagged human Dicer in M. myotis nasal epithelial cells as well, to then test using anti-HA staining whether human Dicer, in the presence of host factors from the bat, can accumulate into cytoplasmic foci upon viral infection.

      Reply: we could perform the suggested experiment, but we might face the issue that transfected cells might mount an immune response, which makes them resistant to the infection. We have observed indeed that we needed to use a higher MOI to infect cells after they have been transfected. Since we will have controls in place, this might not be too much of a problem, but we will have to keep it in mind. Alternatively, we will perform a lentiviral transduction of the cells.

      This reviewer appreciates that this might be judged as beyond the scope of this study since it is focused on the role of Dicer in M. myotis. However, the observation that mmDicer accumulates into foci that also contain viral dsRNA is very interesting, and it would significantly improve the manuscript if the authors could provide further indications that this phenotype is related to the lack of antiviral activity of mmDicer compared to what has been previously shown in other bat species (P. alecto and T. brasiliensis). In other words, is this accumulation of mmDicer into foci responsible for its different impact on virus infection? It would therefore be insightful to compare Dicer localisation upon infection in M. myotis versus P. alecto and/or T. brasiliensis bat cells in which Dicer was shown to be antiviral, and test whether this accumulation in foci is only observed in bat cells in which Dicer is proviral (M. myotis) but not in the other bat cells in which Dicer is antiviral (P. alecto and/or T. brasiliensis).

      Reply: this is something that we have been wondering about and we have therefore started to look for the cell lines that have been described in the two published studies. While it proved difficult to find the PaKi cells from P. alecto bats, which are not commercially available, we have obtained the Tblu cells from T. brasiliensis and will look at Dicer localization in this model. However, we have to pay attention to the fact that the published data reported a contribution of RNAi in this cell line upon SARS-CoV-2 infection and that we will be using SINV. In addition, we do not know yet whether the anti-Dicer antibody will cross-react with the T. brasiliensis Dicer protein.

      OPTIONAL: Given the difference between the proviral role of mmDicer and the antiviral activity of Dicer in P. alecto and T. brasiliensis bat cells, it would strengthen the authors' findings if additional experiments were conducted in parallel using M. myotis, P. alecto and/or T. brasiliensis cells: notably, knocking down Dicer in M. myotis, P. alecto and/or T. brasiliensis cells, comparing the impact on viral infections with SINV, SFV and VSV, and correlating any observed difference in phenotype with putative variations in the formation of foci.

      Reply: it would indeed be really nice to be able to do the Dicer knockdown experiment in several bat cell lines and to correlate the phenotype with the formation of foci. This experiment might take a long time and we are not sure to be able to realize it in a reasonable amount of time. It could however be the subject of another manuscript further down the line.

      Minor comments

        • Figure 2I: The authors performed a knockdown of Dicer in M. myotis nasal epithelial cells and monitored the impact on VSV-GFP infection. They found that knocking down Dicer leads to an increase in GFP protein and RNA levels, suggesting an antiviral role of Dicer, while, in contrast, no effect is observed on the production of infectious particles (figure 2H). On the western blot there is only a slight increase in GFP protein level upon Dicer knockdown. Yet, the quantification of the band intensity shows a 4-fold increase relative to tubulin and compared to cells treated with control siRNA. This 4-fold increase seems exaggerated given the small increase in intensity shown on the blot. This discrepancy is most likely due to the lower intensity of tubulin in the western blot analysis of siDicer-treated cells compared to siNeg-treated cells. The authors should rerun their western blot with equal amounts of protein extract loaded to ensure that the results shown on the blot are in line with the quantification.

      Reply: the signal quantification for this experiment was done across several replicates, but we agree that the observed effect seems exaggerated when compared to the signal seen on the blot. We observed important variations between replicates, but we will make sure that this was not due to a problem in the analysis and reload the western blot if needed.
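
      The reviewer's concern about loading can be illustrated with the usual normalisation, in which the target band is divided by the loading control and then expressed relative to the control condition. The sketch below uses entirely hypothetical band intensities (they are not values from Figure 2I) and an illustrative function name.

```python
def normalised_fold_change(target_kd, loading_kd, target_ctrl, loading_ctrl):
    """Fold change of a target band after normalising each lane to its
    loading control, expressed relative to the control lane."""
    ratio_kd = target_kd / loading_kd
    ratio_ctrl = target_ctrl / loading_ctrl
    return ratio_kd / ratio_ctrl


# Hypothetical intensities: GFP is 1.5x higher in the knockdown lane.
# With equal tubulin in both lanes the fold change stays at 1.5.
equal_loading = normalised_fold_change(150, 100, 100, 100)

# If tubulin in the knockdown lane is only 40% of the control lane,
# the same GFP bands yield a fold change of 3.75.
uneven_loading = normalised_fold_change(150, 40, 100, 100)
```

      A modest difference on the blot can therefore turn into a large normalised ratio when the loading control is weaker in one lane, which is the scenario the reviewer suspects.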

        • Figure 3D: the authors mention that in both HEK293T cells and M. myotis nasal epithelial cells infected with SINV-GFP, there was an enrichment of 22-nucleotide (nt) paired positive and negative sense reads that overlapped with a 2-nt overhang, typical of Dicer cleavage. In Figure 3D, the data indeed show that the duplexes are enriched for reads of 22 nt, but it is unclear how this analysis reveals a 3' 2-nt overhang within these duplexes. Can the authors clarify this point? If the data provided in that particular analysis do not allow detection of these overhangs, please rephrase accordingly or provide additional analysis to support that point.

      Reply: In Figure 3D, the graphs show the probability of pairing of all 22-nucleotide sequences mapping either to the plus or the minus strand of the viral RNA. Thus, for each sequence mapping to the plus strand, the number of sequences mapping to the minus strand with a full or partial overlap is counted. A corresponding probability of pairing and Z score are calculated for each number of overlapping nucleotides (for more information on the calculation see Antoniewski (2014) Computing siRNA and piRNA Overlap Signatures. In Animal Endo-SiRNAs: Methods and Protocols, Werner A (ed) pp 135-146. New York, NY: Springer). The Z score peaks for an overlap of 20 nt in both HEK293T and M. myotis nasal epithelial cells infected with SINV. This means that two 22-nt sequences have a higher probability of pairing along 20 nt, and thus that there are two unpaired nucleotides at the extremities of the duplexes. This higher Z score at 20 nt is not seen in VSV-infected cells. We will rephrase the text in the manuscript to make this point clearer.
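
      The logic of this reply can be sketched in a few lines. The code below is a simplified illustration of an overlap signature, not the published implementation cited above: it assumes all reads are 22 nt, represents each read by its leftmost genomic coordinate, counts plus/minus read pairs for every overlap length, and converts the counts to Z scores across overlap lengths. All names and the toy coordinates are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def overlap_signature(plus_starts, minus_starts, read_len=22):
    """Count plus/minus read pairs for each overlap length 1..read_len.

    A plus read starting at s covers s..s+read_len-1. A minus read that
    overlaps it by k nt on its left side starts at s + k - read_len.
    """
    plus_counts = Counter(plus_starts)
    minus_counts = Counter(minus_starts)
    pairs = {}
    for k in range(1, read_len + 1):
        offset = k - read_len
        pairs[k] = sum(n * minus_counts.get(s + offset, 0)
                       for s, n in plus_counts.items())
    return pairs


def z_scores(pairs):
    """Z score of each overlap count relative to all overlap lengths."""
    values = list(pairs.values())
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return {k: 0.0 for k in pairs}
    return {k: (v - mu) / sd for k, v in pairs.items()}


# Toy data: three duplexes with 2-nt 3' overhangs (minus read starts 2 nt
# upstream of its plus partner) plus one unrelated minus read.
plus = [100, 200, 300]
minus = [98, 198, 298, 310]
signature = overlap_signature(plus, minus)
scores = z_scores(signature)
best_overlap = max(scores, key=scores.get)
```

      With 22-nt reads, a peak at an overlap of 20 nt leaves two unpaired nucleotides at each 3' end, which is the Dicer-like signature the authors describe.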

        • Typo: page 5, line 152: the authors mention that Dicer knockdown had an antiviral effect against VSV-GFP infection at the RNA and protein levels. However, the data in Figure 2I and 2J show an increase in both GFP RNA and protein levels upon knockdown of Dicer. Although these data suggest that Dicer is antiviral against VSV, the knockdown of Dicer itself is not antiviral but rather proviral, as it increases virus accumulation. Please rephrase this sentence to avoid confusion.

      Reply: thank you for spotting this typo. We have corrected it accordingly.

      Reviewer #2.

      1. Figure 1 relies on transduction of cells and antibiotic selection to obtain mmDicer-expressing cells. Although we would expect that every cell expresses the construct of interest, this is not always the case, depending on the cell type and toxicity of the construct. As the constructs are tagged, I suggest that the authors use flow cytometry to measure expression levels in a single cell manner. While doing so, they can infect with SINV-GFP and correlate GFP signal with construct expression in each cell, providing a more accurate measurement of mmDicer effect on viral infection. Alternatively, the authors could use live microscopy, as done in Fig 2, to obtain similar data.

      Reply: the reviewer is correct that we did not go for monoclonal selection of our mmDicer-expressing cells and therefore that there could be some cell-to-cell variation in expression. However, we have done immunostaining of Dicer in these cells and did not see drastic differences in expression, so we do not think this should impact SINV-GFP expression in a major way. We will provide these images and a quantification of the Dicer signal as a supplementary figure.

      For Fig 1C and 1F, it would be great to have growth curves with two different MOIs, instead of a single time point, to ensure that a putative antiviral effect is not missed. Same goes for Fig 2C, especially when the authors document quite a big defect on GFP expression (a proxy for SINV infection) when Dicer is knocked down (Fig 2B). There may be a bigger difference in titers at earlier time points. This matter runs throughout the manuscript. I do not suggest that the authors should provide growth curves every time viral titers are measured, but it is still worth doing it for the 2-3 key experiments of the paper.

      Reply: we will perform growth curves of virus infection for the key experiments in the manuscript as suggested. We have already done kinetic measurements of GFP accumulation at different MOIs, which we can provide as supplementary data, but we agree with the reviewer that GFP signal should not be used as the only proxy for the infection and that measuring viral titers by plaque assay is important as well.

      Figure 4, could the authors provide proof that the Dicer antibody is specific in the bat context? This can be done by staining Dicer in bat cells knocked down for Dicer and infected with SINV. The appearance of foci upon anti-Dicer antibody staining should be abrogated or severely impaired by the knock-down.

      Reply: see our reply to point 3 of Reviewer 1.

      Fig 5C, please provide a quantification of the images.

      Reply: these microscopy images have not been quantified because they have been obtained with an epifluorescence microscope. Indeed, the Pearson correlation coefficient can only be obtained using a confocal microscope. In fact, we have tried to use a confocal microscope to take pictures of these FISH images, but the SINV gRNA signal was too weak or the dots too small to be properly visualized. Furthermore, there is a very large difference in signal intensity between HEK293T and M. myotis cells, making it difficult to define a signal threshold compatible for both cell lines.
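
      For readers unfamiliar with the metric named in this reply, the Pearson correlation coefficient used in colocalization analysis is computed from paired pixel intensities in the two channels. The sketch below shows only the formula on hypothetical intensity lists; it says nothing about which microscope modality is suitable, and real analyses normally add background subtraction and region-of-interest masking.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length
    sequences of paired pixel intensities."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant channel has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical intensities for the same four pixels in two channels.
channel_a = [10, 20, 30, 40]
channel_b = [12, 22, 33, 41]
r = pearson(channel_a, channel_b)
```

      Because the coefficient is insensitive to a linear rescaling of either channel, it is less affected by overall brightness differences between cell lines than a fixed intensity threshold would be.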

      l.263, when comparing this work with the recent publications on bat antiviral RNAi, the authors could also provide the percentage identity between Dicers from different species.

      Reply: this is a valid point, we have looked at the percentage identity between Dicer proteins from different bat species but we did not include this in our manuscript. We will provide this analysis in the revised version together with a comparison of Dicer from other mammals as a reference point.
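
      The percentage identity discussed here depends on a sequence alignment and on the choice of denominator. A minimal sketch is given below, assuming two already-aligned sequences of equal length and counting only columns where neither sequence has a gap; other conventions (for example dividing by the full alignment length) give different numbers. The function name and toy peptides are hypothetical and are not Dicer sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence
    carries a gap character."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy aligned peptides: 8 ungapped columns, 7 of them identical.
toy_a = "MKS-LLPEN"
toy_b = "MKSALIPEN"
identity = percent_identity(toy_a, toy_b)
```

      Stating the convention alongside the values would make the planned comparison between bat and other mammalian Dicers easier to reproduce.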

      Reviewer 3.

        • Without direct comparison to the other bat species' Dicers (especially where RNAi activity has been suggested as antiviral in previous publications), there is little in this paper that can be concluded about global aspects of bat Dicer/RNAi.

      Reply: see our reply to point 4 of Reviewer 1. We are planning to look at least in Tblu cells whether there is also a relocalization of Dicer upon SINV infection. So far, we could not obtain PaKi cells, but we are still looking and should we get those, we will test them as well.

      Minor

      What rules out that the mmDicer re-localization observed in the immortalized mm nasal epithelial cells is due simply to greater expression levels than in the NoDice cells heterologously expressing mmDicer?

      Reply: we will provide an immunoblot to show the level of Dicer expression between HEK NoDice + mmDicer and M. myotis nasal epithelial cells as suggested below to address this point.

      • Although partially addressed in the text stating the generally long half-life of miRNAs, the simplest explanation for this observation seems to be that some activity of a shorter-lived miRNA is required for optimal alphavirus replication in the mm nasal epithelial cells.

      Reply: this is an interesting hypothesis that would prove difficult to test in a reasonable amount of time. We thank the reviewer and will mention this possibility in the discussion of the revised manuscript.

      Suggestions that could enhance the magnitude of conclusions that can be drawn from this work.

      Major

        • Making NoDice cells expressing other bat species Dicers, including those with claims that RNAi is antiviral, would address how universal these current observations are to bats/cell lines.

      Reply: this could be an alternative to the use of P. alecto or T. brasiliensis cell lines that we have mentioned above. We will try to clone Dicer from the Tblu cells that we have in the laboratory. Since we do not have PaKi cells at the moment, it will be more complicated for the Pteropus Dicer, but one possibility could be to synthesize it. However, Dicer is a big gene so it could prove tricky.

        • Including an immunoblot showing that mm cells express mmDicer no more abundantly than the heterologous NoDice cells would allow ruling out the trivial explanation that foci occur at a certain critical mass of Dicer.

      Reply: yes, we will provide this piece of data as stated in reply to point 2.

      Minor

        • I believe line 151, "In contrast, Dicer knock down had an ANTIVIRAL effect against VSV-GFP infection at the RNA and protein levels, but no difference in titers was found (Fig. 2H-J).", should be "In contrast, Dicer knock down had a PROVIRAL effect against VSV-GFP infection at the RNA and protein levels, but no difference in titers was found (Fig. 2H-J)."

      Reply: thank you for spotting this error, which was also mentioned by Reviewer 1, we have corrected this in the text.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Manuscript number: RC-2025-02946

      Corresponding author(s): Margaret, Frame

      Roza, Masalmeh

      1. General Statements [optional]

      We thank the reviewers for recognizing the significance of our work and for their constructive feedback and suggestions, most of which we have implemented in our revised manuscript.

      2. Point-by-point description of the revisions

      Reviewer #1

      Evidence, reproducibility and clarity

      Review of Masalmeh et al. Title: "FAK modulates glioblastoma stem cell energetics..."

      Previous studies have implicated FAK and the related tyrosine kinase PYK2 in glioblastoma growth, cell migration, and invasion. Herein, using a murine stem cell model of glioblastoma, the authors used CRISPR to inactivate FAK, selected and cloned FAK-null cells, and re-expressed murine FAK in the FAK-null cells by lentiviral transduction (termed FAK Rx). FAK-/- cells were shown to possess epithelial characteristics, whereas FAK Rx cells expressed mesenchymal markers and showed increased cell migration/invasion in vitro. Comparisons between FAK-/- and FAK Rx cells showed that FAK re-expression increased mitochondrial respiration and amino acid uptake. This was associated with FAK Rx cells exhibiting filamentous mitochondrial morphology (potentially an OXPHOS phenotype) and decreased levels of MTFR1L S235 phosphorylation (implicated in mito morphology fragmentation). Mito and epithelial cell morphology of FAK-/- cells was reversed by treatment with Rho-kinase inhibitors that also increased mito metabolism and cell viability. Last, FAK-dependent glioblastoma tumor growth was shown by comparisons of FAK-/- and FAK Rx implantation studies.

      The studies by Masalmeh et al. provide interesting findings associating FAK expression with changes in mitochondrial morphology, energy metabolism, and glutamate uptake. According to the authors' model, FAK expression supports a glioblastoma stem cell-like phenotype in vitro and tumor growth in vivo. What remains unclear is the mechanistic connection to cell changes and whether or not these are dependent on intrinsic FAK activity or, as the Frame group has previously published, potentially FAK nuclear localization. The associations with MTFR1L phosphorylation and effects by Rho kinase inhibition are likely indirect and remind this reviewer of long-ago studies with FAK-null fibroblasts that exhibit epithelial characteristics, still express PYK2, and exhibit elevated RhoA GTPase activity. Some of these phenotypes were linked to changes in RhoGEF and RhoGAP signaling with FAK and/or Pyk2. At a minimum, it would be informative to know whether Pyk2 signaling is relevant for observed phenotypes and whether the authors can further support their associations with FAK-targeted or FAK-Pyk2-targeted inhibitors or PROTACs.

      Some questions that would enhance potential impact. 1. Cell generation. Please describe the analysis of FAK-/- clones in more detail. The "low viability" phenotype needs further explanation with regard to clonal expansion and growth characteristics.

      Response:

      • We included a better description and a supplementary figure in our revised manuscript to indicate that we have examined several FAK -/- clones and confirmed that our observations were not due to clonal variation; multiple clones displayed similar morphological changes (Figure S1D). We also show that the elongated mesenchymal-like morphology was observed at 48 h after nucleofecting the cells with the FAK‑expressing vector, before beginning G418 selection to enrich for cells expressing FAK (Figure S1C). We also included experiments to acutely modulate FAK signalling (detaching and seeding cells on fibronectin) (Figure S2D, E, F and Figure S3) to exclude the possibility that the profound effects are due to protocols/selection we used for generating FAK-deleted cells.
      • Regarding the term "low viability", we have clarified in the text that there is no significant difference in cell number (Figure S1A) or 'cell viability' when it is assessed by trypan blue exclusion (a non-mitochondria-dependent read-out) (Figure S1B) between FAK-expressing FAK Rx and FAK-/- cells cultured for three days under normal conditions. Therefore, we agree the term 'cell viability' in this context could be confusing and have replaced "cell viability" with "metabolic activity as measured by Alamar Blue" in Figure 1D and Figure 5B, and in the corresponding text of the original manuscript. This wording more accurately reflects the data.

      Figure 1F: need further support of MET change upon FAK KO and EMT reversion.

      Response: We have added a heatmap (Figure S1E) illustrating the changes in protein expression of core-enriched EMT/MET gene products (by proteomics) after FAK gene deletion (EMT genes as defined in Howe et al., 2018); this strengthens the conclusion that the MET reversion morphological phenotype is accompanied by recognised MET protein changes.

      Fig. 2: Need further support if FAK effects impact glycolysis or oxidative phosphorylation in particular as implicated by the stem cell model.

      Response: We show that FAK impacts both glycolysis (Figure 2A, 2E, and 2F) and mitochondrial oxidative phosphorylation on the basis of the oxygen consumption rate (OCR) (Figure 2B, and 2D), showing both are contributing pathways to FAK-dependent energy production. We have clarified this in the text.

      Is there a combinatorial potential between FAKi and chemotherapies used for glioblastoma? Need to build upon past studies.

      Response: Yes, previous studies suggest that inhibiting FAK can sensitize GBM cells to chemotherapy (Golubovskaya et al., 2012; Ortiz-Rivera et al., 2023). We have included a paragraph in the discussion section to make sure this is clearer. Although it is not the subject of this study, we appreciate it is useful context.

      The notation of changes in glucose transporter expression should be followed up with regard to the potential that FAK-expressing cells may have different uptake of carbon sources and other amino acids. Altered uptake could be one potential explanation for increased glycolysis and glutamine flux.

      Response: We agree with the reviewer that glucose uptake could be contributing, and we include data showing that two glucose transporters are indeed FAK-regulated, namely Glucose transporter 1 (GLUT1, encoded by the Slc2a1 gene) and Glucose transporter 3 (GLUT3, encoded by the Slc2a3 gene) (shown in Figure S2B and C).

      It would be helpful to support the confocal microscopy of mitos with EM.

      Response:

      We are concerned, based on our experience, that electron microscopy (EM) may introduce artefacts during sample preparation. In contrast, immunofluorescence sample preparation is less susceptible to artefacts. The SORA system we used is not a conventional point-scanning confocal microscope, but a super-resolution module based on a spinning disk confocal platform (CSU-W1; Yokogawa) using optical pixel reassignment with confocal detection. This method enhances resolution in all dimensions, with resolution in our samples measured at 120 nm. This has been instructive in defining a new level of changes in mitochondrial morphology upon FAK gene deletion.

      Lack of FAK expression with increased MTFR1 phosphorylation is difficult to interpret.

      Response: We do not directly show that this phosphorylation event is causal in our experiments; however, we think it is important to document this change, since phosphorylation of MTFR1L has been causally linked in other systems to the mitochondrial morphology we observed (Tilokani et al., 2022).

      Need to have better support between loss of FAK and the increase in Rho signaling. Use of Rho kinase inhibitors is very limited and the context to FAK (and or Pyk2) remains unclear. Past studies have linked integrin adhesion to ECM as a linkage between FAK activation and the transient inhibition of RhoA GTP binding. Is integrin signaling and FAK involved in the cell and metabolism phenotypes in this new model?

      Response: To better support the antagonistic effect of FAK on Rho-kinase (ROCK) signalling, we included a new experiment in which the integrin-FAK signalling pathway was disrupted by treating FAK WT cells with Accutase, an agent that causes detachment from the substratum, and growing the cells in suspension in laminin-free medium. We present ROCK activity data, as judged by phosphorylated MLC2 at serine 19 (pMLC2 S19), and relate this to induced FAK phosphorylation at Y397 (a surrogate for FAK activity), which is suppressed after integrin disengagement. These measurements were compared with conditions in which integrin-FAK signalling is activated by growing the cells on laminin-coated surfaces. We observed a time-dependent decrease in pFAK(Y397) levels (normalised to total FAK) in suspended cells compared to those spread on laminin, while pMLC2(S19) levels increased in a reciprocal manner over time in detached cells relative to spread cells (Figure S4A and B). There is therefore an inverse relationship between integrin-FAK signalling and ROCK-MLC2 activity, consistent with findings from the FAK gene deletion experiments. In the detachment experiment, we do not rely on gene deletion cell clones.

      Significance

      The studies by Masalmeh provide interesting findings associating FAK expression with changes in mitochondrial morphology, energy metabolism, and glutamate uptake. According to the authors' model, FAK expression supports a glioblastoma stem cell-like phenotype in vitro and tumor growth in vivo. What remains unclear is the mechanistic connection to cell changes and whether or not these are dependent on intrinsic FAK activity or, as the Frame group has previously published, potentially on FAK nuclear localization. The associations with MTFR1L phosphorylation and effects of Rho kinase inhibition are likely indirect and remind this reviewer of long-ago studies with FAK-null fibroblasts that exhibit epithelial characteristics, still express PYK2, and exhibit elevated RhoA GTPase activity. Some of these phenotypes were linked to changes in RhoGEF and RhoGAP signaling with FAK and/or Pyk2. At a minimum, it would be informative to know whether Pyk2 signaling is relevant for observed phenotypes and whether the authors can further support their associations with FAK-targeted or FAK-Pyk2-targeted inhibitors or PROTACs.

      Response:

      Deleting the gene encoding FAK in mouse embryonic fibroblasts leads to elevated Pyk2 expression (Sieg, 2000). However, in the GBM stem cell model we used here, Pyk2 was not expressed (as determined by both transcriptomics and proteomics). We have included a figure showing that Pyk2 expression was undetectable in FAK -/- and FAK Rx cells at the RNA level (Figure S1F). We conclude that there is no compensatory increase in Pyk2 upon FAK loss in these cells. In the transformed neural stem cell model of GBM, we also do not consistently or robustly detect nuclear FAK.

      Review #2

      Masalmeh and colleagues employ a neural stem/progenitor cell-based glioma model (NPE cells) to investigate the role of Focal Adhesion Kinase (FAK) in GBM, with a focus on potential links between the regulation of morphological/adhesive and metabolic GBM cell properties. For this, the authors employ wt cells alongside newly generated FAK-KO and -reexpressing cells, as well as pharmacological interventions to probe the relevance of specific signaling pathways. The authors' main claims are that FAK crucially modulates glioma cell morphology, cell-cell and cell-substrate interactions and motility, as well as their metabolism, and that these effects translate to changes to relevant in vivo properties such as invasion and tumor growth.

      My main issues are with the model chosen by the authors.

      As per the methods section, generation of FAK-KO and -"Rx" NPE cells entailed protracted selection/expansion processes, which may have resulted in inadvertent selection for cellular/molecular properties unrelated to the desired one (loss or gain of FAK expression) and which may have had cascading effects on NPE cells. The authors nonetheless repeatedly claim the parameters they quantify, such as mitochondrial or cytoskeletal properties or metabolic features, to have directly resulted from FAK loss or reintroduction. Examples of such causal inferences are to be found in lines 123, 134/135, 165, 181. Such causal claims are, in my view, unsupported.

      Acute perturbation of FAK expression/activity, genetically or pharmacologically, followed by a rapid assessment of the processes under investigation, would be needed to begin to assess causality, even if acute genetic perturbations may be technically challenging as sufficient gene expression reduction or restoration to physiologically relevant levels may be hard to achieve.

      Response:

      We would first like to comment on the model we used here, which we think will clarify the validity of our approach. The model is a transformed stem cell model of GBM that was published in Gangoso et al. (Cell, 2021) and is now used regularly in the GBM field. As mentioned in the response to Reviewer 1, we have added text (pages 4 and 5 in the revised manuscript) and a new supplementary figure (Figure S1D) clarifying that the morphological changes we observed were consistent across multiple FAK -/- clones, showing that they were not due to inter-clonal variability. We also added images showing that the morphological changes were apparent at 48 h after nucleofecting FAK -/- cells with the FAK-expressing vector specifically (not the empty vector), prior to starting G418 selection to enrich for FAK-expressing cells (Figure S1C). This addresses the concern that clonal variation and selection were the cause of the FAK-dependent phenotypes we observed. We believe that our model provides a well-controlled, clean genetic cancer cell system of a type that is commonly used in cancer cell biology, allowing us to attribute phenotypes to individual proteins.

      We have also carried out a more acute treatment, using the FAK inhibitor VS4718 to perturb FAK kinase activity, and assessed the effects on glycolysis and glutamine oxidation after 48 h of treatment (Figure S2D, E and F). We found that treating the transformed neural stem cells (parental population) with the FAK inhibitor (300 nM VS4718) decreased glucose incorporation into glycolysis intermediates and glutamine incorporation into TCA cycle intermediates, consistent with a role for FAK's kinase activity in maintaining glycolysis and glutamine oxidation.

      The employed pharmacological modulation of ROCK activity is the only approach that, given the presumably acute nature of the treatment, may have allowed the authors to probe the proposed functional links. The methods section of the manuscript does not however comprise details as to the duration of these treatments, which leaves open the possibility of long-term treatment having been carried out (data shown in Figure 5B refers to 72hr treatment).

      Response:

      We have added the duration of the treatment to the Methods section and Figure Legends to clarify that cells were treated with ROCK inhibitors for 24 h before assessing the effects on mitochondria (Figure 4C, D, S4C and D) and glutamine oxidation (Figure 5A and S5). For the AlamarBlue assay of metabolic activity, cells were treated with ROCK inhibitors for 72 h (Figure 5B).

      Even in the case of ROCK inhibitor experiments, it is however unclear if and how the effects on cell morphology and adhesion, mitochondrial organization and metabolic activity may be connected to each other and, if at all, to FAK expression.

      Given the above uncertainties due to the nature of the model and experimental approaches, it is hard to assess the reliability and thus the relevance of the findings.

      Response:

      FAK suppresses ROCK activity (as judged by pMLC2 S19, Figure 4A and B). Treating FAK -/- cells with two different ROCK inhibitors restored mesenchymal-like cell morphology, mitochondrial morphology and glutamine oxidation. As mentioned above, to strengthen our evidence for the antagonistic role of FAK in ROCK-MLC2 signalling, we have now introduced an experiment in which integrin-FAK signalling was disrupted by treating the cells with a detachment agent (Accutase) and subsequently maintaining them in suspension in laminin-free medium. We assessed pMLC2 S19 levels (a measure of ROCK activity) and related these to FAK phosphorylation, which is suppressed after integrin disengagement. These results were evaluated relative to spread wild-type cells growing on laminin, where integrin-FAK signalling was active (Figure S4A and B). We observed an inverse relationship between integrin-FAK signalling and ROCK-MLC2 activity, in keeping with our conclusions (Figure 4A and B).

      Experimental support for the ability of cell-substrate interaction modulation to concomitantly impact cellular metabolism and motility/invasion would be significant both in terms of advancing our understanding of glioma cell biology and of its translational potential, but the evidence being provided is at best compatible with the proposed model.

      Response: We carried out a new experiment to support the ability of cell-substrate interaction modulation to impact metabolism; specifically, we inhibited cell-substrate interactions by plating the cells on Poly-2-hydroxyethyl methacrylate (Poly 2-HEMA)-coated dishes. This suppressed FAK phosphorylation at Y397, as expected, with concomitant reduction in glutamine utilisation in the TCA cycle (Figure S3A, B and C).

      My background/expertise is in developmental and adult neurogenesis, in vivo modelling of gliomagenesis and cell fate control/reprogramming, with a focus on molecular mechanisms of differentiation and quantitative aspects of lineage dynamics; molecular details of the control of cellular metabolism, cell-cell adhesion and cytoskeletal dynamics are not core expertise of mine.

      We appreciate that this reviewer's expertise is not necessarily in the cancer cell biology and genetic intervention aspects of our study. We hope that the explanations we have provided satisfy the reviewer that our conclusions are valid.

    2. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Manuscript number: RC-2025-02946

      Corresponding author(s): Margaret, Frame

      Roza, Masalmeh


      1. General Statements [optional]


      We thank the reviewers for recognizing the significance of our work and for their constructive feedback and suggestions, most of which we have implemented in our revised manuscript.

      2. Point-by-point description of the revisions


      Reviewer #1

      Evidence, reproducibility and clarity

      Review of Masalmeh et al. Title: "FAK modulates glioblastoma stem cell energetics..."

      Previous studies have implicated FAK and the related tyrosine kinase PYK2 in glioblastoma growth, cell migration, and invasion. Herein, using a murine stem cell model of glioblastoma, the authors used CRISPR to inactivate FAK, selected and cloned FAK-null cells, and accomplished lentiviral re-expression of murine FAK in the FAK-null cells (termed FAK Rx). FAK-/- cells were shown to possess epithelial characteristics, whereas FAK Rx cells expressed mesenchymal markers and showed increased cell migration/invasion in vitro. Comparisons between FAK-/- and FAK Rx cells showed that FAK re-expression increased mitochondrial respiration and amino acid uptake. This was associated with FAK Rx cells exhibiting filamentous mitochondrial morphology (potentially an OXPHOS phenotype) and decreased levels of MTFR1L S235 phosphorylation (implicated in mito morphology fragmentation). Mito and epithelial cell morphology of FAK-/- cells was reversed by treatment with Rho-kinase inhibitors that also increased mito metabolism and cell viability. Last, FAK-dependent glioblastoma tumor growth was shown by comparisons of FAK-/- and FAK Rx implantation studies.

      The studies by Masalmeh provide interesting findings associating FAK expression with changes in mitochondrial morphology, energy metabolism, and glutamate uptake. According to the authors' model, FAK expression supports a glioblastoma stem cell-like phenotype in vitro and tumor growth in vivo. What remains unclear is the mechanistic connection to cell changes and whether or not these are dependent on intrinsic FAK activity or, as the Frame group has previously published, potentially on FAK nuclear localization. The associations with MTFR1L phosphorylation and effects of Rho kinase inhibition are likely indirect and remind this reviewer of long-ago studies with FAK-null fibroblasts that exhibit epithelial characteristics, still express PYK2, and exhibit elevated RhoA GTPase activity. Some of these phenotypes were linked to changes in RhoGEF and RhoGAP signaling with FAK and/or Pyk2. At a minimum, it would be informative to know whether Pyk2 signaling is relevant for observed phenotypes and whether the authors can further support their associations with FAK-targeted or FAK-Pyk2-targeted inhibitors or PROTACs.

      Some questions that would enhance potential impact. 1. Cell generation. Please describe the analysis of FAK-/- clones in more detail. The "low viability" phenotype needs further explanation with regard to clonal expansion and growth characteristics?

      Response:

      • We included a better description and a supplementary figure in our revised manuscript to indicate that we have examined several FAK -/- clones and confirmed that our observations were not due to clonal variation; multiple clones displayed similar morphological changes (Figure S1D). We also show that the elongated mesenchymal-like morphology was observed at 48 h after nucleofecting the cells with the FAK‑expressing vector, before beginning G418 selection to enrich for cells expressing FAK (Figure S1C). We also included experiments to acutely modulate FAK signalling (detaching and seeding cells on fibronectin) (Figure S2D, E, F and Figure S3) to exclude the possibility that the profound effects are due to protocols/selection we used for generating FAK-deleted cells.
      • Regarding the term "low viability", we have clarified in the text that there is no significant difference in cell number (Figure S1A) or in 'cell viability' assessed by trypan blue exclusion (a non-mitochondria-dependent read-out) (Figure S1B) between FAK-expressing FAK Rx and FAK-/- cells cultured for three days under normal conditions. We therefore agree that the term 'cell viability' could be confusing in this context and have replaced "cell viability" with "metabolic activity as measured by Alamar Blue" in Figure 1D and Figure 5B and in the corresponding text of the original manuscript. This wording more accurately reflects the data.

      Figure 1F: need further support of MET change upon FAK KO and EMT reversion.

      Response: We have added a heatmap (Figure S1E) illustrating the changes in protein expression of core-enriched EMT/MET gene products (by proteomics) after FAK gene deletion (EMT genes as defined in Howe et al., 2018); this strengthens the conclusion that the MET reversion morphological phenotype is accompanied by recognised MET protein changes.

      Fig. 2: Need further support if FAK effects impact glycolysis or oxidative phosphorylation in particular as implicated by the stem cell model.

      Response: We show that FAK impacts both glycolysis (Figure 2A, 2E, and 2F) and mitochondrial oxidative phosphorylation on the basis of the oxygen consumption rate (OCR) (Figure 2B, and 2D), showing both are contributing pathways to FAK-dependent energy production. We have clarified this in the text.

      Is there a combinatorial potential between FAKi and chemotherapies used for glioblastoma. Need to build upon past studies.

      Response: Yes, previous studies suggest that inhibiting FAK can sensitize GBM cells to chemotherapy (Golubovskaya et al., 2012; Ortiz-Rivera et al., 2023). We have included a paragraph in the discussion section to make sure this is clearer. Although it is not the subject of this study, we appreciate it is useful context.

      The notation of changes in glucose transporter expression should be followed up with regard to the potential that FAK-expressing cells may have different uptake of carbon sources and other amino acids. Altered uptake could be one potential explanation for increased glycolysis and glutamine flux.

      Response: We agree with the reviewer that glucose uptake could be contributing, and we include data showing that two glucose transporters are indeed FAK-regulated, namely glucose transporter 1 (GLUT1, encoded by the Slc2a1 gene) and glucose transporter 3 (GLUT3, encoded by the Slc2a3 gene) (Figure S2B and C).

      It would be helpful to support the confocal microscopy of mitos with EM.

      Response:

      In our experience, electron microscopy (EM) may introduce artefacts during sample preparation, whereas immunofluorescence sample preparation is less susceptible to them. The SORA system we used is not a conventional point-scanning confocal microscope but a super-resolution module based on a spinning disk confocal platform (CSU-W1; Yokogawa) that uses optical pixel reassignment with confocal detection. This method enhances resolution in all dimensions, with resolution in our samples measured at 120 nm. This has been instructive in defining a new level of changes in mitochondrial morphology upon FAK gene deletion.

      Lack of FAK expression with increased MTFR1 phosphorylation is difficult to interpret.

      Response: We do not directly show that this phosphorylation event is causal in our experiments; however, we think it is important to document this change, since phosphorylation of MTFR1L has been causally linked in other systems to the mitochondrial morphology we observed (Tilokani et al., 2022).

      Need to have better support between loss of FAK and the increase in Rho signaling. Use of Rho kinase inhibitors is very limited and the context to FAK (and or Pyk2) remains unclear. Past studies have linked integrin adhesion to ECM as a linkage between FAK activation and the transient inhibition of RhoA GTP binding. Is integrin signaling and FAK involved in the cell and metabolism phenotypes in this new model?

      Response: To better support the antagonistic effect of FAK on Rho-kinase (ROCK) signalling, we included a new experiment in which the integrin-FAK signalling pathway was disrupted by treating FAK WT cells with Accutase, an agent that causes detachment from the substratum, and growing the cells in suspension in laminin-free medium. We present ROCK activity data, as judged by phosphorylated MLC2 at serine 19 (pMLC2 S19), and relate this to induced FAK phosphorylation at Y397 (a surrogate for FAK activity), which is suppressed after integrin disengagement. These measurements were compared with conditions in which integrin-FAK signalling is activated by growing the cells on laminin-coated surfaces. We observed a time-dependent decrease in pFAK(Y397) levels (normalised to total FAK) in suspended cells compared to those spread on laminin, while pMLC2(S19) levels increased in a reciprocal manner over time in detached cells relative to spread cells (Figure S4A and B). There is therefore an inverse relationship between integrin-FAK signalling and ROCK-MLC2 activity, consistent with findings from the FAK gene deletion experiments. In the detachment experiment, we do not rely on gene deletion cell clones.

      Significance

      The studies by Masalmeh provide interesting findings associating FAK expression with changes in mitochondrial morphology, energy metabolism, and glutamate uptake. According to the authors' model, FAK expression supports a glioblastoma stem cell-like phenotype in vitro and tumor growth in vivo. What remains unclear is the mechanistic connection to cell changes and whether or not these are dependent on intrinsic FAK activity or, as the Frame group has previously published, potentially on FAK nuclear localization. The associations with MTFR1L phosphorylation and effects of Rho kinase inhibition are likely indirect and remind this reviewer of long-ago studies with FAK-null fibroblasts that exhibit epithelial characteristics, still express PYK2, and exhibit elevated RhoA GTPase activity. Some of these phenotypes were linked to changes in RhoGEF and RhoGAP signaling with FAK and/or Pyk2. At a minimum, it would be informative to know whether Pyk2 signaling is relevant for observed phenotypes and whether the authors can further support their associations with FAK-targeted or FAK-Pyk2-targeted inhibitors or PROTACs.

      Response:

      Deleting the gene encoding FAK in mouse embryonic fibroblasts leads to elevated Pyk2 expression (Sieg, 2000). However, in the GBM stem cell model we used here, Pyk2 was not expressed (as determined by both transcriptomics and proteomics). We have included a figure showing that Pyk2 expression was undetectable in FAK -/- and FAK Rx cells at the RNA level (Figure S1F). We conclude that there is no compensatory increase in Pyk2 upon FAK loss in these cells. In the transformed neural stem cell model of GBM, we also do not consistently or robustly detect nuclear FAK.

      Review #2

      Masalmeh and colleagues employ a neural stem/progenitor cell-based glioma model (NPE cells) to investigate the role of Focal Adhesion Kinase (FAK) in GBM, with a focus on potential links between the regulation of morphological/adhesive and metabolic GBM cell properties. For this, the authors employ wt cells alongside newly generated FAK-KO and -reexpressing cells, as well as pharmacological interventions to probe the relevance of specific signaling pathways. The authors' main claims are that FAK crucially modulates glioma cell morphology, cell-cell and cell-substrate interactions and motility, as well as their metabolism, and that these effects translate to changes to relevant in vivo properties such as invasion and tumor growth.

      My main issues are with the model chosen by the authors.

      As per the methods section, generation of FAK-KO and -"Rx" NPE cells entailed protracted selection/expansion processes, which may have resulted in inadvertent selection for cellular/molecular properties unrelated to the desired one (loss or gain of FAK expression) and which may have had cascading effects on NPE cells. The authors nonetheless repeatedly claim the parameters they quantify, such as mitochondrial or cytoskeletal properties or metabolic features, to have directly resulted from FAK loss or reintroduction. Examples of such causal inferences are to be found in lines 123, 134/135, 165, 181. Such causal claims are, in my view, unsupported.

      Acute perturbation of FAK expression/activity, genetically or pharmacologically, followed by a rapid assessment of the processes under investigation, would be needed to begin to assess causality, even if acute genetic perturbations may be technically challenging as sufficient gene expression reduction or restoration to physiologically relevant levels may be hard to achieve.

      Response:

      We would first like to comment on the model we used here, which we think will clarify the validity of our approach. The model is a transformed stem cell model of GBM that was published in Gangoso et al. (Cell, 2021) and is now used regularly in the GBM field. As mentioned in the response to Reviewer 1, we have added text (pages 4 and 5 in the revised manuscript) and a new supplementary figure (Figure S1D) clarifying that the morphological changes we observed were consistent across multiple FAK -/- clones, showing that they were not due to inter-clonal variability. We also added images showing that the morphological changes were apparent at 48 h after nucleofecting FAK -/- cells with the FAK-expressing vector specifically (not the empty vector), prior to starting G418 selection to enrich for FAK-expressing cells (Figure S1C). This addresses the concern that clonal variation and selection were the cause of the FAK-dependent phenotypes we observed. We believe that our model provides a well-controlled, clean genetic cancer cell system of a type that is commonly used in cancer cell biology, allowing us to attribute phenotypes to individual proteins.

      We have also carried out a more acute treatment, using the FAK inhibitor VS4718 to perturb FAK kinase activity, and assessed the effects on glycolysis and glutamine oxidation after 48 h of treatment (Figure S2D, E and F). We found that treating the transformed neural stem cells (parental population) with the FAK inhibitor (300 nM VS4718) decreased glucose incorporation into glycolysis intermediates and glutamine incorporation into TCA cycle intermediates, consistent with a role for FAK's kinase activity in maintaining glycolysis and glutamine oxidation.

      The employed pharmacological modulation of ROCK activity is the only approach that, given the presumably acute nature of the treatment, may have allowed the authors to probe the proposed functional links. The methods section of the manuscript does not however comprise details as to the duration of these treatments, which leaves open the possibility of long-term treatment having been carried out (data shown in Figure 5B refers to 72hr treatment).

      Response:

      We have added the duration of the treatment to the Methods section and Figure Legends to clarify that cells were treated with ROCK inhibitors for 24 h before assessing the effects on mitochondria (Figure 4C, D, S4C and D) and glutamine oxidation (Figure 5A and S5). For the AlamarBlue assay of metabolic activity, cells were treated with ROCK inhibitors for 72 h (Figure 5B).

      Even in the case of ROCK inhibitor experiments, it is however unclear if and how the effects on cell morphology and adhesion, mitochondrial organization and metabolic activity may be connected to each other and, if at all, to FAK expression.

      Given the above uncertainties due to the nature of the model and experimental approaches, it is hard to assess the reliability and thus the relevance of the findings.

      Response:

      FAK suppresses ROCK activity (as judged by pMLC2 S19, Figure 4A and B). Treating FAK -/- cells with two different ROCK inhibitors restored mesenchymal-like cell morphology, mitochondrial morphology and glutamine oxidation. As mentioned above, to strengthen our evidence for the antagonistic role of FAK in ROCK-MLC2 signalling, we have now introduced an experiment in which integrin-FAK signalling was disrupted by treating the cells with a detachment agent (Accutase) and subsequently maintaining them in suspension in laminin-free medium. We assessed pMLC2 S19 levels (a measure of ROCK activity) and related these to FAK phosphorylation, which is suppressed after integrin disengagement. These results were evaluated relative to spread wild-type cells growing on laminin, where integrin-FAK signalling was active (Figure S4A and B). We observed an inverse relationship between integrin-FAK signalling and ROCK-MLC2 activity, in keeping with our conclusions (Figure 4A and B).

      Experimental support for the ability of cell-substrate interaction modulation to concomitantly impact cellular metabolism and motility/invasion would be significant both in terms of advancing our understanding of glioma cell biology and of its translational potential, but the evidence being provided is at best compatible with the proposed model.

      Response: We carried out a new experiment to support the ability of cell-substrate interaction modulation to impact metabolism; specifically, we inhibited cell-substrate interactions by plating the cells on Poly-2-hydroxyethyl methacrylate (Poly 2-HEMA)-coated dishes. This suppressed FAK phosphorylation at Y397, as expected, with concomitant reduction in glutamine utilisation in the TCA cycle (Figure S3A, B and C).

      My background/expertise is in developmental and adult neurogenesis, in vivo modelling of gliomagenesis and cell fate control/reprogramming, with a focus on molecular mechanisms of differentiation and quantitative aspects of lineage dynamics; molecular details of the control of cellular metabolism, cell-cell adhesion and cytoskeletal dynamics are not core expertise of mine.

      We appreciate that this reviewer's expertise is not necessarily in the cancer cell biology and genetic intervention aspects of our study. We hope that the explanations we have provided satisfy the reviewer that our conclusions are valid.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      RESPONSE TO REVIEWERS

      We thank the reviewers for their thoughtful and constructive feedback, which has been instrumental in improving the overall quality of our manuscript.

      In response, we have undertaken a substantial revision that includes new experimental data, refined analyses, and clearer presentation of our findings. Specifically, we have addressed concerns about RNAi efficiency and protein-level validation, expanded our genetic models to include loss-of-function contexts, and clarified the interpretation of mitochondrial morphology using both confocal and electron microscopy. We also incorporated new data on Cyclin E regulation and mitochondrial membrane potential to strengthen the mechanistic link between dPGC1 depletion and Yki-driven tumorigenesis. These revisions not only address the specific points raised by the reviewers but also enhance the coherence and impact of the study. We are confident that the revised manuscript presents a more robust and compelling case for the role of dPGC1 as a context-dependent tumor suppressor and that it will be of broad interest to the fields of developmental biology, cancer metabolism, and mitochondrial dynamics.

      Reviewer #1 (Evidence, reproducibility and clarity (Required)): Sew et al. examine the master regulator of mitochondrial biogenesis, dPGC1, in the context of Drosophila wing and larval development. They primarily use confocal imaging to probe the interplay between dPGC1 and an overactive Hippo pathway, driven by overexpression of the main effector protein, Yki. In their study, they find that tumors, driven by overactivity of Yki grow larger when dPGC1 is downregulated, implicating the mitochondrial biogenesis pathway in tumor suppression. Furthermore, in the context of Yki overexpression, they find that levels of Mfn or Opa1 modulate tumor size. Lastly, they show a role of cyclin E in controlling the size of tumors formed by Yki OE + dPGC1 RNAi. The potential role of dPGC1 as a tumor suppressor is interesting because it highlights an emerging recognition of mitochondria in the aetiology of cancer. However, before publication, much of the data in this manuscript should be strengthened by a refinement in the methods/analysis and an increase in orthogonal approaches.

      We addressed concerns regarding RNAi efficiency and wing development by incorporating data from a dPGC1 mutant allele and using a ubiquitous driver for qPCR validation of transgene efficiency. We clarified the rationale for EM use. The manuscript now avoids overinterpretation of mitochondrial morphology and focuses on fusion-specific regulators. We also revised the narrative arc to maintain coherence and added loss-of-function models to support our conclusions.

      Below, we address each of the reviewer’s points in detail.

      Major comments:

      The authors indicate, for example in lines 127-28, that neither downregulating nor overexpressing dPGC1 affects wing size. However, the quantification in Fig. 1C shows a significant decrease in wing size following RNAi treatment. This decrease is modest, but it is nevertheless significant. It is worth pointing out, too, that the efficiency of the RNAi in Fig. S1C suggests that the conclusions drawn are premature. While a roughly 55% drop in mRNA levels may be statistically significant, it is unclear whether this drop in transcripts corresponds to a commensurate depletion of protein. Moreover, it is unclear, in this context, how much dPGC1 may indeed be necessary to drive a relatively normal program of mitochondrial biogenesis in wing development. To obtain a clear result, it is necessary to show significant depletion of the dPGC1 protein. (Ultimately, if it is the case that dPGC1 is unnecessary for wing development and function, a more coherent line of inquiry would be to find out the reason for this rather than to pivot the story to studying tumorigenesis in larva.)

      We agree that the interpretation of the RNAi efficiency data requires clarification.

      The qPCR analysis shown in former Fig. S1C was performed using wing discs from flies expressing UAS-dPGC1-RNAi under the control of the MS1096-Gal4 driver. However, as shown in current Fig. 1C, MS1096-Gal4 is not expressed uniformly across the wing disc. Some regions remain RFP-negative, indicating that the RNAi construct is not active in all cells. As a result, the measured mRNA levels likely underestimate the true knockdown efficiency. This is because the qPCR includes mRNA from both RNAi-expressing and non-expressing cells, diluting the apparent reduction in transcript levels.

      To address this limitation and more accurately assess RNAi efficiency, we repeated the qPCR analysis using a ubiquitous driver (actin-Gal4) to ensure uniform expression of the RNAi construct. Under these conditions, we observed a more substantial knockdown, with dPGC1 mRNA levels reduced to approximately 25% of control levels (this is shown in current Fig S2). This result indicates that the RNAi line is more effective than initially suggested by the MS1096-Gal4-based analysis.
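      The dilution argument above can be made concrete with a small calculation. This is only an illustrative sketch, not an analysis from the manuscript: it assumes that every cell contributes the same amount of dPGC1 mRNA before knockdown, and that RNAi-expressing cells under MS1096-Gal4 retain the same residual level (about 25% of control) as measured with actin-Gal4. The function names are hypothetical, and the 0.45 input is simply the "roughly 55% drop" quoted by the reviewer.

```python
def measured_fraction(expressing_fraction, residual_in_expressing):
    """Bulk mRNA level (relative to control) in a mosaic tissue.

    Assumption: expressing cells retain `residual_in_expressing` of the
    control transcript level; non-expressing cells retain all of it.
    """
    return (expressing_fraction * residual_in_expressing
            + (1.0 - expressing_fraction))


def implied_expressing_fraction(measured, residual_in_expressing):
    """Fraction of RNAi-expressing cells that would reconcile a bulk
    measurement with a given per-cell residual level (same assumptions)."""
    return (1.0 - measured) / (1.0 - residual_in_expressing)


# Illustrative inputs: bulk level of 0.45 (a roughly 55% drop) and a
# per-cell residual of 0.25 (the actin-Gal4 measurement).
fraction = implied_expressing_fraction(0.45, 0.25)
print(f"Implied fraction of RNAi-expressing cells: {fraction:.2f}")
```

      Under these assumptions, a bulk reduction of roughly 55% would be expected if about three quarters of the cells in the sample expressed the RNAi construct, which is consistent in spirit with the patchy MS1096-Gal4 expression described above.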

      To complement our RNAi-based analysis, we additionally used a mutant strain carrying a characterized allele of dPGC1 (dPGC1^1, also known as dPGC1^KG08646; see FlyBase: https://flybase.org/reports/FBal0148128). This genetically distinct approach allowed us to validate and strengthen our findings regarding dPGC1 function. Flies homozygous for this allele exhibited a modest but statistically significant reduction in both wing disc and adult wing size. These results support the conclusion that dPGC1 is required for normal wing growth and development. The new data are now included in Figure 1 and referenced in the main text (lines 144-153).

      Additionally, as suggested by the reviewer, we have revised the relevant section to maintain a coherent line of inquiry. The updated text can be found in lines 163–172.

      In Figure 3H-K, it is not clear why the authors used electron microscopy to evaluate mitochondrial morphology. The very good confocal images in Figure 3C-G show a clear change in mitochondrial morphology following the knockdown of Mfn, Opa1, and Miro. While it is clear from the electron micrographs in Figure 3H that the mitochondria are enlarged, it is not obvious that this increase in length is a result of increased mitochondrial fusion. Indeed, if the mean form factor were used to quantify the shape, it is likely that in both conditions, the value would be close to 1, indicating more of a round object, and it is not obvious whether there would be a difference between the Yki OE versus the Yki OE + dPGC1 RNAi. Therefore, from this data alone, it cannot be concluded that the Yki OE + dPGC1 RNAi condition leads to mitochondrial hyperfusion.
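      For readers unfamiliar with the form factor mentioned in this comment, the following sketch illustrates one commonly used definition, perimeter squared divided by (4π × area), which equals 1 for a circle and grows as an outline becomes more elongated or branched. The sketch is illustrative only: the ellipse stands in for a mitochondrial profile, its perimeter is approximated with Ramanujan's formula, and the function names and axis lengths are hypothetical rather than taken from the manuscript.

```python
import math


def form_factor(perimeter, area):
    """Form factor P^2 / (4 * pi * A): 1 for a circle, larger otherwise."""
    return perimeter ** 2 / (4.0 * math.pi * area)


def ellipse_perimeter(a, b):
    """Ramanujan's approximation for the perimeter of an ellipse
    with semi-axes a and b (exact when a == b)."""
    h = ((a - b) / (a + b)) ** 2
    return math.pi * (a + b) * (1.0 + 3.0 * h / (10.0 + math.sqrt(4.0 - 3.0 * h)))


def ellipse_form_factor(a, b):
    """Form factor of an idealized elliptical profile."""
    return form_factor(ellipse_perimeter(a, b), math.pi * a * b)


# A round profile versus a 4:1 elongated profile (arbitrary units).
print(ellipse_form_factor(1.0, 1.0))
print(ellipse_form_factor(4.0, 1.0))
```

      On this measure, a swollen but round mitochondrion scores close to 1 regardless of its size, which is why enlargement alone cannot distinguish hyperfusion from swelling.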

      Our rationale for including electron microscopy (EM) was to overcome specific limitations in imaging mitochondrial morphology within the main epithelium of the wing disc, where Yki-driven tumors arise. These tumors were generated using ap-Gal4, which drives expression specifically in the main epithelium and is not active in the peripodial membrane. This is an important distinction, as the peripodial membrane—used in Figures 3C–G—has a squamous architecture and larger cytoplasmic volume, making it ideal for high-resolution confocal imaging and for assessing the effects of manipulating dMfn, Opa1, and miro. However, because ap-Gal4 is not expressed in the peripodial membrane, this tissue could not be used to analyze mitochondrial morphology in the actual tumorous context.

      To directly evaluate mitochondria in the main epithelium, we employed EM, which provides the resolution necessary to visualize ultrastructural changes that are not easily captured by confocal microscopy in this densely packed tissue. While EM does not directly measure fusion events, it allowed us to detect changes in mitochondrial size and shape that support our broader findings.

      We acknowledge that mitochondrial enlargement alone does not definitively demonstrate hyperfusion. However, the EM data were interpreted alongside additional evidence: the upregulation of mitochondrial fusion genes (dMfn and Opa1) in Yki + dPGC1-RNAi tumors, and functional data showing that overexpression of these genes promotes fusion in the peripodial membrane. Together, these findings suggest that dPGC1 depletion enhances mitochondrial fusion in Yki-driven tumors.

      To further clarify this point, we also imaged mitochondria in the main epithelium using confocal microscopy. However, the resolution was considerably lower than that achieved with EM, limiting our ability to assess fine mitochondrial structures. We have prepared a figure for the reviewer (below), showing representative confocal images of wing discs from three genotypes: (A) ap-Gal4, UAS-GFP (control), (B) ap-Gal4, UAS-Yki, and (C) ap-Gal4, UAS-Yki, UAS-dPGC1-RNAi. We used anti-ATP-synthase (Abcam, ab14748, dilution 1:200) to label the mitochondria for this figure. Despite the lower resolution, mitochondria in the Yki + dPGC1-RNAi tumors appear elongated (yellow arrows) compared to those in the other conditions, consistent with the changes observed by EM. We believe this example illustrates the limitations of confocal imaging in this tissue and reinforces the need for EM to accurately assess mitochondrial morphology in the tumorous epithelium.

      While our EM analyses reveal mitochondrial enlargement in wing discs co-expressing Yki and PGC1-RNAi, we acknowledge that these structural features alone do not conclusively demonstrate mitochondrial hyperfusion. To address this, we have revised the manuscript to avoid overinterpreting the EM data and instead emphasize the functional relevance of mitochondrial fusion regulators such as dMfn and Opa1 in promoting tumor growth.

      Taken together, the EM analysis provides structural validation in the tumorous epithelium (Fig 4), while the confocal imaging and functional manipulation of fusion genes in the peripodial membrane offer mechanistic insight (Fig 3). This integrated approach strengthens the conclusion that PGC1 depletion in a Yki-overexpressing context promotes changes in mitochondrial morphology and contributes to tumorigenesis, independent of whether these changes reflect hyperfusion.

      Figure 4 refers to changes in mitochondrial fusion and fission in tumor formation; however, the authors do not attempt to alter mitochondrial fission factors, so it is not accurate to mention a role of mitochondrial fission in this context.

      As we did not directly manipulate fission-related factors in our experiments, we agree that it would be inappropriate to draw conclusions about the role of mitochondrial fission in this context. Our revised figure (current Fig 5) and accompanying text now focus exclusively on the effects of mitochondrial fusion and the genes directly involved in regulating this process.

      It must be noted, too, that the authors have not demonstrated that their genetic interventions have actually affected mitochondrial morphology in these experiments. As noted in the previous figure, the Yki OE + dPGC1 RNAi condition showed enlarged mitochondria, but not necessarily hyperfused organelles. Therefore, the downregulation of Mfn or Opa1 in this set of experiments may not necessarily have altered mitochondrial morphology. Perhaps suppression of Mfn or Opa1 would normalize the areas of these evidently swollen mitochondria, but this is unclear without images. Furthermore, it should be appreciated that both Opa1 and Mfn exhibit pleiotropic attributes - e.g., Opa1 not only regulates IMM fusion, but it also modulates the shape and tightness of cristae membranes, specialized sites of oxidative phosphorylation as well as sequestration of cytochrome c, the release of which influences apoptosis (Frezza et al., 2006). At least in mammalian cells, Mfn2 is thought to regulate contacts between mitochondria and endoplasmic reticulum (Naon et al., 2023), which may serve other functions than OMM fusion, such as stabilization of the MAM.

      To directly address this point, we performed EM to assess mitochondrial ultrastructure in Yki + dPGC1-RNAi wing disc tumors, with and without downregulation of dMfn1, the most upregulated mitochondrial fusion gene in this tumor context. In Yki + dPGC1-RNAi tumors, mitochondria appeared more elongated, consistent with increased fusion. Upon dMfn1 depletion, we observed a dramatic shift in mitochondrial morphology: mitochondria became larger and more rounded, with disrupted cristae and onion-like structures, indicative of compromised mitochondrial integrity and function (see current Fig. 4).

      As the reviewer rightly notes, these morphological changes are consistent with the pleiotropic roles of Mfn and Opa1, which extend beyond outer and inner membrane fusion to include regulation of cristae architecture and ER-mitochondria contacts (Frezza et al., 2006; Naon et al., 2023). We now discuss these broader roles in the revised manuscript (lines 493–497). Taken together, our EM and confocal analyses, combined with targeted genetic manipulations, provide evidence that mitochondrial morphology is indeed altered in response to dPGC1 depletion and fusion gene deregulation in the wing disc.

      Figure 5 highlights a connection between dysregulation of mitochondria and Cyclin E, which allows cells to prematurely enter S phase. The data presented here do not offer clarity on whether the enlargement of the tumors results from increased cellular proliferation and/or cell size. The role of the cell cycle adds a layer of complexity to these results, because it is thought that mitochondria undergo fragmentation during the cell cycle to promote an even distribution of the organelle population after mitosis (Taguchi et al., 2007); however, in this manuscript, the authors contend that the downregulation of dPGC1 is promoting mitochondrial hyperfusion. It is unclear how and whether cellular division and proliferation would proceed at an accelerated rate in a situation with mitochondrial hyperfusion.

      To address this point, we started by analyzing whether Yki + dPGC1-RNAi tumors exhibit increased proliferation compared to tumors expressing Yki alone. We quantified mitotic activity using the phospho-Histone H3 (PH3) marker of mitotic cells and observed a significant increase in PH3-positive cells in the Yki + dPGC1-RNAi condition. These results indicate an elevated proliferation rate in these tumors and are now presented in Fig 2O–Q. In the text, they can be found in lines 221-228.

      We agree with the reviewer that our findings challenge the conventional view that mitochondrial fragmentation is a prerequisite for mitosis, as we observe increased expression of genes promoting mitochondrial fusion in the context of dPGC1 downregulation alongside signs of accelerated cell cycle entry. It is important to note that we also show that the levels of the oncogene Cyclin E, a key driver of cell cycle progression and S-phase entry, were elevated in Yki + dPGC1-RNAi tumors compared to those expressing Yki alone, suggesting that the increased proliferation observed is at least in part driven by enhanced cell cycle activity. To further probe Cyclin E’s role, we used the CycE-05306 heterozygous mutant allele, which reduces Cyclin E levels by ~50% without affecting normal development. Notably, this partial reduction strongly suppressed tumor growth in the Yki + dPGC1-RNAi background (Fig 6), underscoring Cyclin E’s functional importance in supporting oncogenic growth in this context.

      These findings support the notion that the changes in expression of mitochondrial morphology genes induced by dPGC1 depletion do not impair cell division but rather coincide with its acceleration.

      Minor comments:

      Lines 69-72 contrast the roles of PGC1α and β. It is not clear whether the comparison is of their respective roles in cancer or in normal physiology. In either case, it is important to note that PGC1β has been shown to drive mitochondrial fusion as well as biogenesis through its control of MFN2, among other factors (Liesa et al., 2008).

      In response, we have clarified the comparison between PGC1α and PGC1β in the introduction to specify that it refers to their roles in cancer. Additionally, we now acknowledge that PGC1β has been shown to promote mitochondrial biogenesis and fusion, notably through the regulation of MFN2, as demonstrated by Liesa et al. (2008). This reference has been added to provide a more balanced and accurate representation of PGC1β’s functions. In the text it can be found in lines 77-81.

      Although this study focuses on PGC1, the authors do not seem to cite the original literature from the Spiegelman lab.

      In response to the reviewer’s comment, we have added a new section in the introduction that cites key foundational studies from the Spiegelman lab. This addition can be found in the introduction in lines 68-73.

      There are 10-20 grammatical errors throughout the text.

      We apologize for this. We have carefully revised the text, and we are very confident those errors have been corrected.

      **Referee Cross-commenting**

      There is agreement among the referees that the potential role of PGC1 as a tumor suppressor is interesting and significant. However, various aspects of this work require attention prior to publication. For example, there needs to be a complete knock down of PGC1 to come to any conclusion as to its role in wing development. The methods for analyzing mitochondrial morphology need to be clarified and be consistent with standards in the field of mitochondrial dynamics. Also, the authors need to quantify their Western blots to obtain accurate assessments of protein levels. Generally, the study relies too heavily on overexpression experiments; understanding the potential role of mitochondria in regulating the Hippo pathway should include various knockdown and/or knockout models.

      Reviewer #1 (Significance (Required)):

      Overall, the authors show an interesting dampening effect of dPGC1 on growth of Yki-driven tumors. This data could be relevant for elucidating how dysregulation of the Hippo signalling pathway can underlie tumorigenesis.

      The narrative arc of the study, however, appears to lack a focused line of inquiry. Figure 1 highlights an attempt to modulate Drosophila wing size and/or structure by downregulating dPGC1, but to no effect. However, examination of the efficiency of the RNAi revealed that the transcripts were still present in significant quantities, so the conclusion that dPGC1 is dispensable for wing formation is premature. To have clarity on this point, it would be necessary to completely knock down the gene, preferably by showing a total loss of protein. This should be feasible for the authors, since they showed Western blotting in Figure 5A. In any event, it seems that this negative data led the authors to study the Hippo pathway in the larval stage. This transition from Figure 1 to 2 seemed somewhat arbitrary and leads to a rather disjointed sense of the main line of inquiry around dPGC1.

      It is important to note, too, that the authors highlight a role of mitochondrial dynamics in the pathway of Yki-driven tumor formation; however, they only directly evaluate mitochondrial dynamics in this context in a single assay, namely, Figure 3H-K, and this quantification is likely inaccurate because the mitochondria in the Yki OE + dPGC1 RNAi condition seem to be substantially enlarged, circular structures. It is critical to keep in mind that mitochondrial enlargement does not necessarily stem from hyperfusion. It could come from a decrease in the activity of Drp1 or result from an imbalance between mitochondrial biogenesis and mitophagy.

      As noted in our responses above, we have addressed these concerns by clarifying the limitations of our mitochondrial morphology analysis. Additionally, we have expanded the discussion (lines 498-504) to explicitly acknowledge that mitochondrial enlargement does not necessarily indicate hyperfusion. In that paragraph, we consider alternative explanations such as reduced fission or imbalances in mitochondrial biogenesis and mitophagy, and we outline the need for future studies using dynamic assays and additional markers to more precisely dissect mitochondrial remodeling in Yki-driven tumors.

      A marked limitation of this study is the overuse of rather artificial manipulations of transcriptional regulatory pathways. The study would benefit a lot from investigation of the loss of function of components of the Hippo pathway rather than just OE of Yki.

      We performed additional experiments using Warts (Wts) mutant clones to assess the role of dPGC1 in a loss-of-function context within the Hippo pathway. While our initial analyses were based on Yki overexpression, which allowed us to robustly probe the interaction between Yki and dPGC1, we agree that this approach may not fully reflect physiological conditions. By generating Wts mutant clones, which endogenously activate Yki through loss of upstream inhibition, we were able to evaluate the impact of dPGC1 depletion in a more physiologically relevant setting. These new results confirm and extend our previous findings, showing that dPGC1 limits tissue overgrowth even when Yki is activated through loss of Wts, thereby strengthening the biological relevance of our conclusions.

      These results are presented in Fig 2F-I. In the text, those results are presented in lines 181-189.

      My expertise is in mitochondrial biology, with specialization in super-resolution imaging, mitochondrial dynamics and membrane architecture. I have also worked in the interface between mitochondrial physiology and cancer. With this perspective, I think that the authors uncover a potentially interesting role of PGC1 as a tumor suppressor.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Summary: In this manuscript the authors investigate the role of the mitochondrial regulatory transcription factor dPGC1 in tissue growth and oncogenic transformation. They show that dPGC1 limits hyperplasia mediated by overexpression of Yki in the Drosophila wing disc, while having no effect on normal growth. dPGC1 depletion in discs overexpressing Yki results in neoplastic overgrowth and hyperfused mitochondria, which was dependent on the increased expression of genes involved in promoting mitochondrial fusion. Additionally, the authors show that dPGC1 limits CycE levels post-transcriptionally in Yki tumors.

      In the revised version of our manuscript, we have clarified the relationship between our findings and prior work by Nagaraj et al., including new experiments that demonstrate the specificity of dPGC1’s role in Yki-driven growth. Specifically, we show that dPGC1 depletion does not enhance tissue overgrowth in EGFR or InR contexts, nor does it affect Yki expression or activity. Furthermore, we tested dPGC1 overexpression in Yki-overexpressing tissues and observed no significant changes in growth or mitochondrial fusion gene expression. Additional controls confirmed that Cyclin E upregulation is specific to the Yki + dPGC1 depletion condition, reinforcing the context-dependent nature of our findings.

      Each of the reviewer’s comments is addressed below.

      Major comments:

      1) The authors mention several times in passing in the results a manuscript from the Banerjee lab (Nagaraj et al 2012), which shows that many of the genes the authors of the present manuscript show are upregulated upon Yki overexpression + dPGC1-RNAi compared with Yki overexpression alone are in fact upregulated upon Yki overexpression alone compared with control (dMfn/marf, opa1, miro - while interestingly dPGC1 itself is not affected). Nagaraj et al further show that Yki-overexpressing discs have longer mitochondria suggesting increased fusion even in the absence of dPGC1 depletion. The findings from Nagaraj et al should be mentioned explicitly in the introduction and the relationship between this manuscript and the present work clearly outlined in the discussion.

      In the revised manuscript, we have now explicitly referenced the findings of Nagaraj et al. (2012) in the Introduction (lines 106-118), Results (lines 355-360) and Discussion (lines 466-468) sections.

      In the revised Introduction, we summarize their key observations that Yki overexpression alone upregulates mitochondrial fusion genes such as dMfn and Opa1, and leads to mitochondrial elongation, while not affecting dPGC1 expression.

      In the revised Results section, we mention that, building on that work, our study demonstrates that dPGC1 depletion further amplifies this effect, leading to enhanced mitochondrial elongation and tumor growth.

      In the revised Discussion, we now explicitly reference the findings by Nagaraj et al. (2012), which demonstrated that Yki overexpression promotes mitochondrial fusion and upregulates key fusion genes. We build upon this work by showing that dPGC1 depletion in a Yki-overexpressing background further enhances mitochondrial fusion gene expression and tumor growth. This supports a model in which dPGC1 acts as a safeguard against Yki-induced mitochondrial remodeling and oncogenesis, reinforcing its role as a context-dependent tumor suppressor.

      Importantly, we show that this effect is context-dependent and not observed in otherwise normal tissues, highlighting a sensitized mitochondrial response to Yki activation when dPGC1 is lost. These additions help delineate the novel contribution of our study in identifying dPGC1 as a critical modulator of mitochondrial dynamics and tumorigenesis downstream of Yki.

      2) Given that Yki overexpression alone induces mitochondrial fusion and that dMfn/marf and opa1 depletion suppresses Yki-induced overgrowth (Nagaraj et al), does dPGC1 overexpression also suppress Yki-induced overgrowth?

      If so, is this correlated with reduction in dMfn/marf and opa1 compared with Yki overexpression alone?

      In response, we performed additional experiments to assess whether dPGC1 overexpression influences Yki-driven overgrowth. We also analyzed the expression of mitochondrial fusion genes (dMfn and Opa1) in this context. As shown in new Fig. S8, dPGC1 overexpression in Yki-overexpressing wing discs did not significantly affect tissue growth, nor did it alter the mRNA levels of key fusion regulators, dMfn and Opa1. These findings suggest that the transcriptional upregulation of mitochondrial fusion genes observed upon dPGC1 depletion is not a general consequence of altered dPGC1 levels, but rather a specific response that emerges in the context of Yki activation. We now present and discuss these results in the revised manuscript (lines 278-285), highlighting the sensitized nature of mitochondrial remodeling in an oncogenic environment driven by Yki signaling.

      3) One important question raised by this study is: how specific is the effect of dPGC1 depletion to Yki-driven overgrowth? As Yki-driven overgrowths already have increased mitochondrial length, it is possible that Yki-expressing cells are already sensitised to the effects of dPGC1 depletion. Interestingly, Nagaraj et al show that mitochondrial morphology is not affected upon EGFR activation (hyperplasia) or upon scrib and avl depletion (neoplasia). The authors should therefore test if dPGC1 depletion can potentiate the growth of other hyperplasia drivers such as activated EGFR and InR in the wing disc.

      We tested whether the growth-enhancing effect of dPGC1 depletion was specific to Yki-driven overgrowth or could also potentiate tissue growth in other oncogenic contexts. Specifically, we downregulated dPGC1 in wing discs overexpressing either EGFR or InR. In both cases, we did not observe any enhancement of tissue overgrowth upon dPGC1 depletion, in contrast to what we observed in Yki-overexpressing discs. These results suggest that the sensitivity to dPGC1 depletion is specific to Yki-driven overgrowth and is not a general feature of hyperplastic growth induced by other oncogenes.

      These results are shown in Fig S4 and in lines 195-202.

      4) There are a few simple control experiments the authors should provide to clarify the relationship between Yki and dPGC1:

      • Are Yki levels affected by dPGC1 depletion?

      To address the potential regulation of Yki by dPGC1, we performed quantitative PCR (qPCR) analysis to measure the expression levels of yki and its well-established transcriptional targets (Cyclin E, Diap1, and bantam) in wing discs depleted of dPGC1. As shown in Fig. S3, we did not detect significant changes in the transcript levels of yki or its target genes. These results indicate that dPGC1 does not strongly influence Yki expression or activity, so the enhanced phenotype observed upon dPGC1 depletion is unlikely to be driven by increased Yki signaling. These new results are presented in lines 190-194.

      • Does dPGC1 knockdown alone modify the expression of the genes tested in Fig.3A? In other words, is this upregulation specific of the Yki-overexpression context?

      We have conducted this analysis, and the results are now presented in new Fig S7. While the trend is similar to that observed in tumors with both Yki overexpression and dPGC1 depletion, the magnitude of change is smaller than in the context of Yki overexpression. This is described in the text in lines 273-277.

      • Does dPGC1 knockdown also stabilise CycE in the absence of Yki overexpression or does the stabilisation of CycE occur only in Yki tumors?

      To address this, we examined Cyclin E levels in wing imaginal discs mutant for dPGC1 alone. Our analysis did not reveal any detectable changes in Cyclin E levels under these conditions. These findings suggest that the upregulation of Cyclin E is not a general consequence of dPGC1 loss, but rather a feature specific to the context of Yki overactivation. The corresponding data are now included in Fig S14 of the revised manuscript. In the text, it can be found in lines 442-448.

      5) Figure 3C-G: it is not clear how the authors can quantify the length of 3D structures like mitochondria from 2D TEM images (unless they have done volume reconstruction from consecutive sections) and no details are provided in the methods. The quantification of mitochondrial length has to be performed rigorously as it is a key part of the paper.

      We agree that TEM provides only 2D profiles of 3D mitochondrial structures, and that this does not allow for precise volumetric reconstruction. In our study, we measured the longest axis of mitochondria visible in thin TEM sections, which is a commonly used 2D proxy for mitochondrial length in the literature (e.g., PMID: 36367943 and PMID: 38637532). To avoid misunderstandings, we have clarified in the Materials and Methods section that the reported values represent apparent mitochondrial length in 2D sections, not true 3D length. To enhance the accuracy of these estimates, we measured more than three tissues per genotype, multiple regions per tissue, several cells per region, and various fields of view per cell.

      Minor Comments:

      1) Line 51: "Mitochondria are highly dynamics organelles." should be "Mitochondria are highly dynamic organelles."

      We have corrected that mistake. Thanks!

      2) Introduction: the authors should summarise the known physiological functions of PGC1α in order to put their findings in context.

      We have added a section in the introduction (lines 66-81) summarizing the known physiological functions of PGC1α.

      3) Lines 121-3: "...depletion of dPGC1...did not have a major impact on adult wing size and shape (Fig 1B, C)." There is a small but statistically significant difference so the authors should state this in the text.

      We have revised the text to acknowledge that dPGC1 depletion leads to a modest but statistically significant reduction in wing size. In addition to the original analysis, we have now included further experiments to strengthen this point. Specifically, we analyzed wings from flies homozygous for the dPGC1^1 allele (also known as dPGC1^KG08646; see FlyBase: https://flybase.org/reports/FBal0148128) and confirmed a small but significant reduction in both wing disc and adult wing size compared to controls (this can be found in Fig. 1 and Fig. S1). These results support the conclusion that, although dPGC1 is dispensable for viability and gross morphology, it contributes to normal wing growth. These new results can be found in lines 144-153.

      4) Figure 5A (Cyclin E western blot): the authors should show molecular weight markers.

      In the revised version of our manuscript, we have included the molecular weight markers as indicated by the reviewer. These can be found in Fig S12.

      Reviewer #2 (Significance (Required)):

      The manuscript by Sew et al builds on the previous work by Nagaraj et al to explore the role of mitochondrial function in tumors driven by disruption of the Hippo pathway. In particular, the authors identify dPGC1 as a transcription factor that limits Yki-driven mitochondrial fusion and tissue growth. Interestingly, they further show that Yki/PGC1-depleted tumors are highly sensitive to Cyclin E levels, due to post-transcriptional Cyclin E increase. These results further our knowledge of how Yki drives growth and how mitochondria participate in oncogenic transformation. With appropriate revision as outlined above (for example exploring whether the mechanism proposed is Yki-specific), the manuscript will be of broad interest to developmental and cancer biologists.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      The manuscript presents compelling evidence that dPGC1 acts as a context-dependent tumor suppressor in Drosophila by modulating mitochondrial dynamics and limiting Yorkie (Yki)-induced oncogenic growth. By leveraging the Drosophila wing imaginal disc as a model, the authors investigate how dPGC1 depletion exacerbates Yki-driven tissue overgrowth, mitochondrial hyperfusion, Cyclin E upregulation, and DNA damage, leading to tumorigenesis. The study provides valuable insights into the interplay between mitochondrial dynamics and cancer, with implications for understanding metabolic regulation in oncogenesis. While the findings are significant and well-aligned with the field, certain aspects of the experimental design, data presentation, and mechanistic insights require further attention to enhance clarity, reproducibility, and impact. Below, I outline my major concerns and recommendations.

      We addressed concerns about RNAi efficiency and protein-level validation with new qPCR data and mutant analysis. We provided EM and confocal evidence of mitochondrial changes. We clarified non-autonomous effects and quantified Mmp1 and F-actin and added data on miro and Opa1 manipulations. Cyclin E quantification was expanded using multiple Western replicates and a validated mutant allele, and we included new data on mitochondrial membrane potential to assess functional consequences.

      Our detailed responses to each point raised by the reviewer are provided below.

      Major Points

      1. One point is the knock-down efficiency of dPGC1 on the mRNA level, which is between 30 to >50% (Fig. S1C). This is not too strong, so the question arises how severely the protein levels are affected. If possible, an antibody staining with quantification should be performed. From these data it cannot be concluded that dPGC1 is not required for normal development; half the dose could be sufficient. What do wings look like when the ap-GAL4 driver is used for dPGC1 knock-down, as this is the driver used in the subsequent experiments?

      Reviewer 1 also raised concerns about the potential inefficiency of the RNAi treatment in revealing a function during normal wing growth. We agree with both reviewers that the interpretation of the RNAi efficiency data requires clarification.

      The qPCR analysis shown in former Fig. S1C was performed using wing discs from flies expressing UAS-dPGC1-RNAi under the control of the MS1096-Gal4 driver. However, as shown in current Fig. 1C, MS1096-Gal4 is not expressed uniformly across the wing disc. Some regions remain RFP-negative, indicating that the RNAi construct is not active in all cells. As a result, the measured mRNA levels likely underestimate the true knockdown efficiency. This is because the qPCR includes mRNA from both RNAi-expressing and non-expressing cells, diluting the apparent reduction in transcript levels.

      To address this limitation and more accurately assess RNAi efficiency, we repeated the qPCR analysis using a ubiquitous driver (actin-Gal4) to ensure uniform expression of the RNAi construct. Under these conditions, we observed a more substantial knockdown, with dPGC1 mRNA levels reduced to approximately 25% of control levels (this is shown in current Fig S2). This result indicates that the RNAi line is more effective than initially suggested by the MS1096-Gal4-based analysis.
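      As a rough illustration of the dilution argument above (a minimal sketch, not part of the authors' analysis), bulk qPCR on a mosaic tissue can be modelled as a weighted average of RNAi-expressing and non-expressing cells. The function names are hypothetical, and the sketch assumes that every cell contributes equally to the mRNA pool and that knockdown inside expressing cells matches the roughly 25% remaining level reported with actin-Gal4.

```python
def apparent_remaining(fraction_expressing: float, remaining_in_expressing: float) -> float:
    """Bulk mRNA level relative to control for a mosaic tissue.

    Expressing cells keep `remaining_in_expressing` of control mRNA;
    non-expressing cells keep all of it (assumed, equal contribution per cell).
    """
    return fraction_expressing * remaining_in_expressing + (1.0 - fraction_expressing)


def expressing_fraction(apparent: float, remaining_in_expressing: float) -> float:
    """Invert the mixing model: fraction of cells that must express the RNAi
    to explain a given bulk measurement."""
    return (1.0 - apparent) / (1.0 - remaining_in_expressing)


if __name__ == "__main__":
    # Apparent knockdowns of 30% and 50% correspond to 0.70 and 0.50 remaining.
    for apparent in (0.70, 0.50):
        f = expressing_fraction(apparent, remaining_in_expressing=0.25)
        print(f"apparent remaining {apparent:.2f} -> expressing fraction {f:.2f}")
```

      Under these assumptions, a bulk knockdown of 30 to 50% would be compatible with a strong knockdown confined to a subset of the disc, which is the qualitative point made in the response; the actual fraction of MS1096-Gal4-positive cells is not stated in the source.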

      To complement our RNAi-based analysis, we additionally used a mutant strain carrying a characterized allele of dPGC1 (dPGC1^1, also known as dPGC1^KG08646; see FlyBase: https://flybase.org/reports/FBal0148128). This genetically distinct approach allowed us to validate and strengthen our findings regarding dPGC1 function. Flies homozygous for this allele exhibited a modest but statistically significant reduction in both wing disc and adult wing size. These results support the conclusion that dPGC1 is required for normal wing growth and development. The new data are now included in Figure 1 and referenced in the main text (lines 144-151).

      Unfortunately, we cannot perform antibody staining due to the unavailability of antibodies against dPGC1.

      What does the wing disc look like when dPGC1 is overexpressed together with Yki?

      In response, we performed additional experiments to assess whether dPGC1 overexpression influences Yki-driven overgrowth. We also analyzed the expression of mitochondrial fusion genes (dMfn and Opa1) in this context. As shown in new Fig. S8, dPGC1 overexpression in Yki-overexpressing wing discs did not significantly affect tissue growth, nor did it alter the mRNA levels of key fusion regulators, dMfn and Opa1. These findings suggest that the transcriptional upregulation of mitochondrial fusion genes observed upon dPGC1 depletion is not a general consequence of altered dPGC1 levels, but rather a specific response that emerges in the context of Yki activation. We now present and discuss these results in the revised manuscript (lines 278-285), highlighting the sensitized nature of mitochondrial remodeling in an oncogenic environment driven by Yki signaling.

      In Fig 2D (but also in Fig. 2C) not only cells in the dorsal but also in the ventral compartment seem to overproliferate. Either this is a misconception or it is a non-autonomous effect from interfering with Yki and dPGC1 in the ventral compartment. In either case, this has to be clarified.

      Ventral cells are not labelled by GFP. Fig 3D shows a tumor in which GFP-negative cells are not present, suggesting that they are not overproliferating but rather being eliminated. This phenomenon is consistent with cell competition, a well-characterized process in which transformed or tumorigenic cells outcompete and eliminate neighboring wild-type cells. We have previously described this behavior in wing disc tumors (PMID: 26853367; DOI: 10.1016/j.cub.2015.12.042), and it likely contributes to the expansion of the tumor mass by removing surrounding normal tissue also in this context.

      In Fig. 2F-H quantification of Mmp1 and F-actin is missing. Mmp1 is a JNK target, so the authors could do in addition an anti-phospho JNK antibody staining.

      In response, we have performed those quantifications. They are now included in Fig 2M, N.

      In Fig. 3: what does the mitochondrial network look like in the wing disc peripodial epithelium using the Gug>Yki+dPGC1 genotype? Is it similar to Gug>dMfn or Gug>miro?

      We attempted to perform this analysis; however, Yki overexpression under the control of Gug-GAL4 resulted in larval lethality, likely due to GAL4 activity in essential tissues such as the central nervous system. As a result, we were only able to induce transgene expression for 24 hours before lethality occurred.

      At this early point, no detectable changes in mitochondrial morphology were observed in the peripodial membrane, likely because the duration of transgene expression was insufficient to elicit phenotypic alterations in this specific tissue. Therefore, while we aimed to compare this genotype to Gug>dMfn and Gug>miro, the technical limitations prevented a conclusive analysis.

      We have prepared a figure for the reviewer (below) with representative confocal images of wing discs showing mito-GFP and DAPI in the three genotypes indicated.

      In Fig. 3I: what is really the mitochondrion? It would be good to outline the region(s) that was/were measured.

      To improve clarity, we have repeated the electron microscopy (EM) analysis and now provide representative images that more clearly illustrate mitochondrial morphology in the different genotypes analyzed. These updated images presented in Fig 4 better highlight the structural alterations observed upon genetic manipulation and help clarify the basis for our morphological assessments.

      We have extended our analysis and have assessed mitochondrial ultrastructure in Yki + dPGC1-RNAi wing disc tumors, with and without dMfn1 downregulation—the most upregulated mitochondrial fusion gene in this tumor context. In Yki + dPGC1-RNAi tumors, mitochondria appeared more elongated, consistent with increased fusion. Upon dMfn1 depletion, we observed a dramatic shift in mitochondrial morphology: mitochondria became larger and more rounded, with disrupted cristae and onion-like structures, indicative of compromised mitochondrial integrity and function (see new Fig 4).

      A quantification of RNAi and overexpression efficiencies of the different transgenes in Fig. 3 is required.

      To assess the efficiency of RNAi-mediated knockdown and transgene overexpression, we performed quantitative PCR (qPCR) using the ubiquitous Actin-Gal4 driver. While we acknowledge that this driver does not replicate the spatial specificity of the peripodial membrane Gal4 driver used in the experiments shown in Figure 3 (Gug-Gal4), the latter targets a very limited number of cells within the imaginal disc, making reliable qPCR quantification unfeasible.

      Using Actin-Gal4 allows us to obtain a relative and informative measure of transgene efficiency across the different constructs. These data confirm effective knockdown and overexpression of the relevant genes and are now included in Figure S2.

      In Fig. 4: what is the phenotype when miro is over-expressed in combination with Yki? Or when it is knocked down in the ap>Yki-dPGC1 background? This was the gene tested in Fig. 3 with a clear mitochondrial phenotype

      To address whether miro contributes to Yki-mediated tumor growth, we performed the requested experiments and now include the results in the revised manuscript (see updated Results section, lines 374-377, and new Fig. S11).

      Our data show that overexpression of miro in combination with Yki does not lead to a significant increase in tissue growth or tumor-like phenotypes, in contrast to the effects observed with dMfn or Opa1 overexpression. Similarly, knockdown of miro in the ap>Yki-dPGC1-RNAi background did not suppress tumor growth, indicating that miro is not required for the enhanced proliferation observed in this context.

      These findings suggest that, although miro influences mitochondrial morphology in normal wing discs (as shown in Fig. 3), its role in tumorigenesis is distinct from that of dMfn and Opa1. We have revised the manuscript to clarify the gene-specific contributions of mitochondrial fusion regulators to Yki-driven tumorigenesis. This distinction underscores the complexity of mitochondrial dynamics and highlights that not all fusion-related genes exert the same functional impact in oncogenic settings.

      What does the mitochondrial morphology in the wing disc peripodial epithelium look like in Gug>Opa1RNAi or Gug>Opa1 discs?

      To assess the impact of Opa1 on mitochondrial morphology in the peripodial epithelium of the wing disc, we used the Gug-GAL4 driver to either overexpress or knock down Opa1. Our analysis revealed that Opa1 overexpression led to slightly elongated mitochondria, but did not result in extensive network formation, suggesting a modest enhancement of inner membrane fusion. In contrast, Opa1 knockdown caused clear mitochondrial fragmentation, closely resembling the phenotype observed upon dMfn depletion. These results shown in Fig 3 are consistent with the distinct roles of Opa1 and dMfn in regulating mitochondrial fusion: Opa1 primarily modulates inner membrane fusion and cristae architecture, while dMfn drives outer membrane fusion and network connectivity.

      The corresponding data are presented in Figure 3F, G, and quantified in Figure S9, alongside experiments manipulating other genes involved in mitochondrial dynamics.

      Why have the authors switched between the ap>Yki+dPGCRNAi and the ap>Yki+dPGC1shRNA lines? It would be important to have this series of experiments in the same backgrounds, as KD efficiencies are different (Fig. S1C).

      The primary reason for switching between the dPGC1-RNAi and dPGC1-shRNA lines was practical: the chromosomal insertion sites of the transgenes made certain genetic combinations more feasible with one line over the other. This flexibility significantly facilitated our experimental design and analysis.

      To address concerns regarding knockdown efficiency, we performed a comparative analysis using the ubiquitous actin-GAL4 driver, rather than MS1096-GAL4, which exhibits patchy and dynamic expression in the wing imaginal disc. This allowed us to obtain a more consistent and interpretable measure of mRNA downregulation for both transgenes. Our results show that both lines achieve comparable levels of knockdown, as shown in Figure S2.

      Fig. 5A: proper quantification of Western Blot signals is required. I do not agree that Cyclin E protein levels are elevated in ap>Yki or ap>Yki+dPGC1 discs. Even at the mRNA levels the increase in expression is rather weak. From these results nothing can be concluded.

      We have repeated the Western blot analysis using seven independent membranes to ensure robust quantification of Cyclin E levels in ap>Yki and ap>Yki+dPGC1-RNAi wing discs (Fig 6).

      Although the increase in Cyclin E protein levels is subtle, it is consistent across replicates and statistically significant. We have now included the quantification of these Western blot signals in the revised Figure 6, which supports the conclusion that Cyclin E levels are elevated in ap>Yki+dPGC1 discs.

      We hope this additional data addresses the reviewer’s concern and strengthens the interpretation of our results.

      Knock-down efficiencies for dap and CycE need to be quantified (Fig. 5H-N). Although the rescue experiment with CycE knock-down is convincing from the phenotype, it is nonetheless puzzling, as CycE is according to Fig. 5A+B hardly upregulated. An independent CycE RNAi line would be useful.

      We have quantified the knockdown efficiency of the dap-RNAi line, and the results are included in Figure S13.

      Regarding Cyclin E, we would like to clarify that we did not use an RNAi line in this experiment. Instead, we employed the CycE-05306 mutant allele, a loss-of-function allele of the Drosophila melanogaster Cyclin E gene. This allele carries a P-element insertion in the first intron of the CycE gene, which disrupts normal transcription and reduces Cyclin E expression. In a heterozygous background, as used in our experiments, CycE-05306/+ is expected to reduce Cyclin E levels by approximately 50%, which is typically sufficient to observe genetic interactions or sensitized phenotypes without affecting normal development. This makes it a valuable tool for studying gene dosage effects, particularly in tumor models where Cyclin E activity may be rate-limiting.

      Importantly, this partial reduction does not impair normal tissue growth, but it strongly limits tumor growth in the context of Yki overexpression combined with dPGC1 downregulation, as shown in Figure 6. This selective sensitivity highlights the functional importance of Cyclin E in supporting oncogenic growth driven by Yki and dPGC1 depletion. We believe this provides compelling evidence for Cyclin E’s role in this tumor model.

      Reviewer #3 (Significance (Required)):

      Strengths and Limitations of the Study

      Strengths

      Innovative Focus on Mitochondrial Dynamics and Oncogenesis: The study provides compelling evidence linking mitochondrial dynamics, particularly hyperfusion, to tumorigenesis in Drosophila. The identification of dPGC1 as a context-dependent tumor suppressor adds novel insights into the interplay between metabolism and oncogenesis.

      Comprehensive Use of Drosophila as a Model System: The study leverages the genetic tractability of Drosophila, allowing precise manipulation of mitochondrial regulators and signaling pathways. The use of wing imaginal discs as a model for tumor growth is well-established and appropriate.

      Integration of Morphological and Genetic Data: The manuscript combines confocal imaging, electron microscopy, and genetic tools to demonstrate the role of dPGC1 in regulating mitochondrial dynamics, Cyclin E levels, and tissue overgrowth.

      Relevance to Cancer Biology: The findings address key hallmarks of cancer, including deregulated metabolism, genomic instability, and cell cycle misregulation. The study's exploration of these processes in a simple model organism provides a strong basis for translating findings to mammalian systems.

      Limitations

      Validation of RNAi and Overexpression Efficiency: The knockdown efficiency of dPGC1 on the mRNA level is only moderate (30-50%), and protein-level validation is missing. Without this, the study cannot conclusively demonstrate the role of dPGC1 in normal development or tumorigenesis.

      Incomplete Mechanistic Insights: The manuscript identifies Cyclin E as a potential driver of tumor growth but does not adequately explore how mitochondrial hyperfusion leads to Cyclin E regulation (e.g., post-transcriptional mechanisms or protein stability).

      Inconsistencies in Experimental Backgrounds: The study uses different RNAi/shRNA lines and driver combinations inconsistently across experiments, making it difficult to compare results directly. This variability undermines the robustness of the conclusions.

      Limited Functional Analysis of Mitochondria: While mitochondrial morphology is well-characterized, functional assays (e.g., membrane potential or ATP production) are missing. These would confirm the impact of hyperfusion on cellular energetics and oncogenesis.

      In the revised manuscript, we have addressed each of the concerns raised.

      In addition, we have included new experiments to assess mitochondrial functionality in tumors co-expressing Yki and dPGC1-RNAi. Specifically, we used TMRE staining to evaluate the mitochondrial membrane potential (MMP), a key indicator of mitochondrial integrity and oxidative phosphorylation capacity. Our analysis revealed no significant differences in MMP between Yki tumors and Yki + dPGC1-RNAi tumors, suggesting that mitochondrial membrane potential is preserved despite the observed morphological abnormalities. These results are shown in Fig S6 and discussed in the text (lines 233-243).

      Contribution to Existing Literature

      The study makes a significant contribution to the growing body of literature on the metabolic regulation of cancer by identifying dPGC1 as a tumor suppressor modulating mitochondrial dynamics. Previous work has established the dual roles of mammalian PGC1α in promoting or suppressing cancer depending on context. This study adds depth by demonstrating similar context-dependent effects in a simpler model organism, facilitating further exploration of the molecular mechanisms involved.

      By linking mitochondrial fusion, Yki signaling, and Cyclin E regulation, the manuscript aligns with and expands upon research on Hippo pathway regulation, cancer metabolism, and mitochondrial biology. The findings highlight the importance of integrating metabolic and signaling networks in understanding oncogenesis.

      Community Selection

      The current form of the manuscript is best suited for a specialized audience, particularly mitochondrial biologists, Drosophila researchers, and Hippo pathway specialists. To engage a broader community, additional work linking these findings to mammalian models or human cancer biology would be necessary.

  3. drive.google.com
    1. In this section, we review research that suggests that, whereas massing practice might promote rapid performance gains during training, distributing practice facilitates long-term retention of that skill.

      Although massed practice may be useful for understanding material within a limited duration (short-term memory), retention is not as effective with this method as with distributed practice. This is why cramming material before a test is not the most effective way to actually retain the information learned from that short studying period, whereas distributed practice allows for breaks between practice sessions that strengthen retention and retrieval (practice makes perfect!). We should think of studying strategies that discourage cramming in learning.

    1. or by the teacher with input from students

      I think this is something I'll be taking with me into my future classrooms because it isn't something I ever thought about before we talked about it in class. Kids and teenagers are more apt to do what their friends are doing or what their friends think is best, and if we come up with rules and norms as an entire class which includes aforementioned friends they'll feel more likely to listen and abide by them. When the rules are just from the teacher some of the more rebellious teens could feel the need to push the limit some. It could also help by bringing insight into what they may or may not understand or already abide by at home.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Manuscript number: RC- 2025-03073

      Corresponding author(s): Shaul Yogev

      1. General Statements [optional]

      We kindly thank our reviewers for their enthusiasm, thoughtful feedback, and constructive suggestions on how to strengthen our manuscript. Below, we provide a point-by-point response to reviewer comments and outline the experiments we will do to address every concern that has been raised.

      2. Description of the planned revisions


      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      This interesting study uses an unbiased genetic screen in C. elegans to identify SAX-1/NDR kinase as a regulator of dendritic branch elimination. Loss of SAX-1 results in an excess branching phenotype that is striking and highly penetrant. The authors identify several additional regulators of branch elimination (SAX-2, MOB-1, RABI-1, RAB-11.2) by using a candidate genetic screen aimed at factors that interact physically or genetically with SAX-1. They propose that SAX-1 acts by promoting membrane retrieval based on the nature of these interactors and the results of an imaging-based in vivo assay for endocytic puncta.

      Major comments.

      1. My biggest concern is that the phenotypes are only observed in temperature-sensitive dauer-constitutive mutant backgrounds, and not in wild-type dauers. That is, wild-type animals exiting dauer do not require SAX-1 for dendrite elimination. While this does not undermine the importance of the results, it does require more explanation. The authors write that "the requirement for sax-1... relies on specific physiological states of the dauer stage," but I do not understand what this means. Are they saying that daf-7 and daf-2 dauers are in a different "physiological state" than wild-type dauers? In what way? What is the evidence for this? A more rigorous explanation is needed.

      We agree that this is puzzling, and we thank the reviewer for recognizing that this does not undermine the importance of the results. There is ample evidence that daf-2 and daf-7 differ from starvation-induced dauers. For example, a recent preprint finds that the transcriptomes of these two mutants at dauer cluster much closer to each other than to starvation-induced dauers (Corchado et al. 2024). Older work has noted other differences, such as the time the dauer entry decision is made (Swanson and Riddle 1981), the synchronicity of dauer exit, the ability to force dauer entry in daf-d mutants, as well as additional dauer-unrelated phenotypes (reviewed in Karp 2018). We agree with the reviewer that this merits further clarification and will perform the experiments suggested by the reviewer below:

      To me, the simplest genetic explanation is that daf-7 and daf-2 are partially required for branch retraction in a manner redundant with sax-1, and the ts mutants are not fully wild-type at 15C. Thus, the sax-1 requirement is revealed only in these mutant backgrounds. Can the authors examine starvation-induced dauers of daf-7 or daf-2 raised continuously at 15C?

      We will do this experiment.

      daf-7 and daf-2 ts strains can form "partial dauers" that have a dauer-like appearance but are not SDS resistant. Could the difference between partial dauers and full dauers account for the difference in sax-1-dependence? The authors could use SDS selection of the daf-7 strain at 25C to ensure they are examining full dauers.

      We tested daf-7 mutants with 1% SDS when we set up the system – they are fully dauer at 25°C and are SDS sensitive after exit. We will repeat this important control with daf-7; sax-1 double mutants.

      The Bargmann lab has created a daf-2 FLP-OUT strain (ky1095ky1087) that allows cell-type-specific removal of daf-2. Could this be used to test for a cell-autonomous role of daf-2 in IL2Q related to branch elimination?

      We can attempt this experiment. However, since IL2 promoters turn on prior to dauer, the interpretation would not be straightforward: it would be hard to exclude the possibility that a cell-autonomous defect in dauer entry accounts for the IL2 dauer exit phenotype, even if branching appears normal.

      These ideas are not a list of specific experiments the authors need to complete, rather they are meant to illustrate some possible approaches to the question. Whatever approach they use, it is important for them to more rigorously explain why SAX-1 is not required for branch removal in wild-type animals.

      We completely agree. We will carry out the 15°C experiment, examine morphological characteristics and test SDS resistance. In addition, we will test neuronal markers that differ between dauers and non-dauers to determine whether the mutants are full or partial dauers at the relevant timepoints.

      The SAX-2 localization (Fig. 4) and endocytosis assay (Fig. 6) results were not clear to me from the data shown. Overall a more rigorous analysis and presentation of the data would be important to make these conclusions convincing. This may involve refining the data presentation in the figures, modifying the claims (e.g., "we propose" vs "we find"), or saving some of the data to be more fully explored in a future paper. In my view, these figures are the biggest weak point of the manuscript and also are not important for the central conclusions (which are well supported and convincing), indeed these results are barely mentioned in the Abstract or last paragraph of Introduction.

      We agree that the analysis and presentation of Figures 4 and 6 need to be improved. The presentation has already been updated, and the figures are clearer now. In the revision, we will increase sample size to provide stronger conclusions, consolidate some of the analysis and further improve presentation. While we agree with the reviewer that conclusions from these figures are not as strong as those drawn from genetic experiments, they do complement and support the conclusions of those other figures.

      • In Fig. 4D, why is SAX-2 visible throughout the entire neuron and why is the "punctum" marked with an arrow also seen in the tagRFP channel? One gets the impression that some of the puncta may be background, bleed-through, or artifacts due to cell varicosities.

      There is no bleed-through: this is most evident by looking at the brightest signals in the cell body (now labelled with an asterisk in a zoomed-out image) and noting that they do not bleed between channels. In sax-1 mutants, the SAX-2::GFP puncta are very obvious and distinguishable from the tagRFP channel. In control, SAX-2::GFP is very faint in the dendrite, so we increased the contrast to allow visualization. The reviewer is correct that under these conditions, some puncta look like the cytosolic fill. In the revision, we will re-analyze the data and will not consider these as bona-fide SAX-2 puncta, but rather cytosolic SAX-2 that accumulates due to constrictions and varicosities in the dendrite.

      • Related to both Fig. 4 and Fig. 6, where does SAX-1 localize in IL2Q in dauer and post-dauer? Does its expression or localization change during branch retraction? Does it co-localize with SAX-2 or endocytic puncta?

      We generated an endogenously tagged sax-1 with a 7xspGFP11 tag; however, this was below detection in the IL2s. For the revisions, we can test an overexpressed cDNA construct.

      **Referee cross-commenting**

      I think we all touched on similar points. I wanted to follow up on Reviewer 3's comment, "Is the failure to eliminate branches an indication of incomplete dauer recovery? Do sax-1 mutants retain additional characteristics of dauer morphology in post dauer adults." I thought this was an excellent point. It made me wonder if that might explain why the defect is only seen in daf-7 and daf-2 mutant backgrounds - maybe these strains retain partial dauer traits even after exit. Is there a specific experiment that they could do? Did you have specific characteristics of dauer morphology in mind for them to check? (Ideally something in the nervous system that can be scored quantitatively.)

      Please see response to point #1 regarding experiments we will do to confirm the “dauer state” of daf-7 and daf-7; sax-1 double mutants.

      Reviewer #1 (Significance (Required)):

      A major strength of this work is the pioneering use of a novel system to study neuronal branch retraction. C. elegans has provided a powerful model for studying how dendrite branches form, but much less attention has been paid to how excess neuronal branches are removed. The post-dauer remodeling of IL2Q neurons provides an exciting and dramatic physiological example to explore this question.

      This paper is notable for taking the first steps towards developing this innovative model. It does exactly what is needed at the outset of a new exploration - a forward genetic screen to discover the main regulators of the process. Using a combination of classical and modern genetic approaches, the authors bootstrap their way to a sizeable list of factors and a solid understanding of the properties of this system, for example that retraction of higher vs lower order dendrites show different genetic requirements.

      We thank the reviewer for recognizing the novelty and significance of our work.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      In this manuscript, the authors establish C. elegans IL2 neurons as a system in which to study dendrite pruning. They use the system to perform a genetic screen for pruning regulators and find an allele of sax-1. Unexpectedly sax-1 is only required for post-dauer pruning in two different genetic backgrounds that induce dauer formation, but not starvation-induced dauer formation. Sax-1/NDR kinase reduction has previously been associated with increased outgrowth and branching in other systems, so this is a new role for this protein. However, the authors show that proteins that work with Sax-1 in other systems, like sax-2/fry, also play a role in this pathway. The genetic experiments are beautiful and the findings are all clearly explained and strongly supported. The authors also examine sax-2 localization, which localizes sax-1 in other systems, and show it in puncta in dendrites that increase with dauer exit, consistent with function at the time of pruning. They also show that membrane trafficking regulators associated with NDR kinases function in the same pathway here, hinting that endocytosis may play a role during pruning as in Drosophila. The link to endocytosis was a little weak (see Major point below). Overall, this study describes a new system to study pruning and identifies NDR/fry/Rabs as regulators of pruning during dauer exit. The work is very high quality and both the imaging and genetics are extremely well done.

      We thank the reviewer for their positive assessment of the manuscript.

      Major points

      1. The only place where there were any questions about the data was the last figure (6G and I). Here they use uptake of GFP secreted from muscle as a readout of endocytosis in IL2 neurons. They nicely show that more internalized puncta accumulate as animals exit dauer. The claim that this is reduced in sax-1 mutants doesn't seem to match the images shown well. In the image there are many more puncta in the GFP channel and much more accumulation of the RFP-tagged receptor everywhere. It seems like some additional analysis of this data is important to fully capture what is going on and whether this really represents an endocytic defect.

      We agree and will provide additional data in Figure 6. The specific discrepancy between the image and the quantification arose because we showed a single focal plane rather than a projection, which does not capture all the puncta in a neurite. The current version shows a projection, making it evident that the mutant has fewer puncta than the control.

      Reviewer #2 (Significance (Required)):

      Neurite pruning is important in all animals with neurons. Genetic approaches have primarily been applied to the problem using Drosophila, so identifying a new model system in which to study it is an important step. Using this system, a pathway known to function in a different context is linked to pruning. Thus the study provides new insights into both pruning and this pathway.

      We thank the reviewer for the positive assessment of our study’s significance.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      Summary: Figueroa-Delgado et al. use a C. elegans neuroplasticity model to examine how dendrites are eliminated upon recovery from the stress-induced larval stage, dauer. The authors performed a mutagenesis screen to identify novel regulators of dendrite elimination and revealed some surprising results. The branch elimination mechanism varies between 2°, 3°, and 4° branches. The NDR kinase SAX-1 and its interactors (SAX-2 and MOB-2) are required for elimination of second and third order branches but not fourth order branches. Interestingly, they showed that branch elimination varies depending on the stimulus of dendrite outgrowth, such that the NDR kinase is required for branch elimination after genetically inducing the dauer stage but is not required if dauers are produced through food deprivation. The authors go a step further to include a small candidate screen looking at various pathways of membrane remodeling and identify additional regulators of dendrite elimination related to membrane trafficking, including RABI-1, RAB-8, RAB-10, and RAB-11.2.

      We thank the reviewer for their time and for the suggestions below.

      Major comments:

      • While I find the data promising and exciting, several of the experiments have concerningly low sample sizes. Fig 3G, Fig 4G, Fig 5J and L, and Fig 6I all contain data sets that are fewer than 10 animals. Sample sizes should be stated specifically in the figure legends for all data represented in the graphs. We thank the reviewer for finding the data exciting. We agree that the sample sizes in some panels are low and will increase them in the revised version. Sample sizes are now specifically listed in the figure legends.

      • All statements based on data not shown should be amended to include the data as a supplemental figure or edited to omit the statement based on withheld data. We agree. Some “not shown” data are already added to the current version of the manuscript and the rest will be added to the fully revised version, or the statements will be omitted.

      • Rescue experiments (Fig 2J) should demonstrate failure to rescue from neighboring tissue types (hypodermis and muscle) to conclude cell autonomous rescue rather than a broadly acting factor. Thank you for the suggestion. We will use a hypodermal promoter and a muscle promoter driving SAX-1 cDNA expression to strengthen the claim of cell autonomy.

      • Fig 4 needs quantification of higher order branches and SAX-2 proximity to branch nodes as these are discussed in the text. We will add this quantification.

      Minor comments:

      • Fig 1C-F, It appears like the shy87 allele produces animals of significantly different body sizes. It would improve rigor to normalize the dendrite coverage to body size in the quantification. We do not see a biologically meaningful size difference between shy87 and control; the impression may come from the specific image shown. We will confirm this by measuring animal size for the final revision.

      • Is the failure to eliminate branches an indication of incomplete dauer recovery? Do sax-1 mutants retain additional characteristics of dauer morphology in post-dauer adults? This important point was also raised by Reviewer 1. We will test SDS sensitivity, morphological markers, and molecular markers to determine the dauer “state” of the mutants used in this study. The results will be included in the final revision.

      • The text references multiple transgenic lines tested in Fig 2I-J but only one line is shown. Additional lines were visually examined under a fluorescent compound microscope but not imaged or quantified. We will add this quantification to the final revision.

      • Fig 4F, Additional timepoints would enhance the sax-1 localization result and might provide insight into mechanism of action for sax-1. We will add the localization in post-dauer adults.

      • Fig 6I Control and sax-1(ky491) example images should be provided in the supplement. We will add these images to the final revision.

      Referee cross-commenting

      I agree that we shared many of the same concerns.

      There are several general assays for dauer characteristics that could be used here to determine if the post-dauer animals retain other characteristics of the dauer stage in addition to IL2 branches (SDS resistance, alae remodeling, pharyngeal bulb morphology, nictation behavior). The nictation behavior has been connected very nicely with IL2 neurons (Junho Lee's group). Additionally, FLP dendrites occupy the same space as the IL2 branches and outgrowth in post-dauers occurs in coordination with IL2 branch elimination - this might be another optional experiment, to check if FLP growth is impeded by persistent IL2 branches. All of these could be quantified similar to how the authors have already established with their IL2 model (FLP dendrite branches) or with a binary statistic.

      Please see responses to Reviewer 1 and 3 above for the list of experiments to determine whether the animals fail to completely enter or exit dauer.

      Reviewer #3 (Significance (Required)):

      These results describe a new role for the NDR kinase complex in dendrite pruning that has clinical significance to our understanding of human brain development and human health concerns in which pruning is dysregulated, such as observed in the case of autism. The authors use an established C. elegans neuroplasticity model (Schroeder et al. 2013), which provides a tractable and reproducible platform for discovering the mechanism of dendrite pruning. These results would influence future work in the fields of cell biology of the neuron and disease models of brain development.

      My expertise is in the field of C. elegans neuroscience and stress biology, and I have sufficient expertise to evaluate all aspects of this work.

      3. Description of the revisions that have already been incorporated in the transferred manuscript

      Reviewer #1

      • In Fig. 4C, the distinction between puncta in the primary or higher-order dendrites is not clear to me, and several puncta that I would have scored as primary are marked as higher-order.

      We apologize for a mistake in the arrowhead color and overall presentation of this figure. It has been fixed in the current version.

      • Related to this, in Fig. 4B are the two arrows meant to be white as in the top panel, or yellow as in the bottom panel?

      We thank Reviewer #1 for their observation, and we apologize for our oversight. We fixed this in the current version.

      • In Fig. 4, where in the head are we looking? It would help to show a more low-magnification view of the entire cell.

      We added zoomed-out images and indicated where the zoomed in insets are taken from. We thank the reviewer for helping us improve the clarity of the data.

      • The main sax-1 phenotype is increased SAX-2 puncta in dauer, but the branch retraction defect is in post-dauers. How is this relevant to the phenotype?

      This is a very good point. The increase in SAX-2 puncta in sax-1 mutants is stronger during dauer-exit than in dauer, consistent with this being the time when SAX-1 functions. We agree that some earlier activity of SAX-1 cannot be excluded, and we do not assume that the effect on SAX-2 completely accounts for the pruning defects. This is now acknowledged in the text. However, given that both proteins function together in pruning, and given that the effect is strongest during dauer exit, we do believe that this data is informative and worth showing.


      • The number of SAX-2 puncta in sax-1 mutants decreases almost to normal in post dauers. Is there a correlation between the number of remaining branches and the number of SAX-2 puncta? That is, do the many wild-type animals with "excess" SAX-2 puncta also fail to retract branches?

      There is no correlation. In other words, the number of SAX-2 puncta does not instruct the extent of pruning. Please note that the quantifications underestimate the number of SAX-2 puncta in the mutants, since they were only done on the primary dendrite. This is necessary because the mutant and control have different arbor sizes, so the only branch order that can be appropriately compared is the primary dendrite.

      • The control post-dauer data in Fig. 4F and 4H are identical (re-used data) but the corresponding control dauer data in Fig. 4F and 4G are different. What is going on here?

      We thank the reviewer for raising this point and apologize for the oversight in data presentation. In the revised manuscript, we now show all control and experimental data integrated into a single graph, ensuring that each dataset is represented accurately to provide a comparison between dauer and post dauer recovery conditions.


      • Why are sample sizes so small for both strains in Fig. 4G compared to Fig. 4F and 4H?

      We sincerely apologize for this mistake: some of the data were erroneously grouped in the original submission. The revised version contains an updated number of neurons, presented on the same graph, and in the final revision we will further increase the sample size. We apologize again for this error.

      • In Fig. 6C, why are the tagRFP (blue) puncta larger than the neurite? Aren't these meant to represent vesicles inside the surrounding neurite? One gets the impression that this is bleed-through from the GFP channel.

      Based on EM, both an endocytic punctum and the diameter of the neuron are smaller than a single pixel. The apparent difference in size in fluorescence microscopy is because the puncta are brighter (they contain more membrane) and thus appear larger. In the current version, the improved presentation of the figure contains zoomed out images that clearly show that there is no bleed-through.

      • In Fig. 6E and 6F, why are there no tagRFP (blue) puncta? Is CD8 not endocytosed at all if it lacks the nanobody sequence? One would expect the tagRFP (blue) signal to be the same in both strains and simply to lack yellow if the nanobody is not present.

      CD8 lacks clear endocytosis motifs, which is why it is advantageous for labelling neurites and testing endocytosis when paired with an endocytic signal (Lee and Luo 1999; Kozik et al. 2010). Conversely, extracellular GFP binding to a membrane GFP antibody can induce endocytosis (for example, see Tang et al. 2020), likely by inducing clustering, although we are not familiar with work that explored the mechanism. In the updated version we included a rare example of an mCD8 punctum.

      • The authors report a decrease in endocytic events in sax-1, but qualitatively it looks like there are vastly more puncta inside the neuron in Fig. 6H than in 6G.

      We apologize for the presentation in the original version of Figure 6. This impression was because we showed single focal planes that only captured some of the signal. In the revised version we show projections, which makes it evident that there are fewer endocytic events in the mutant.

      • In Fig. 6E and 6H, why are there so many GFP (yellow) puncta outside the neuron? What are these structures and why are they absent in the strain with the nanobody?

      These puncta are secreted or muscle-associated GFP that has not been internalized by IL2Q neurons. They are present in all strains in this figure, as can be clearly seen in the zoomed-out images that have been added to the updated figure.

      • What is the large central blue structure in Fig. 6H - is this the soma? - and why are puncta in this region not counted?

      This is indeed the soma. In the updated version this can be clearly seen in the zoom-out. The large puncta in the soma were not counted because they may arise from the fusion of an unknown number of smaller puncta, and their precise number cannot be determined at the resolution of fluorescence microscopy.

      • minor: there is text reading "40-" in the bottom panel of Fig. 6H. It is visible when printed but not on screen - adjust levels in Photoshop to reveal it.

      We thank the reviewer for catching this oversight; it is now fixed.

      Minor points:

      1. At several points the authors emphasize the relationship of neurite remodeling to stress, e.g. Abstract and Discussion: "we adapted C. elegans IL2 sensory dendrites as a model [of...] stress-mediated dendrite pruning". It seems unnecessary and potentially misleading to treat this as a neuronal stress response. First, it conflates organismal and cellular stress - there is no reason to think that IL2 neurons are under cellular stress in dauer. In fact parasitic nematodes go through dauer-like stages as part of healthy development and probably have similar remodeling of IL2. Second, dendrite pruning occurs during dauer exit, which is the opposite of a stress response - it reflects a return to favorable conditions.

      We agree. We modified the abstract and discussion to avoid conflating organismal stress (the alleviation of which is relevant for triggering pruning) and cellular stress. Thank you for pointing this out.

      In Fig. 1A, C. elegans is shown going directly from L1 to dauer in response to unfavorable conditions, which is incorrect. Animals proceed through L2 (in many cases actually an alternative L2d pre-dauer) and then molt into dauer (an alternative L3 stage) after completing L2.

      We updated the schematic to include the L2d stage where commitment to dauer entry or resumption to reproductive development is made.

      In Fig. 1B, please check if it is correct that hypodermis contacts the pharynx basement membrane as drawn. The schematic in the top panel makes it look like there is a single secondary branch and the quaternary branches are similar in length to the primary dendrite. The schematic in the bottom panel makes it look like the entire neuron is a small fraction of the length of the pharynx. Could these be drawn closer to scale?

      The hypodermis does contact the pharynx basement membrane. We redrew the schematic for clarity.

      Reviewer #2

      For context, it might be helpful to know whether branching of other dendrites is increased in sax-1 mutants (as expected based on phenotypes in other animals) or decreased like IL2 neurons.

      We examined the branching pattern of PVD, a polymodal nociceptive neuron (new Supplemental Figure 3). We find no significant difference between control and sax-1 or sax-2 mutants, suggesting that these genes function in the context of pruning. Recent work (Zhao et al. 2022) confirms that sax-1 is not required for PVD branching.

      Minor:

      "shy87 mutant dauers showed a minor reduction in secondary and tertiary branches compared to control (Figure 1G). These results indicate that shy87 is specifically required for the elimination of dauer-generated dendrite branches." Maybe temper the specificity claim some as the reduction in branches is definitely there.

      We agree, the claim was tempered.

      "three complimentary approaches" should be complementary

      Thank you for noticing. We fixed this.

      "In control animals, SAX-2 was mostly concentrated in the cell body (data not shown)" It might be nice to include some overview images that show the cell body for completeness.

      We added zoomed-out images to the revised figure, thank you for the suggestion.

      Reviewer #3


      Minor comments:


      • Fig 1G-H, are shy87 second and third order branch counts statistically different between dauer and post-dauer adults? This comparison would strengthen the claim that these order branches fail to eliminate altogether rather than undergo a partial elimination. We added this to Figure S2. The shy87 mutants show a complete failure in eliminating secondary branches (i.e. no difference between dauer and post-dauer) and a strong but incomplete defect in eliminating tertiary branches.

      • Fig 4B-E Indicate branch order in the images, this is unclear and a point that is focused on in the text. Done.

      • Discussion of Fig 1G from the text claims that shy87 is specifically required for branch elimination yet the data shows significant defects in branch outgrowth as well. This raises the question, are the branches abnormally stabilized that results in early underdevelopment and late atrophy? Authors should acknowledge alternative hypotheses. We agree and will revise the text accordingly. The difference between shy87 and control dauers, while statistically significant, is relatively minor and can only be detected by careful quantification; it is not apparent from looking at the images (in contrast, for example, to rab-8 and rab-10 mutants, where we acknowledge in the text that their branching defects might affect subsequent pruning).

      • Authors reference a branch elimination process but don't outline what this would entail and where their results fit in. We apologize for being unclear. Given that sax-1 and sax-2 function together, one would intuitively expect to see SAX-2 being reduced in sax-1 mutants, yet the opposite is observed. One potential explanation is that SAX-1 does not directly control SAX-2 abundance, but that clearance of SAX-2 is part of the pruning process that both proteins regulate. This would explain the enrichment of SAX-2 in sax-1 mutants. However, additional models cannot be excluded, and we acknowledge this in the revised text.

      References:

      Corchado, Johnny Cruz, Abhishiktha Godthi, Kavinila Selvarasu, and Veena Prahlad. 2024. “Robustness and Variability in Caenorhabditis Elegans Dauer Gene Expression.” Preprint, bioRxiv, August 26. https://doi.org/10.1101/2024.08.15.608164.

      Karp, Xantha. 2018. “Working with Dauer Larvae.” WormBook, August 9, 1–19. https://doi.org/10.1895/wormbook.1.180.1.

      Kozik, Patrycja, Richard W Francis, Matthew N J Seaman, and Margaret S Robinson. 2010. “A Screen for Endocytic Motifs.” Traffic (Copenhagen, Denmark) 11 (6): 843–55. https://doi.org/10.1111/j.1600-0854.2010.01056.x.

      Lee, T., and L. Luo. 1999. “Mosaic Analysis with a Repressible Cell Marker for Studies of Gene Function in Neuronal Morphogenesis.” Neuron 22 (3): 451–61.

      Swanson, M. M., and D. L. Riddle. 1981. “Critical Periods in the Development of the Caenorhabditis Elegans Dauer Larva.” Developmental Biology 84 (1): 27–40. https://doi.org/10.1016/0012-1606(81)90367-5.

      Tang, Rui, Christopher W Murray, Ian L Linde, et al. 2020. “A Versatile System to Record Cell-Cell Interactions.” eLife 9: e61080. https://doi.org/10.7554/eLife.61080.

      Zhao, Ting, Liying Guan, Xuehua Ma, Baohui Chen, Mei Ding, and Wei Zou. 2022. “The Cell Cortex-Localized Protein CHDP-1 Is Required for Dendritic Development and Transport in C. Elegans Neurons.” PLOS Genetics 18 (9): e1010381. https://doi.org/10.1371/journal.pgen.1010381.


      4. Description of analyses that authors prefer not to carry out

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

      Learn more at Review Commons


      Referee #1

      Evidence, reproducibility and clarity

      This interesting study uses an unbiased genetic screen in C. elegans to identify SAX-1/NDR kinase as a regulator of dendritic branch elimination. Loss of SAX-1 results in an excess branching phenotype that is striking and highly penetrant. The authors identify several additional regulators of branch elimination (SAX-2, MOB-1, RABI-1, RAB-11.2) by using a candidate genetic screen aimed at factors that interact physically or genetically with SAX-1. They propose that SAX-1 acts by promoting membrane retrieval based on the nature of these interactors and the results of an imaging-based in vivo assay for endocytic puncta.

      Major comments.

      1. My biggest concern is that the phenotypes are only observed in temperature-sensitive dauer-constitutive mutant backgrounds, and not in wild-type dauers. That is, wild-type animals exiting dauer do not require SAX-1 for dendrite elimination.

      While this does not undermine the importance of the results, it does require more explanation. The authors write that "the requirement for sax-1... relies on specific physiological states of the dauer stage," but I do not understand what this means. Are they saying that daf-7 and daf-2 dauers are in a different "physiological state" than wild-type dauers? In what way? What is the evidence for this? A more rigorous explanation is needed.

      To me, the simplest genetic explanation is that daf-7 and daf-2 are partially required for branch retraction in a manner redundant with sax-1, and the ts mutants are not fully wild-type at 15C. Thus, the sax-1 requirement is revealed only in these mutant backgrounds. Can the authors examine starvation-induced dauers of daf-7 or daf-2 raised continuously at 15C?

      daf-7 and daf-2 ts strains can form "partial dauers" that have a dauer-like appearance but are not SDS resistant. Could the difference between partial dauers and full dauers account for the difference in sax-1-dependence? The authors could use SDS selection of the daf-7 strain at 25C to ensure they are examining full dauers.

      The Bargmann lab has created a daf-2 FLP-OUT strain (ky1095ky1087) that allows cell-type-specific removal of daf-2. Could this be used to test for a cell-autonomous role of daf-2 in IL2Q related to branch elimination?

      These ideas are not a list of specific experiments the authors need to complete, rather they are meant to illustrate some possible approaches to the question. Whatever approach they use, it is important for them to more rigorously explain why SAX-1 is not required for branch removal in wild-type animals.

      2. The SAX-2 localization (Fig. 4) and endocytosis assay (Fig. 6) results were not clear to me from the data shown. Overall a more rigorous analysis and presentation of the data would be important to make these conclusions convincing. This may involve refining the data presentation in the figures, modifying the claims (e.g., "we propose" vs "we find"), or saving some of the data to be more fully explored in a future paper. In my view, these figures are the biggest weak point of the manuscript and also are not important for the central conclusions (which are well supported and convincing), indeed these results are barely mentioned in the Abstract or last paragraph of Introduction.

      • In Fig. 4, where in the head are we looking? It would help to show a more low-magnification view of the entire cell.
      • In Fig. 4D, why is SAX-2 visible throughout the entire neuron and why is the "punctum" marked with an arrow also seen in the tagRFP channel? One gets the impression that some of the puncta may be background, bleed-through, or artifacts due to cell varicosities.
      • In Fig. 4C, the distinction between puncta in the primary or higher-order dendrites is not clear to me, and several puncta that I would have scored as primary are marked as higher-order.
      • Related to this, in Fig. 4B are the two arrows meant to be white as in the top panel, or yellow as in the bottom panel?
      • The main sax-1 phenotype is increased SAX-2 puncta in dauer, but the branch retraction defect is in post-dauers. How is this relevant to the phenotype?
      • The number of SAX-2 puncta in sax-1 mutants decreases almost to normal in post dauers. Is there a correlation between the number of remaining branches and the number of SAX-2 puncta? That is, do the many wild-type animals with "excess" SAX-2 puncta also fail to retract branches?
      • The control post-dauer data in Fig. 4F and 4H are identical (re-used data) but the corresponding control dauer data in Fig. 4F and 4G are different. What is going on here?
      • Why are sample sizes so small for both strains in Fig. 4G compared to Fig. 4F and 4H?
      • In Fig. 6C, why are the tagRFP (blue) puncta larger than the neurite? Aren't these meant to represent vesicles inside the surrounding neurite? One gets the impression that this is bleed-through from the GFP channel.
      • In Fig. 6E and 6F, why are there no tagRFP (blue) puncta? Is CD8 not endocytosed at all if it lacks the nanobody sequence? One would expect the tagRFP (blue) signal to be the same in both strains and simply to lack yellow if the nanobody is not present.
      • In Fig. 6E and 6H, why are there so many GFP (yellow) puncta outside the neuron? What are these structures and why are they absent in the strain with the nanobody?
      • What is the large central blue structure in Fig. 6H - is this the soma? - and why are puncta in this region not counted?
      • The authors report a decrease in endocytic events in sax-1, but qualitatively it looks like there are vastly more puncta inside the neuron in Fig. 6H than in 6G.
      • minor: there is text reading "40-" in the bottom panel of Fig. 6H. It is visible when printed but not on screen - adjust levels in Photoshop to reveal it.
      • Related to both Fig. 4 and Fig. 6, where does SAX-1 localize in IL2Q in dauer and post-dauer? Does its expression or localization change during branch retraction? Does it co-localize with SAX-2 or endocytic puncta?

      Minor points:

      1. At several points the authors emphasize the relationship of neurite remodeling to stress, e.g. Abstract and Discussion: "we adapted C. elegans IL2 sensory dendrites as a model [of...] stress-mediated dendrite pruning". It seems unnecessary and potentially misleading to treat this as a neuronal stress response. First, it conflates organismal and cellular stress - there is no reason to think that IL2 neurons are under cellular stress in dauer. In fact parasitic nematodes go through dauer-like stages as part of healthy development and probably have similar remodeling of IL2. Second, dendrite pruning occurs during dauer exit, which is the opposite of a stress response - it reflects a return to favorable conditions.
      2. In Fig. 1A, C. elegans is shown going directly from L1 to dauer in response to unfavorable conditions, which is incorrect. Animals proceed through L2 (in many cases actually an alternative L2d pre-dauer) and then molt into dauer (an alternative L3 stage) after completing L2.
      3. In Fig. 1B, please check if it is correct that hypodermis contacts the pharynx basement membrane as drawn. The schematic in the top panel makes it look like there is a single secondary branch and the quaternary branches are similar in length to the primary dendrite. The schematic in the bottom panel makes it look like the entire neuron is a small fraction of the length of the pharynx. Could these be drawn closer to scale?

      Referee cross-commenting

      I think we all touched on similar points. I wanted to follow up on Reviewer 3's comment, "Is the failure to eliminate branches an indication of incomplete dauer recovery? Do sax-1 mutants retain additional characteristics of dauer morphology in post dauer adults." I thought this was an excellent point. It made me wonder if that might explain why the defect is only seen in daf-7 and daf-2 mutant backgrounds - maybe these strains retain partial dauer traits even after exit. Is there a specific experiment that they could do? Did you have specific characteristics of dauer morphology in mind for them to check? (Ideally something in the nervous system that can be scored quantitatively.)

      Significance

      A major strength of this work is the pioneering use of a novel system to study neuronal branch retraction. C. elegans has provided a powerful model for studying how dendrite branches form, but much less attention has been paid to how excess neuronal branches are removed. The post-dauer remodeling of IL2Q neurons provides an exciting and dramatic physiological example to explore this question.

      This paper is notable for taking the first steps towards developing this innovative model. It does exactly what is needed at the outset of a new exploration - a forward genetic screen to discover the main regulators of the process. Using a combination of classical and modern genetic approaches, the authors bootstrap their way to a sizeable list of factors and a solid understanding of the properties of this system, for example that retraction of higher vs lower order dendrites show different genetic requirements.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      I am not currently convinced by the principal interpretations and think that other explanations based on known phenomena could account for key results. Specifically the authors have not resolved whether oxidative modification to 5mC and 3mC, or chemical attack to ssDNA that is transiently exposed in the repair processing of 5mC and 3mC is the principal source of the observed genotoxicity.

      (1) Original query which still stands: As noted in the manuscript, AlkB repairs alkylation damage by direct reversal (DNA strands are not cut). In the absence of AlkB, repair of alkylation damage/modification is likely through BER or other processes involving strand excision and resulting in single stranded DNA. It has previously been shown that 3mC modification from MMS exposure is highly specific to single stranded DNA (PMID:20663718), occurring at ~20,000 times the rate seen in double stranded DNA. Consequently the introduction of DNMTs is expected to introduce many methylation adducts genome-wide that will generate single stranded DNA tracts when repaired in an AlkB deficient background (but not in an AlkB WT background), which are then hyper-susceptible to attack by MMS. Such ssDNA tracts are also vulnerable to generating double strand breaks, especially when they contain DNA polymerase stalling adducts such as 3mC. The generation of ssDNA during repair is similarly expected to follow the H2O2 or TET based conversion of 5mC to 5hmC or 5fC, neither of which can be directly repaired and both of which depend on single strand excision for their removal. The potential importance of ssDNA generation in the experiments has not been [adequately] considered.

      We thank the reviewer for expanding on their previous comment.  We completely agree with the possibility that they raise and have added an extra paragraph in the discussion to expand on our consideration of the role of ssDNA in DNMT-induced DNA damage, which we reproduce here:

      "The observation that TET overexpression sensitizes cells expressing DNMTs to oxidative stress strongly suggests that the site of DNA damage is the modified cytosine itself.  However, we do not currently have definitive evidence supporting this.  As mentioned in the results section, the presence of unrepaired 3mC may lead to increased levels of ssDNA; it is also possible that 5mC itself may increase ssDNA levels.  Loss of alkB would be expected to increase the amount of ssDNA.  Thus DNA damage surrounding modification sites, but not specifically localised to it, might be the cause of the increased sensitivity.  These two different models make different predictions.  If modified cytosines are the source of the damage, mutations arising would be predominantly located at CG dinucleotides.  Alternatively, ssDNA exposure would result in distributed mutations that would not necessarily be located at CG sites.  The highly biased spectrum of mutations that can be screened through the Rif resistance assay does not allow us to address this currently.  However, future experiments to create mutation accumulation lines could allow us to address the question systematically on a genome-wide level. "

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      We thank the reviewers for their positive comments. Our manuscript is to our knowledge the first to investigate the role of VAIL (V-ATPase—ATG16L1 induced LC3 lipidation), a form of CASM (Conjugation of ATG8s to single membranes) in SARS-CoV-2 replication. We demonstrate that SARS-CoV-2 Envelope (E) induces VAIL and this contributes to viral replication, including by using a reverse genetics system to make an E mutant virus. There have been many high quality studies examining the role of canonical autophagy in SARS-CoV-2 replication and our manuscript does not argue that all or even most LC3 lipidation during infection is via VAIL. We will try to make this point more clearly in the text. We do not think this detracts from the novelty and importance of our manuscript.

      *Reviewer #1 (Evidence, reproducibility and clarity (Required)): *

      • Figueras-Novoa et al present a short report demonstrating the induction of LC3 lipidation on single membranes by SARS-CoV-2 through a noncanonical autophagy pathway referred to as VAIL. The authors utilize elegant genetic tools to show that the induction of LC3 lipidation upon viral infection is mainly due to VAIL rather than canonical autophagy. They demonstrate that the activity of the viral E protein that can cause neutralization of acidic vesicles leads to the activation of non-canonical LC3 lipidation on single membranes. Interestingly, the authors also conclude that the impairment of VAIL leads to a reduction of viral load as a result of a defect in later stages of viral infection, although the underlying mechanism was not further explored. *

      • Overall, this is an elegant and well controlled study that provides a clear conclusion. I only have some minor comments.*

      We thank the reviewer for their assessment of our manuscript.

      In some experiments, LC3 lipidation does not appear to be fully disrupted upon VAIL inhibition (e.g. Figs. 1H, 3D, 4A). As other labs have shown that SARS-CoV2 blocks autophagic flux, this could be further clarified in this manuscript as both VAIL and autophagy may be co-induced upon viral infection.

      We agree with the reviewer that there is a contribution of canonical macroautophagy to the LC3B lipidation observed in SARS-CoV-2. We will extend the discussion in the manuscript to clarify this point for the readers.

      Can the authors test the induction of LC3 lipidation in cells expressing K490 mutant of ATG16L1 in ATG16L1 KO cells to compare them with ATG16L1-ATG13 double knockouts?

      The western blot in Figure 3F (quantified in Figure 3G) shows LC3B lipidation in response to E expression in ATG16L1-ATG13 double knockout cells reconstituted with wild-type ATG16L1 but not in cells complemented with the ATG16L1 K490A mutant. We agree that the referee’s suggestion to perform these experiments in the context of infection would be informative. However, in spite of numerous attempts, we have so far been unable to generate a cell clone fully devoid of ATG16L1 in a cell line that can be productively infected with SARS-CoV-2. For reasons unclear to us, there appears to be a very low level of residual ATG16L1 activity despite multiple different CRISPR/Cas9 targeting attempts. The suggested complementation experiments might still be informative in the context of low-level ATG16L1 expression, so we will pursue this. Alternatively, as a contingency we can try to produce SARS-CoV-2 infectable cells with mutations in ATG16L1’s binding partner V1H, whose interaction with ATG16L1 is required for VAIL. A further contingency could be to assess LC3B lipidation during infection and treatment with a Vps34 inhibitor, which inhibits canonical autophagy.

      Minor points: The difference between Fig. 1F&G is unclear, as is why the authors are including both analyses. Similarly for Figures 4G&H.

      We included both metrics to show that the decrease in LC3B lipidation in cells expressing SopF during infection is robust and observed in two separate readouts. While spot area measures the area of infected cells covered by GFP-LC3B fluorescence, spot intensity is a reading of the intensity of the area defined in an infected cell as being LC3 positive. Theoretically, these measurements could change in different ways, for example if the same amount of lipidated LC3 were to distribute over a larger area of the cell. We prefer to keep both measurements in the manuscript.
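      The distinction between the two readouts can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the pixel values, the threshold and the function name `spot_metrics` are hypothetical and are not taken from the authors' image-analysis pipeline, which is not described here. It shows that the same total signal can give a larger spot area with a lower mean spot intensity.

```python
def spot_metrics(pixels, threshold):
    """Return (spot_area, mean_spot_intensity) for pixels above a threshold.

    Hypothetical helper for illustration only; not the authors' pipeline.
    """
    spot = [value for row in pixels for value in row if value > threshold]
    area = len(spot)
    mean_intensity = sum(spot) / area if area else 0.0
    return area, mean_intensity


# Two made-up "cells" carrying the same total signal (160 units).
compact = [
    [0, 0, 0, 0],
    [0, 80, 80, 0],
    [0, 0, 0, 0],
]
spread = [
    [0, 40, 40, 0],
    [0, 40, 40, 0],
    [0, 0, 0, 0],
]

compact_area, compact_intensity = spot_metrics(compact, threshold=10)
spread_area, spread_intensity = spot_metrics(spread, threshold=10)
print(compact_area, compact_intensity)
print(spread_area, spread_intensity)
```

      In this toy case the area doubles while the mean intensity halves, which is why reporting both metrics guards against a change in one being mistaken for a change in total lipidated LC3.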

      The authors should show boxed colocalisation of all images, including negative controls. For example, the authors have shown boxed magnifications in only the lowest panel in Figure 2A but not the upper two panels. Figures 4E&F should include boxed examples. This serves to clarify both positive and negative colocalisation events.

      Boxed magnifications will be added to all images.

      • Reviewer #1 (Significance (Required)): *

      • Overall an elegant and well controlled study demonstrating the induction of non-canonical LC3 conjugation on single membranes (VAIL) during SARS-CoV2 infection. A further exploration of canonical autophagy (as previously published by others) in addition to VAIL would enhance this study.*

      As the reviewer noted, several excellent studies have explored canonical autophagy during SARS-CoV-2 infection, many of which we cite in our manuscript. Our focus, however, is to demonstrate that SARS-CoV-2 E induces LC3 lipidation via VAIL. We believe that exploring the diverse roles of canonical autophagy mechanisms in SARS-CoV-2 infection is beyond the scope of this study.

      *This study is of interest to researchers studying autophagy, viruses, immunology, single membrane LC3 lipidation, and lysosomes as well as potentially clinicians treating SARS-CoV2 infected individuals. *

      • This reviewer is experienced in autophagy research.*

      We thank the reviewer for this assessment of our manuscript.

      *Reviewer #2 (Evidence, reproducibility and clarity (Required)): *

      • Major Comments *

      • Figure 1D does not very clearly show an overlap between V1D and LC3B. Both proteins seem broadly present across the cell and there is no easily identifiable change in V1D distribution upon infection. As such the overlay may be purely stochastic. The authors should quantify the observed co-localization events across multiple cells and biological replicates and compare them to other protein(s) with a similar cellular distribution pattern.*

      We agree there is no obvious change in V1D staining on infection. The images in Figure 1D are purely intended to illustrate that LC3 and the V-ATPase can colocalise, not to demonstrate a change in V-ATPase distribution or to suggest a direct interaction. We will make this point more clearly in the text. We will also carry out analyses of the kind suggested (see also response to the first two Minor Comments). We would be happy to provide an alternative method of visualising the V-ATPase (we could use any suitable antibody to the V-ATPase, or the bacterial effector SidK) if required. In response to reviewer 3’s comments, we will carry out a pull-down experiment to test the association of the V-ATPase and ATG16L1 during E expression, as this is a key interaction during VAIL activation.

      Based on Figure 2F the authors suggest that virus entry is unaffected by the inhibition of VAIL in early timepoints. However, according to the figure legend, the timepoint used is 7hpi, while 2D uses 24hpi. Some SARS-CoV-2 papers suggest 7-10 hours is sufficient time to release new virions (Ban-On et al., 2020). As such 7hpi can not necessarily be seen as an early time point. Did the authors test earlier ones? Also, based on this, would it be possible that the effects observed at 24hpi are actually secondary infections, meaning that the virus utilizes pathway components for virion production and a lack thereof reduces infectivity of newly formed virions? In this case it would be interesting to set up an assay that can distinguish between primary and secondary infection to study both individually more closely.

      Whereas 7 hours may be sufficient to release new virions, it is not sufficient to establish infections in other cells – this is why we chose that time point. The observation that there is no difference in the percentage of infected cells at 7 h p.i. (Figure 2F) led us to suggest that viral entry is unaffected. We then confirmed this through the pseudovirus assay in Figure 2G, where no difference is found between SopF and mCherry expressing cells. For this assay, GFP-expressing, replication incompetent, lentiviral particles pseudotyped with Spike from different SARS-CoV-2 lineages were used to transduce mCherry and SopF expressing cells. A change in the percentage of GFP-positive cells would indicate an effect on viral entry, but no such change was observed in SopF-expressing cells.

      We agree with the reviewer that the effects observed at 24 hpi are likely due to a defect in subsequent rounds of infection, since no difference was observed at 7 hpi or with our pseudovirus assay. We will attempt to make this point in the text as clearly as possible.

      The authors nicely show in their study an involvement of VAIL in SARS-CoV-2 mediated LC3 lipidation. However, the observed effects are relatively moderate in several experiments, indicating that there may be another contributor to the observed phenotype. It would be nice to highlight this in the discussion and debate potential mechanisms that are causing the observed effects during infection.

      We agree with the reviewer’s analysis. We have discussed the contribution of canonical autophagy in the second paragraph of the discussion, but we will expand on this in a revised manuscript. E expression levels are moderate during infection; other structural proteins such as N and M are present in much higher amounts. Since E is the key protein in VAIL initiation, a moderate effect of VAIL inhibition is perhaps expected. Nonetheless this still plays a crucial role in the viral life cycle.

      *Minor Comments *

      • The re-localization events shown in Fig 3A should be quantified.*

      This quantification of GFP-LC3 relocalisation will be carried out and included.

      • The co-localization events displayed in Fig 4A should be quantified.*

      The quantification of V1D, E and GFP-LC3 will be carried out and included.

      For Figure 2H-K the authors perform KDs of ATG16L1 and ATG13. While the results for the two specific proteins are certainly convincing, the authors would strengthen their argument by testing additional proteins in the autophagy pathway to support their claim that VAIL but not autophagy affects protein abundance of N (OPTIONAL).

      As discussed in response to reviewer 1, we will attempt to infect ATG16L1 KO cells reconstituted with a K490A ATG16L1 mutant, which is an established tool and has been validated to be deficient in VAIL but not canonical autophagy.

      ***Referee cross-commenting** *

      • Overall I agree with the comments of my co-reviewers and I think the suggested experiments/comments are sensible. *
      • I already alluded to this in part in my analysis, but I tend to agree with reviewer 3 on the limited effect VAIL seems to have on LC3b lipidation.*

      As outlined above in response to reviewer 1 and below to reviewer 3, we agree that there is a modest contribution of VAIL to overall LC3 lipidation, which correlates with a modest amount of E expression in SARS-CoV-2 infection. VAIL is clearly important for the viral life cycle, thus whatever the proportion of LC3 lipidation attributable to this pathway it must be biologically significant.

      *Reviewer #2 (Significance (Required)): *

      • While previous publications have shown interaction between SARS-CoV2 and autophagy, the authors of this manuscript demonstrate that V-ATPase-ATG16L1 induced LC3 lipidation (VAIL) is activated during infection and affects viral replication. *

      • This study provides an interesting new aspect to host-SARS_CoV-2 interactions. *

      • The manuscript is of interest for people studying virus-host cell interaction, as well as for researchers in the fields of infectious diseases, specifically SARS-CoV2, and autophagy/VAIL*.

      We thank the reviewer for their assessment of our manuscript.

      *Reviewer #3 (Evidence, reproducibility and clarity (Required)): *

      • The interaction of SARS-CoV-2 with canonical autophagy has been well documented. However, whether SARS-CoV-2 infection induces and benefits from non-canonical autophagy is unclear. In this manuscript, the authors demonstrated that SARS-CoV-2 infection induces V-ATPase-ATG16L1-induced LC3 lipidation (VAIL), a form of non-canonical autophagy in which LC3 is conjugated to single membranes. The SARS-CoV-2 envelope protein, through its ion channel activity, triggers the V-ATPase proton pump and induces VAIL during SARS-CoV-2 infection. Inhibiting VAIL during SARS-CoV-2 infection with SopF, a Salmonella effector, attenuates SARS-CoV-2 egress. *

      • While these findings are interesting and demonstrate that SARS-CoV-2 infection triggers VAIL for its own benefit, the mechanism by which VAIL promotes SARS-CoV-2 replication remains unclear. Moreover, the contribution of VAIL to LC3 lipidation during SARS-CoV-2 infection appears to be minimal, as blocking VAIL through SopF expression only marginally reduced LC3B lipidation (Fig. 1H). Therefore, the contribution of VAIL to LC3 lipidation during SARS-CoV-2 infection is minimal.*

      We thank the reviewer for their assessment of our manuscript. As we have already alluded to in our response, we agree that only part of the LC3 lipidation observed during infection can be attributed to VAIL. There is a reproducible effect on viral replication which we have demonstrated in multiple ways, therefore the contribution of VAIL is of biological importance.

      *Comments: *

      • The authors show that the ion channel activity of E is essential for VAIL induction during SARS-CoV-2 infection. Since V-ATPase recruits the ATG16L complex to induce VAIL, and to clarify how SARS-CoV-2 infection triggers VAIL, the authors should examine whether SARS-CoV-2 infection or the expression of E induces V-ATPase-ATG16L interaction and whether this interaction is disrupted when SopF is expressed.*

      We agree with the reviewer that this would be an informative experiment. We can carry out this experiment in an E expression system, rather than infection. This is due to the difficulty of getting enough material to carry out this kind of pull-down experiment in infected cells (at the time of writing these experiments still have to be carried out in CL3).

      • Since the authors suggest that expression of SopF attenuates viral exit, one would expect that the number of N-positive cells will increase in SopF-expressing cells compared to the mCherry control cells. However, as shown in Figure 2D, this is not the case. Could the authors discuss why N-positive cells will be reduced in SopF-expressing cells when viral egress is impeded in these cells*?

      This is a reflection of multi-cycle kinetics. N is still very strongly expressed in infected cells, even after virions have egressed. SARS-CoV-2 can infect VAIL-deficient cells and expresses the same levels of N prior to subsequent rounds of infection (at 7 hours after infection for example). Egress in VAIL-deficient, SopF-expressing cells is defective. Therefore, fewer cells will be infected in subsequent rounds of infection in SopF expressing cells, resulting in fewer N-positive cells in the SopF expressing cell population (most obvious after 24 hours).

      Figure 2H. The authors show that knockdown of ATG16L1 reduces the expression of N during SARS-CoV-2 infection compared to the controls. To confirm that knockdown of ATG16L1, which is required for both canonical autophagy and VAIL, reduces N staining via VAIL, the authors should examine the impact of SopF expression on N levels in ATG16L KD cells. This experiment will confirm if the reduction in N staining in ATG16L1 KD cells is due to VAIL.

      As stated in the response to reviewer 1, we can attempt this experiment in an ATG16L1 KO system complemented with K490A ATG16L1, which is deficient in VAIL and not canonical autophagy.

      • Figure 2J. The quality of the Western blot data is poor.*

      In this western the exposure is deliberately turned up to show that minimal ATG13 was left after knock down. We will also show the full blot with less exposure – this will demonstrate high quality.

      Also, N appears as a single band in Figure 2J, but appears as double bands in Figures 2A and H. Could the authors explain this?

      An extra band can be seen in 2J for N. However, as the reviewer points out, the intensity of the lower band is fainter than in 2A or 2H. The biology of SARS-CoV-2 N is interesting and complicated, with different truncated isoforms and phosphorylation patterns observed (see for example Mears et al., 2025 PMID:39836705). We observed changes in abundance of the second band between experiments, but this did not obviously depend on VAIL. We therefore consider this to be beyond the scope of this investigation.

      *Reviewer #3 (Significance (Required)): *

      • This manuscript proposes a role for VAIL in LC3 lipidation during SARS-CoV-2 infection. While the findings are interesting, VAIL only marginally contributes to LC3 lipidation during SARS-CoV-2 infection. Therefore, the significance of VAIL to LC3B lipidation during SARS-CoV-2 infection is unclear.*

      Our experiments show unambiguously that VAIL contributes to viral replication. Therefore, even if VAIL accounts for only part of the LC3 lipidation observed during infection, we consider its contribution biologically important. As alluded to above, we do not think a further investigation of canonical macroautophagy and SARS-CoV-2 would enhance the quality of our manuscript. We will try to make our description of the contribution of macroautophagy clearer in the revised manuscript (without providing a full literature review). We also do not think that exploring the nature of the multiple N bands on western blot is within the scope of this paper.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) The authors demonstrate that female Spodoptera littoralis moths prefer to oviposit on well-watered tomato plants and avoid drought-stressed plants. The study then recorded the sounds produced by drought-stressed plants and found that they produce 30 ultrasonic clicks per minute. Thereafter, the authors tested the response of female S. littoralis moths to clicks with a frequency of 60 clicks per minute in an arena with and without plants and in an arena setting with two healthy plants of which one was associated with 60 clicks per minute. These experiments revealed that in the absence of a plant, the moths preferred to lay eggs on the side of the arena in which the clicks could be heard, while in the presence of a plant the S. littoralis females preferred to oviposit on the plant where the clicks were not audible. In addition, the authors also tested the response of S. littoralis females in which the tympanic membrane had been pierced, making the moths unable to detect the click sounds. As hypothesised, these females placed their eggs equally on both sides of the arena.

      Finally, the authors explored whether the female oviposition choice might be influenced by the courtship calls of S. littoralis males which emit clicks in a range similar to a drought-stressed tomato plant. However, no effect was found of the clicks from ten males on the oviposition behaviour of the female moths, indicating that the females can distinguish between the two types of clicks. Besides these different experiments, the authors also investigated the distribution of egg clusters within a longer arena without a plant, but with a sugar-water feeder. Here it was found that the egg clusters were mostly aggregated around the feeder and the speaker producing 60 clicks per minute. Lastly, video tracking was used to observe the behaviour of the moths in the arena without a plant, which demonstrated that the moths gradually spent more time at the arena side with the click sounds.

      We thank the reviewers for their helpful comments. We agree with the summary, but would like to note that in the control experiment (Figure 2) we used a click rate of 30 clicks per minute—a design choice driven by the editor’s feedback. We have clarified this and, to further probe the system’s dynamics, added a second experiment employing the same click rate (30 clicks per minute) with a dehydrated plant (see details below). In both experiments, females again showed a clear tendency to oviposit nearer the speaker; these findings are described in the updated manuscript.

      (2) The study addresses a very interesting question by asking whether female moths incorporate plant acoustic signals into their oviposition choice. Unfortunately, I find it very difficult to judge how big the influence of the sound on the female choice really is, as the manuscript does not provide any graphs showing the real numbers of eggs laid on the different plants, but instead only provides graphs with the Bayesian model fittings for each of the experiments. In addition, the numbers given in the text seem to be relatively similar with large variations, e.g. Figure 1B3: 1.8 ± 1.6 vs. 1.1 ± 1.0. Furthermore, the authors do not provide access to any of the raw data or scripts of this study, which also makes it difficult to assess the potential impact of this study. Hence, I would very much like to encourage the authors to provide figures showing the measured values as boxplots including the individual data points, especially in Figure 1, and to provide access to all the raw data underlying the figures.

      We acknowledge that there are researchers who favor Bayesian graphical representation versus raw data visualization. Therefore, we have added chartplots of the raw data from Figure 1 in the supplementary section. We are aware of the duplication in presentation and apologize for this redundancy.  

      Regarding the variance and means we obtained in our experiment, we have analyzed all raw data using the statistical model presented, and if statistical significance was found despite a particular mean difference or variance, this is meaningful from a biological perspective. One can certainly discuss whether this difference has biological importance, but it should be remembered that in this experimental system, we are trying to isolate the acoustic signal from a complex system that includes multiple signals. Therefore, at no point have we suggested that this is a standalone factor, but rather proposed it as an informative and significant component.

      In addition to the experiments described above, we conducted an experiment in which we counted both eggs and clusters. The results indicate that cluster counts are a reliable proxy for reproductive investment at a given location. In this experiment, we present cluster numbers alongside egg counts (Figure 2).

      Furthermore, we apologize for the technical error that prevented our uploaded data files from reaching the reviewers. We have also uploaded updated data and code.

      (3) Regarding the analysis of the results, I am also not entirely convinced that each night can be taken as an independent egg-laying event, as the amount of eggs and the place where the eggs are laid by a female moth surely depends on the previous oviposition events. While I must admit that I am not a statistician, I would suggest, from a biological point of view, that each group of moths should be treated as a replicate and not each night. I would therefore also suggest to rather analyse the sum of eggs laid over the different consecutive nights than taking the eggs laid in each night as an independent data point.

      We thank the reviewer for this question. This is a valid point that we will address in three aspects:

      First, regarding our statistical approach, we used a model that takes into account the sequence of nights and examines whether there is an effect of the order of nights, i.e., we used GLMMs, with the night nested within the repetition. This is equivalent to addressing this as a repeated measure and is, to the best of our knowledge, the common way to treat such data.

      Second, following the reviewer's comment, we also reran the statistics of the third experiment (i.e., “sound gradient experiments”, Figure 2 and Supplementary figure 4) when only taking the first night when the female/s laid eggs to avoid the concern of dependency. This analysis revealed the same result – i.e., a significant preference for the sound stimulus. We have now updated our methods and results section to clarify this point.  

      Third, an important detail that may not have been clearly specified in the methods: at the end of each night, we cleaned the arena of counted egg clusters using a cloth with ethanol, so that on the subsequent night we would not expect there to be evidence of previous oviposition, although this would not exclude some form of physiological or cognitive memory. We have now updated our methods section to clarify this important procedural point.
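      The first-night re-analysis described in the second point can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis reported in the manuscript: the response above states that GLMMs were used, whereas this sketch swaps in a simple exact sign test, and the counts, data layout and function names are all hypothetical.

```python
from math import comb


def first_laying_night(nights):
    """Return (speaker_side, silent_side) counts for the first night with any eggs."""
    for speaker_side, silent_side in nights:
        if speaker_side + silent_side > 0:
            return speaker_side, silent_side
    return None  # this replicate never laid eggs


def sign_test_two_sided(wins, losses):
    """Exact two-sided binomial (sign) test against an even 50:50 split."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


def first_night_preference(replicates):
    """Count replicates whose first laying night favoured the speaker side."""
    wins = losses = 0
    for nights in replicates:
        first = first_laying_night(nights)
        if first is None or first[0] == first[1]:
            continue  # no eggs, or a tie: uninformative for a sign test
        if first[0] > first[1]:
            wins += 1
        else:
            losses += 1
    return wins, losses, sign_test_two_sided(wins, losses)


# Made-up cluster counts: one list per replicate, one (speaker, silent) pair per night.
replicates = [
    [(0, 0), (3, 1), (2, 2)],
    [(2, 0), (0, 1)],
    [(0, 0), (0, 0)],
    [(1, 2), (4, 0)],
    [(5, 1)],
]
wins, losses, p_value = first_night_preference(replicates)
print(wins, losses, p_value)
```

      Using only the first laying night per replicate gives one data point per group of moths, which sidesteps the dependency between consecutive nights at the cost of discarding most of the data.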

      (4) Furthermore, it did not become entirely clear to me why a click frequency of 60 clicks per minute was used for most experiments, while the plants only produce clicks at a range of 30 clicks per minute. Independent of the ecological relevance of these sound signals, it would be nice if the authors could provide a reason for using this frequency range. Besides this, I was also wondering about the argument that groups of plants might still produce clicks in the range of 60 clicks per minute and that the authors' tests might therefore still be reasonable. I would agree with this, but only in the case that a group of plants with these sounds would be tested. Offering the choice between two single plants while providing the sound from a group of plants is in my view not the most ecologically reasonable choice. It would be great if the authors could modify the argument in the discussion section accordingly and further explore the relevance of different frequencies and dB levels.

      This is an excellent point. We originally increased the click rate to generate a strong signal. However, it was important for us to verify that there was ecological relevance in the stimulus we implemented in the system. For this purpose, we recorded a group of dehydrated plants at a distance of ~20cm and we measured a click rate of 20 clicks per minute (i.e., 0.33 Hz) (see Methods section). Therefore, as mentioned at the beginning of this letter, in the additional experiment described in Figure 2, we reduced the click frequency to 30 clicks per minute, and at this lower rate, the effect was maintained. Increasing plant density would probably lead to a higher rate of 30 clicks per minute.

      (5) Finally, I was wondering how transferable the findings are towards insects and Lepidopterans in general. Not all insects possess a tympanic organ and might therefore not be able to detect the plant clicks that were recorded. Moreover, I would imagine that generalist herbivores like Spodoptera might be more inclined to use these clicks than specialists, which very much rely on certain chemical cues to find their host plants. It would be great if the authors would point more to the fact that your study only investigated a single moth species and that the results might therefore only hold true for S. littoralis and closely related species, but not necessarily for other moth species such as Sphingidae or even butterflies.

      Good point. Our research uses a specific model system of one moth species and one plant species in a particular plant-insect interaction where females select host plants for their offspring. As with any model-based research that attempts to draw broader conclusions, we've taken care to distinguish between our direct findings and potential wider implications. We believe our system may represent mechanisms relevant to a wider group of herbivorous insects with hearing capabilities, particularly considering that several moth families and other insect orders can detect ultrasound. However, additional research examining more moth and plant species is necessary to determine how broadly applicable these findings are. We have made these clarifications in the text.

      Reviewer #2 (Public review):

      (6) The results are intriguing, and I think the experiments are very well designed. However, if female moths use the sounds emitted by dehydrated plants as cues to decide where to oviposit, the hypothesis would predict that they would avoid such sounds. The discussion mentions the possibility of a multi-modal moth decision-making process to explain these contradictory results, and I also believe this is a strong possibility. However, since this remains speculative, careful consideration is needed regarding how to interpret the findings based solely on the direct results presented in the results section.  

      Thank you for this insightful observation. We agree that the apparent attraction of females to dehydrated-plant sounds contradicts our initial prediction. Having observed this pattern consistently across multiple setups, we have now added a targeted choice experiment to the revised manuscript: here female moths were offered a choice between dehydrated plants broadcasting their natural ultrasonic emissions and a control. These results, detailed in the Discussion and presented in full in the Supplementary Materials (Supplementary Figure 4), show that when only a dehydrated plant is available, moths would prefer it for oviposition, supporting our hypothesis that in the absence of a real plant, the plant’s sounds might represent a plant.

      (7) Additionally, the final results describing differences in olfactory responses to drying and hydrated plants are included, but the corresponding figures are placed in the supplementary materials. Given this, I would suggest reconsidering how to best present the hypotheses and clarify the overarching message of the results. This might involve reordering the results or re-evaluating which data should appear in the main text versus the supplementary materials

      Thank you for this suggestion. We have reorganized the manuscript and removed the olfactory response data from the current version to maintain a focused narrative on acoustic cues. We agree that a detailed investigation of multimodal interactions deserves a separate study, which we plan to pursue in future work. 

      (8) There were also areas where more detailed explanations of the experimental methods would be beneficial.

      Thank you for highlighting this point. We have expanded and clarified the Methods section to provide comprehensive detail on our experimental procedures.

      Reviewer #1 (Recommendations for the authors):

      (9) Line 1: Please include the name of the species you tested also in the title as your results might not hold true for all moth species.

      We do not fully agree with this comment. Please see comment 5.

      (10) Line 19-20: Please rephrase the sentence so that it becomes clear that the "dehydration stress" refers to the plant and not to the moths.

      Thank you for the suggestion; we have clarified the text accordingly.

      (11) Line 31: Male moths might provide many different signals to the females, maybe better "male sound signals" or similar.

      Thank you for the suggestion; we have clarified the text accordingly.

      (12) Line 52-53: Maybe mention here that not all moth species have evolved these abilities.

      Thank you for the suggestion; we have clarified the text accordingly.

      (13) Line 77: add a space after 38.

      Thank you for the suggestion; we have clarified the text accordingly.

      (14) Line 88: Maybe change "secondary predators" to "natural enemies".

      Thank you for the suggestion; we have clarified the text accordingly.

      (15) Line 134: Why is "notably" in italics? I would suggest using normal spelling/formatting rules here.

      Thank you for the suggestion; we have clarified the text accordingly.

      (16) Line 140-144: If you did perform the experiment also with the more ecological relevant playback rate, why not present these findings as your main results and use the data with the higher playback frequency as additional support?

      Thank you for this suggestion. We agree that the ecologically relevant playback data are important. However, as described in detail at the beginning of this letter and in comment 4, we have maintained the original ordering of this section to preserve a clear and cohesive narrative. The experiments in Figure 1 differ in several components from those in Figure 2 and from the work in the appendices that examined sounds in plant groups. We therefore find it more appropriate to use them as supporting evidence for the main findings rather than to compare different experimental systems, and we chose to keep them as a separate description. The ecological playback findings (Lines 140–144) remain fully described in the Results and serve to reinforce the main observations without interrupting the manuscript's flow.

      (17) Line 146: Please explain already here how you deafened the moths.

      Thank you for the suggestion; we have clarified the text accordingly.

      (18) Line 181: should it be "male moths' " ?

      Thank you for the suggestion; we have clarified the text accordingly.

      (19) Line 215: Why is "without a plant" in italics? I would suggest using normal spelling/formatting rules here.

      Thank you for the suggestion; we have clarified the text accordingly.

      (20) Line 234: I do not understand why this type of statistic was used to analyse the electroantennogram (EAG) results. Would a rather simple Student's t-test or a Wilcoxon rank sum test not have been sufficient? I would also like to caution you not to overinterpret the data derived from the EAG, as you combined the entire headspace into one mixture it is no longer possible to derive information on the different volatiles in the blends. The differences you observe might therefore mostly be due to the amount of emitted volatiles.

      We have reorganized the manuscript and removed the olfactory response data from the current version to maintain a focused narrative on acoustic cues (See comment 7). 

      (21) Line 268: It might be nice to add an additional reference here referring to the multimodal oviposition behaviour of the moths.

      Thank you for the suggestion; we have clarified the text accordingly.

      (22) Line 284: If possible, please add another reference here referring to the different cues used by moths during oviposition.

      Thank you for the suggestion; we have clarified the text accordingly.

      (23) Line 336: What do you mean by "closed together"?

      Thank you for the suggestion; we have clarified the text accordingly.

      (24) Line 434-436: Please see my overall comments. I do not think that you can call it ecologically relevant if the signal emitted by multiple plants is played in the context of just a single plant.

      Please see comments 1 and 4.

      (25) Line 496: Please change "stats" to statistics.

      Thank you for the suggestion; we have clarified the text accordingly.

      (26) Line 522-524: I am not sure whether simply listing their names does give full credit to the work these people did for your study. Maybe also explain how they contributed to your work.

      Thank you for the suggestion; we have clarified the text accordingly.

      Reviewer #2 (Recommendations for the authors):

      (27) L54 20-60kHz --> 20Hz-60kHz or 20kHz - 60kHz?

      OK. We have replaced it.

      (28) L124 Are the results for the condition where nothing was placed and the condition where a decoy silent resistor was placed combined in the analysis? If so, were there no significant differences between the two conditions? Comparing these with a condition presenting band-limited noise in the same frequency range as the drought-stressed sounds might also have been an effective approach to further isolate the specific role of the ultrasonic emissions.

      We used both conditions due to technical constraints and pooled them together for analysis. Statistical tests confirmed no significant differences between them, and this clarification, including the results of the statistical test, has now been added to the Methods section.

      (29) L125 (Fig. 1A), see Exp. 1 in the Methods). -> (Fig.1B. See Exp.1 in the Methods).

      Thank you for the suggestion; we have clarified the text accordingly.

      (30) L132 "The opposite choice to what was seen in the initial experiment (Fig.1B)"

      Thank you for the suggestion; we have clarified the text accordingly.

      (31) L137-143 If you are writing about results, why not describe them with figures and statistics? The current description reads like a discussion.

      These findings were not among our primary research questions; however, we believe that including them in the Results section underscores the experimental differences. In our opinion, introducing an additional figure or expanding the statistical analysis at this point would disrupt the narrative flow and risk confusing the reader.

      (32) L141 "This is higher than the rate reported for a single young plant" Are you referring to the tomato plants used in the experiments? It might be helpful to include in the main text the natural click rate emitted by tomato plants, as this information is currently only mentioned in the Methods section.

      See comment 4.  

      (33) L191 Is the main point here to convey that the plant playback effect remained significant even when the sound presentation frequency was reduced to 30 clicks per minute? The inclusion of the feeder element, however, seems to complicate the message. To simplify the results, moving the content from lines 185-202 to the supplementary materials might be a better approach. Additionally, what is the rationale for placing the sugar solution in the arena? Is it to maintain the moths' vitality during the experiment? Clarifying this in the methods section would help provide context for this experimental detail.

      In this series of experiments, we manipulated four variables—single moths, ultrasonic click rate, arena configuration (from a two-choice design to an elongated enclosure), and the response metric (total egg counts rather than cluster counts)—to evaluate moth oviposition under more ecologically realistic conditions. We demonstrate the system’s robustness and validity in a more realistic setting (by tracking individual moths, counting single eggs, etc.).  

      As noted in the text, feeders were included to preserve the moths’ natural behavior and vitality. We have further clarified this in the revised manuscript.

      (34) L215 Is the click presentation frequency 30 or 60 per minute? Since Figure 3 illustrates examples of moth movement from the experiment described in Figure 1, it might be more effective to present Figure 3 when discussing the results of Figure 1 or to include it in the supplementary materials for better clarity and organization.

      See comments 1 and 4. As mentioned in the above 

      (35) L291 Please provide a detailed explanation of the experiments and measurements for the results shown in Figure S3 (and Figure S2). If the multi-modal hypothesis discussed in the study is a key focus, it might be better to include these results in the main results section rather than in the supplementary materials.

      Thank you for this suggestion. Figure S2 was removed (see comments above). We have now added context to Figure S3.

      (36) L303 It might be helpful to include information about the relationship between the moth species used in this study and tomato plants somewhere in the text. This would provide an important context for understanding the ecological relevance of the experiments.

      Thank you for the suggestion; we have clarified the text accordingly.

      (37) Table 1 The significant figures in the numbers presented in the tables should be consistent.

      Thank you for the suggestion; we have clarified the text accordingly.

      (38) L341 The text mentions that experiments were conducted in a greenhouse, but does this mean the arena was placed inside the greenhouse? Also, the term "arena" is used - does this refer to a sealed rectangular case or something similar? For the sound presentation experiments, it seems that the arena cage was placed inside a soundproof room. If the arena is indeed a case-like structure, were there any specific measures taken to prevent sound scattering within the case, such as the choice of materials or structural modifications?

      Here, “arena” refers to the plastic boxes used throughout this study. In this particular experiment, we presented plants alone—reflecting ongoing debate in the literature—and used these trials as a baseline for our subsequent sound-presentation experiments, during which we measured sound intensity as described in the Methods section. All sound-playback experiments were conducted in sound-proof rooms, and acoustic levels were measured beforehand—sound on the control side fell below our system’s detection threshold. 

      (39) L373 "resister similar to the speaker" Could you explain it in more detail? I think this would depend on the type of speaker used-particularly whether it includes magnets. From an experimental perspective, presenting different sounds such as white noise from the speaker might have been a better control. Was there a specific reason for not doing so? Additionally, the study does not clearly demonstrate whether the electric and magnetic field environments on both sides of the arena were appropriately controlled. Without this information, it is difficult to evaluate whether using a resistor as a substitute was adequate.

      Thank you for this comment. We have now addressed this point in the Discussion. We acknowledge that we did not account for the magnetic field, which might have differed between the speaker and the resistor. We agree that using an alternative control, such as white noise, could have been informative, and we now mention this as a limitation in the revised Methods.

      (40) L435 60Hz? The representation of frequencies in the text is inconsistent, with some values expressed in Hz and others as "clicks per second." It would be better to standardize these units for clarity, such as using Hz throughout the manuscript.

      We agree that this is confusing. We reviewed the text and made sure that "clicks per second" always refers to the number of clicks produced, and that Hz is used only for sound frequencies.

      (41) L484 "we quantified how many times each individual crossed the center of the arena" Is this data being used in the results?

      Yes. Mentioned in the text just before Figure 3. L220

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      We appreciate the constructive and supportive feedback on our manuscript. All three reviewers acknowledged the significance and novelty of our work on bacterial telomere protection. In response to their suggestions, we have conducted the requested experiments and revised the manuscript accordingly. These changes have enhanced the rigor of our study and clarified our interpretations and explanations.

      Moreover, we characterized an additional truncation mutant of TelN (TelN Δ445–631), which lacks the two C-terminal domains. Despite this deletion, the mutant retained protection activity (Supplementary Figure S4B), indicating that the remaining regions of the protein are sufficient to confer efficient protection in this assay.

      Finally, we removed three sequence alignments (previously Supplementary Figures S6A and S7), as we recognized that the high degree of sequence divergence could hinder proper alignment and potentially lead to misinterpretation.

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      This study addresses how the bacterial telomere protein TelN protects telomere ends against the action of the Mre11-Rad50 nuclease (MR). This protection is essential for the stability of hairpin-ended linear plasmids and chromosomes in bacteria but had not been explored before. The authors demonstrate that TelN is necessary and sufficient to block MR-dependent DNA cleavage when bound to its specific telomere sequence. By combining elegant genetics and biochemical approaches, it convincingly shows that TelN-dependent inhibition likely involves a specific interaction between TelN and the MR complex. The manuscript is well written, easy to read and focused on the relevant information. The claims and the conclusions are supported by the data. There is no over-interpretation.

      Comments:

      • Figure 1B: unnormalized transformation efficiency would be useful to show in SI

      The unnormalized B. subtilis transformation efficiency has now been added as new figure panel S1B.

      • Figures 2B, 2C, 3C, 3D, 4C, 5A and 5B: quantification of independent experiments should be added

      While these DNA protection experiments show a clearly reproducible pattern of DNA degradation, the exact response to TelN titration varies somewhat between experimental replicates. We initially included the quantification of remaining full-length DNA because the corresponding band is hard to discern in the gel image due to pixel saturation. However, we now realize that this may mislead readers into thinking that degradation always follows exactly the same dose response.

      To avoid this, we have decided to remove the quantification and instead show the relevant part of the gel also at higher contrast to better visualize the loss of full-length DNA due to DNA degradation. In addition, we have included replicate experiments carried out at the same MR concentration (125 nM M₂R₂) or at higher concentration (500 nM M₂R₂) in the supplementary material. These examples demonstrate the general reproducibility of the assay.

      **Referee cross-commenting**

      Perfect for me. It seems that there is a consensus.

      Reviewer #1 (Significance (Required)):

      This pioneering study provides a very strong basis for a new understanding of telomeres in bacteria and offers fascinating evolutionary perspectives when compared to similar mechanisms active at telomeres in eukaryotic cells.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      The paper is well-presented and well-written throughout. The paper shows convincingly that TelN protects hairpin DNA ends from the activity of SbcCD, presumably providing a protection mechanism for N15 phage DNA in vivo. Furthermore, this protection activity is shown not to require the catalytic (resolvase) activity of TelN, nor its poorly characterised C-terminal domain. The paper also suggests that this inhibition acts both at the level of competition for the DNA hairpin end and at the level of a direct protein:protein interaction between TelN and MR. An (acknowledged) weakness is that there is no real insight into the protein:protein interaction suggested by the experiments shown in Figure 5. Ideally, the protein:protein interaction interface would be identified and mutations in this interface would be shown to reduce hairpin protection.

      Specific comments/questions

      (1) What pathway (in vivo) leads to inactivation of linear hairpin DNA - one suspects that cleavage by SbcCD at the hairpins is probably not the full story. Presumably SbcCD cleavage facilitates further processing by other long range resection systems such as RecBCD, Exo1, RecQ/J etc. Would it be appropriate to view the hairpin as an adaption to protect against these nucleases, which then must be complemented with a mechanism to suppress SbcCD?

      The reviewer's suggestion that hairpin ends represent a first layer of adaptation against nucleolytic processing is compelling. Hairpin structures inherently resist many exonucleases due to their covalently closed nature (absence of free 3’ or 5’ ends) but remain vulnerable to MR processing (Connelly et al, 1998, 1999; Saathoff et al, 2018). This creates a scenario where effective telomere protection requires both the structural barrier provided by the hairpin and an active mechanism to suppress MR activity. We have added this perspective to the relevant paragraph in the discussion.

      (2) Section starting "Direct inhibition of MR by TelN in vitro". What is the word direct supposed to convey here? To me it suggests that the inhibition is via direct interaction of TelN with MR (rather than, for example, a result of competition for the hairpin DNA end) which is not shown here. Suggest either defining or removing the word direct. This point gains more importance considering that differentiating between inhibition mechanisms becomes a focus of later parts of the paper.

      By "direct inhibition," we meant that TelN blocks MR nuclease activity without requiring additional cofactors, as demonstrated in this minimal reaction system containing only TelN, MR complex, DNA substrate, and ATP. To avoid ambiguity, we have reworded the corresponding headline and paragraph.

      (3) Figure 2B - Why no control lane without MR? - this is a basic control to show that the degradation we are seeing in the absence of TelN is MR-dependent. Formally, as shown, the degradation could be caused by the ATP stock.


      We have now included ATP-only control lanes (without MR complex), which show no substrate degradation, confirming that ATP stocks do not contain contaminating nucleases and that the observed degradation is indeed MR-dependent. These controls are included in the supplementary data (Figure S3A) along with additional replicate experiments. Notably, the dose-dependent protection observed at low TelN concentrations (where MR activity is not fully inhibited) provides additional evidence for the specificity of the MR-TelN interaction system, as non-specific nuclease contamination would result in complete substrate degradation regardless of TelN concentration.

      (4) Why not use B. subtilis SbcCD for the species specificity experiment? Also, is it not surprising that TelN yielded zero protection against MRX given that the DNA sequence specificity experiments above suggest competition for DNA substrate is part of the inhibition mechanism?


      We agree that this would be a great addition. Despite multiple attempts, we were unable to purify active B. subtilis SbcCD protein. The yeast MRX experiment serves the same purpose of demonstrating species specificity and represents a more evolutionarily distant comparison, which strengthens our conclusions about bacterial-specific inhibition.

      (5) If the authors felt it appropriate, I thought there was scope for further discussion/introductory material. There are strong parallels here with mechanisms used by phage to protect themselves from the activities of RecBCD, which include both proteins that protect DNA ends like T4 gene 2, as well as proteins that bind directly to RecBCD to inactivate it like lambda Gam. As such, the work here will appeal as much to those interested in bacterial defence systems / phage:host interactions as it does to those interested in telomere biology. Especially significant is the inhibition of DNA end processing factors by lambda Gam since this protein is reported to interact with both RecBCD and SbcCD (PMID: 2531105).

      We agree that there are obvious parallels between lambda Gam and TelN as counter-defence factors. This was likely largely missed in previous work because the telomere resolution activity of TelN masked its function in counter-defence. We have added a statement on this matter at the end of the discussion.

      (6) Just a gripe really: it seems to be 'de rigueur' at the moment to re-name bacterial proteins for their human orthologues, presumably to elevate the perceived importance of the work(?), but it is not a practice I think is terribly helpful as it causes issues when searching literature. Minimally it would be great if the authors could ensure they add SbcCD as a keyword for search purposes.

      We appreciate the reviewer's concern about nomenclature inconsistencies in the literature. We have chosen MR over SbcCD as a more generic term that covers eukaryotes, archaea and lately also bacteria and will hopefully contribute to a more consistent terminology in the literature across the domains of life in the future. Our choice to use "Mre11-Rad50" (MR) for the E. coli SbcCD complex is also consistent with prominent recent publications (Käshammer et al., 2019; Gut et al., 2022), explicitly referring to the E. coli system as "Mre11-Rad50" while acknowledging the bacterial designation. To link to previous literature, we made sure that both "SbcCD" and "Mre11-Rad50" are mentioned in the abstract. And, as suggested, we have now also added “SbcCD” to our keyword list to facilitate comprehensive literature searches.

      **Referee cross-commenting**

      I have nothing to add. The reviewers' comments are all broadly positive and consistent.

      Reviewer #2 (Significance (Required)):

      This is an excellent paper unveiling a phage encoded "counter-defence" mechanism designed to protect phage DNA from degradation. It will be of special interest to those studying telomere biology or phage:host interactions.



      Reviewer #3

      The authors investigate how the N15 phage protelomerase TelN protects linear chromosomes that terminate in hairpin structures (a sort of telomere). In E. coli and B. subtilis cells, removal or truncation of telN reduces transformation/survival of linear DNA, whereas complementation with full-length or a catalytically inactive TelN restores viability, consistent with TelN playing a nonenzymatic capping function.

      In vitro, TelN binds hairpin substrates with moderate affinity and protects them from the nuclease activity of the Mre11/Rad50 complex. The authors propose that TelN originated as an early, sequence specific barrier against MR mediated DNA end processing, establishing fundamental principles of telomere protection that persist from bacteria to eukaryotes.

      Major comments:

      The manuscript convincingly shows that TelN can functionally block the Mre11-Rad50 (MR) nuclease on a hairpin DNA end in a sequence specific manner (suggesting a physical interaction), but it doesn't directly demonstrate this. A simple pull-down or equilibrium binding method would be useful in proving a physical interaction.

      We agree that this would be a valuable addition to the study. We have made several attempts to detect direct interaction by co-immunoprecipitation, so far without success. We do not (yet) have sufficient material for equilibrium binding methods.


      The MR complex requires ATP hydrolysis for resection of DNA ends. It would be a nice addition to the manuscript if the effect of TelN on Rad50 ATPase activity was tested.


      We have tested the effect of TelN on Rad50 ATPase activity and found no significant impact under our experimental conditions, possibly in line with the lack of stable interaction.

      The bar plot on Fig 3B indicates that the experiments are performed in triplicate. The statistical significance of the differences between conditions should be determined. The same general comment could be made regarding the quantification of the polyacrylamide gels - how reproducible are these values?


      We performed paired t-test analysis for the following figures and now indicate the p-values wherever significant (below 0.05): Figures 1D, 1E, 3B, 4B and S4B. We used paired t-tests to generally compare linear vs circular plasmid transformation efficiency for each condition. In Figure 4B, which included two different linear DNA constructs, we compared the two linear DNA constructs directly to each other. [Given that our experimental design included multiple control conditions with known expected outcomes to validate assay performance, rather than many independent exploratory comparisons, we report uncorrected p-values as the primary analysis. The inclusion of multiple controls with predictable outcomes reduces the likelihood of false positive interpretations.]
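      For readers unfamiliar with the test named above, a paired t-test compares two conditions measured within the same replicate by testing whether the mean of the per-replicate differences is zero. The following is a minimal sketch only, not the authors' analysis script: the function name `paired_t` and the example values are hypothetical, and the p-value would still need to be read from a t distribution with the returned degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(x, y):
    """Return (t statistic, degrees of freedom) for paired samples x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    # Work on the per-replicate differences, which removes between-replicate variation.
    diffs = [a - b for a, b in zip(x, y)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    t = mean(diffs) / (sd / sqrt(n))
    return t, n - 1


# Hypothetical triplicate values, for illustration only (not data from the study).
circular = [1.0, 2.0, 4.0]
linear = [0.5, 1.0, 2.0]
t_stat, dof = paired_t(circular, linear)
```

      With triplicates there are only two degrees of freedom, so the test has limited power and a consistent direction of effect across replicates matters as much as the p-value itself.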

      As stated in response to reviewer 1, while the exact values for the DNA degradation profile vary somewhat between experiments (likely due to variations in band quantification – see also response to comment below), the general trends are robust as for example indicated by similar experiments performed with higher MR concentration (500 nM instead of 125 nM M₂R₂ concentrations for all TelN variants) demonstrating reproducibility across different conditions. For Figure 5, however, we are unable to provide additional repeat experiments due to limitations in reagent availability. Considering the robust effect seen with Ec MR controls and the presence of multiple samples in the dilution series, we are nevertheless confident about the conclusion.

      Minor comments:

      A better explanation of how the gels were quantified should be provided. Were the products included in the analysis, or was it just the decrease in the substrate band that was measured?

      As also stated above, we have removed the band quantification and instead show the bands also at different contrast settings.

      In our original approach, gel band quantification was performed using ImageQuant TL software (version 8.2.0, GE Healthcare). For each gel, individual lanes were defined using either fixed-width boundaries (95-103 pixels) or automatic edge detection, depending on the gel quality and band definition. Band volumes were calculated using rolling ball background subtraction (radius 180 pixels) with automatic band detection. Substrate degradation was assessed by measuring the integrated density (volume) of the remaining full-length (or near full-length) substrate bands under different treatment conditions. The band volume values were plotted directly to compare substrate levels across treatment groups.

      We now present the data as two gel panels: an exposure showing the full reaction profile, and another exposure focusing on the substrate bands to clearly demonstrate dose-dependent protection. Additional replicate experiments including ATP-only controls (confirming no contamination from ATP stocks) and experiments at 500 nM M₂R₂ concentrations, are provided in the supplementary data. This approach provides more direct visualization of the biological phenomenon with comprehensive control validation.

      I felt like the Results jump rather abruptly from B. subtilis chromosome assays to E. coli plasmid experiments. Maybe the addition of a few linking sentences would improve this transition.


      Upon re-reading the manuscript we agree with this assertion and have added further information to provide a smoother transition.

      A comment on the stoichiometry of TelN and genome ends during phage replication would be useful.

      Our in vitro data suggest that effective protection can be achieved at relatively low TelN:DNA ratios, consistent with the formation of stable, protective nucleoprotein structures. Unfortunately, we do not currently have information on the copy number of TelN per cell or per hairpin end, and reliable values for these numbers are not easy to obtain. However, we can speculate that multiple TelN proteins are present at each end, because each telomeric DNA carries three copies of a DNA sequence motif (binding to CTD1).

      Reviewer #3 (Significance (Required)):

      General assessment:

      Strengths: A nice combination of genetics and biochemistry convincingly demonstrates that TelN protects linear chromosomes/replicons from MR-dependent degradation independent of its cleavage-ligase activity. It does this by binding to the hairpin DNA ends in a sequence specific fashion and the species specificity suggests a direct physical interaction, which likely inhibits the nuclease activity of the MR complex.

      Limitations: The lack of characterization of the putative physical interaction between TelN and the MR complex is considered a weakness.

      Advance: The manuscript fills in a mechanistic gap between protelomerase-mediated telomere formation and maintenance by demonstrating a protective/capping role. This is the first quantitative analysis of DNA-end protection from MR nuclease activity by TelN.

      Audience: Readers interested in bacterial chromosome biology and DNA repair. The parallels to eukaryotic shelterin will be interesting to the broader telomere and genome stability communities.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      This study investigates the sex determination mechanism in the clonal ant Ooceraea biroi, focusing on a candidate complementary sex determination (CSD) locus, one of the key mechanisms supporting haplodiploid sex determination in hymenopteran insects. Using whole genome sequencing, the authors analyze diploid females and the rarely occurring diploid males of O. biroi, identifying a 46 kb candidate region that is consistently heterozygous in females and predominantly homozygous in diploid males. This region shows elevated genetic diversity, as expected under balancing selection. The study also reports the presence of an lncRNA near this heterozygous region, which, though only distantly related in sequence, resembles the ANTSR lncRNA involved in female development in the Argentine ant, Linepithema humile (Pan et al. 2024). Together, these findings suggest a potentially conserved sex determination mechanism across ant species. However, while the analyses are well conducted and the paper is clearly written, the insights are largely incremental. The central conclusion - that the sex determination locus is conserved in ants - was already proposed and experimentally supported by Pan et al. (2024), who included O. biroi among the studied species and validated the locus's functional role in the Argentine ant. The present study thus largely reiterates existing findings without providing novel conceptual or experimental advances.

      Although it is true that Pan et al., 2024 demonstrated (in Figure 4 of their paper) that the synteny of the region flanking ANTSR is conserved across aculeate Hymenoptera (including O. biroi), Reviewer 1’s claim that that paper provides experimental support for the hypothesis that the sex determination locus is conserved in ants is inaccurate. Pan et al., 2024 only performed experimental work in a single ant species (Linepithema humile) and merely compared reference genomes of multiple species to show synteny of the region, rather than functionally mapping or characterizing these regions.

      Other comments:

      The mapping is based on a very small sample size: 19 females and 16 diploid males, and these all derive from a single clonal line. This implies a rather high probability for false-positive inference. In combination with the fact that only 11 out of the 16 genotyped males are actually homozygous at the candidate locus, I think a more careful interpretation regarding the role of the mapped region in sex determination would be appropriate. The main argument supporting the role of the candidate region in sex determination is based on the putative homology with the lncRNA involved in sex determination in the Argentine ant, but this argument was made in a previous study (as mentioned above).

      Our main argument supporting the role of the candidate region in sex determination is not based on putative homology with the lncRNA in L. humile. Instead, our main argument comes from our genetic mapping (in Fig. 2), and the elevated nucleotide diversity within the identified region (Fig. 4). Additionally, we highlight that multiple genes within our mapped region are homologous to those in mapped sex determining regions in both L. humile and Vollenhovia emeryi, possibly including the lncRNA.

      In response to the Reviewer’s assertion that the mapping is based on a small sample size from a single clonal line, we want to highlight that we used all diploid males available to us. Although the primary shortcoming of a small sample size is to increase the probability of a false negative, small sample sizes can also produce false positives. We used two approaches to explore the statistical robustness of our conclusions. First, we generated a null distribution by randomly shuffling sex labels within colonies and calculating the probability of observing our CSD index values by chance (shown in Fig. 2). Second, we directly tested the association between homozygosity and sex using Fisher’s Exact Test (shown in Supplementary Fig. S2). In both cases, the association of the candidate locus with sex was statistically significant after multiple-testing correction using the Benjamini-Hochberg False Discovery Rate. These approaches are clearly described in the “CSD Index Mapping” section of the Methods.
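
      The second test described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes a 2x2 table built from the counts quoted in this review (11 of 16 diploid males homozygous, all 19 females heterozygous at the candidate locus), and the function names fisher_exact_two_sided and benjamini_hochberg are hypothetical. The within-colony permutation test is not shown because the CSD index itself is not defined in this letter.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    return sum(
        p for p in map(prob, range(lowest, highest + 1)) if p <= p_obs * (1 + 1e-9)
    )


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Assumed table: rows are males and females, columns are homozygous and
# heterozygous at the candidate locus (counts taken from the review text).
p_locus = fisher_exact_two_sided(11, 5, 0, 19)
print(f"{p_locus:.2e}")
```

      In a genome-wide scan, one such p-value would be computed per window and the full list passed to benjamini_hochberg before applying a significance threshold.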

      We also note that, because complementary sex determination loci are expected to evolve under balancing selection, our finding that the mapped region exhibits a peak of nucleotide diversity lends orthogonal support to the notion that the mapped locus is indeed a complementary sex determination locus.

      The fourth paragraph of the results and the sixth paragraph of the discussion are devoted to explaining the possible reasons why only 11/16 genotyped males are homozygous in the mapped region. The revised manuscript will include an additional sentence (in what will be lines 384-388) in this paragraph that includes the possible explanation that this locus is, in fact, a false positive, while also emphasizing that we find this possibility to be unlikely given our multiple lines of evidence.

      In response to Reviewer 1’s suggestion that we carefully interpret the role of the mapped region in sex determination, we highlight our careful wording choices, nearly always referring to the mapped locus as a “candidate sex determination locus” in the title and throughout the manuscript. For consistency, the revised manuscript version will change the second results subheading from “The O. biroi CSD locus is homologous to another ant sex determination locus but not to honeybee csd” to “O. biroi’s candidate CSD locus is homologous to another ant sex determination locus but not to honeybee csd,” and will add the word “candidate” in what will be line 320 at the beginning of the Discussion, and will change “putative” to “candidate” in what will be line 426 at the end of the Discussion.

      In the abstract, it is stated that CSD loci have been mapped in honeybees and two ant species, but we know little about their evolutionary history. But CSD candidate loci were also mapped in a wasp with multi-locus CSD (study cited in the introduction). This wasp is also parthenogenetic via central fusion automixis and produces diploid males. This is a very similar situation to the present study and should be referenced and discussed accordingly, particularly since the authors make the interesting suggestion that their ant also has multi-locus CSD and neither the wasp nor the ant has tra homologs in the CSD candidate regions. Also, is there any homology to the CSD candidate regions in the wasp species and the studied ant?

      In response to Reviewer 1’s suggestion that we reference the (Matthey-Doret et al. 2019) study in the context of diploid males being produced via losses of heterozygosity during asexual reproduction, the revised manuscript will include (in what will be lines 123-126) the highlighted portion of the following sentence: “Therefore, if O. biroi uses CSD, diploid males might result from losses of heterozygosity at sex determination loci (Fig. 1C), similar to what is thought to occur in other asexual Hymenoptera that produce diploid males (Rabeling and Kronauer 2012; Matthey-Doret et al. 2019).”

      We note, however, that in their 2019 study, Matthey-Doret et al. did not directly test the hypothesis that diploid males result from losses of heterozygosity at CSD loci during asexual reproduction, because the diploid males they used for their mapping study came from inbred crosses in a sexual population of that species.

      We address this further below, but we want to emphasize that we do not intend to argue that O. biroi has multiple CSD loci. Instead, we suggest that the existence of additional, undetected CSD loci is one possible explanation for the absence of diploid males from any clonal line other than clonal line A. In response to Reviewer 1’s suggestion that we reference the (Matthey-Doret et al. 2019) study in the context of multilocus CSD, the revised manuscript version will include the following additional sentence in the fifth paragraph of the discussion (in what will be lines 372-374): “Multi-locus CSD has been suggested to limit the extent of diploid male production in asexual species under some circumstances (Vorburger 2013; Matthey-Doret et al. 2019).”

      Regarding Reviewer 1’s question about homology between the putative CSD loci from the (Matthey-Doret et al. 2019) study and O. biroi, we note that there is no homology. The revised manuscript version will have an additional Supplementary Table (which will be the new Supplementary Table S3) that will report the results of this homology search. The revised manuscript will also include the following additional sentence in the Results, in what will be lines 172-174: “We found no homology between the genes within the O. biroi CSD index peak and any of the genes within the putative L. fabarum CSD loci (Supplementary Table S3).”

      The authors used different clonal lines of O. biroi to investigate whether heterozygosity at the mapped CSD locus is required for female development in all clonal lines of O. biroi (L187-196). However, given that the described parthenogenesis mechanism in this species conserves heterozygosity, additional females that are heterozygous are not very informative here. Indeed, one would need diploid males in these other clonal lines as well (but such males have not yet been found) to make any inference regarding this locus in other lines.

      We agree that a full mapping study including diploid males from all clonal lines would be preferable, but as stated earlier in that same paragraph, we have only found diploid males from clonal line A. We stand behind our modest claim that “Females from all six clonal lines were heterozygous at the CSD index peak, consistent with its putative role as a CSD locus in all O. biroi.” In the revised manuscript version, this sentence (in what will be lines 199-201) will be changed slightly in response to a reviewer comment below: “All females from all six clonal lines (including 26 diploid females from clonal line B) were heterozygous at the CSD index peak, consistent with its putative role as a CSD locus in all O. biroi.”

      Reviewer #2 (Public review):

      The manuscript by Lacy et al. is well written, with a clear and compelling introduction that effectively conveys the significance of the study. The methods are appropriate and well-executed, and the results, both in the main text and supplementary materials, are presented in a clear and detailed manner. The authors interpret their findings with appropriate caution.

      This work makes a valuable contribution to our understanding of the evolution of complementary sex determination (CSD) in ants. In particular, it provides important evidence for the ancient origin of a non-coding locus implicated in sex determination, and shows that, remarkably, this sex locus is conserved even in an ant species with a non-canonical reproductive system that typically does not produce males. I found this to be an excellent and well-rounded study, carefully analyzed and well contextualized.

      That said, I do have a few minor comments, primarily concerning the discussion of the potential 'ghost' CSD locus. While the authors acknowledge (line 367) that they currently have no data to distinguish among the alternative hypotheses, I found the evidence for an additional CSD locus presented in the results (lines 261-302) somewhat limited and at times a bit difficult to follow. I wonder whether further clarification or supporting evidence could already be extracted from the existing data. Specifically:

      We agree with Reviewer 2 that the evidence for a second CSD locus is limited. In fact, we do not intend to advocate for there being a second locus, but we suggest that a second CSD locus is one possible explanation for the absence of diploid males outside of clonal line A. In our initial version, we intentionally conveyed this ambiguity by titling this section “O. biroi may have one or multiple sex determination loci.” However, we now see that this leads to undue emphasis on the possibility of a second locus. In the revised manuscript, we will split this into two separate sections: “Diploid male production differs across O. biroi clonal lines” and “O. biroi lacks a tra-containing CSD locus.”

      (1) Line 268: I doubt the relevance of comparing the proportion of diploid males among all males between lines A and B to infer the presence of additional CSD loci. Since the mechanisms producing these two types of males differ, it might be more appropriate to compare the proportion of diploid males among all diploid offspring. This ratio has been used in previous studies on CSD in Hymenoptera to estimate the number of sex loci (see, for example, Cook 1993, de Boer et al. 2008, 2012, Ma et al. 2013, and Chen et al., 2021). The exact method might not be applicable to clonal raider ants, but I think comparing the percentage of diploid males among the total number of (diploid) offspring produced between the two lineages might be a better argument for a difference in CSD loci number.

      We want to re-emphasize here that we do not wish to advocate for there being two CSD loci in O. biroi. Rather, we want to explain that this is one possible explanation for the apparent absence of diploid males outside of clonal line A. We hope that the modifications to the manuscript described in the previous response help to clarify this.

      Reviewer 2 is correct that comparing the number of diploid males to diploid females does not apply to clonal raider ants. This is because males are vanishingly rare among the vast numbers of females produced. We do not count how many females are produced in laboratory stock colonies, and males are sampled opportunistically. Therefore, we cannot report exact numbers. However, we will add the highlighted portion of the following sentence (in what will be lines 268-270) to the revised manuscript: “Despite the fact that we maintain more colonies of clonal line B than of clonal line A in the lab, all the diploid males we detected came from clonal line A.”

      (2) If line B indeed carries an additional CSD locus, one would expect that some females could be homozygous at the ANTSR locus but still viable, being heterozygous only at the other locus. Do the authors detect any females in line B that are homozygous at the ANTSR locus? If so, this would support the existence of an additional, functionally independent CSD locus.

      We thank the reviewer for this suggestion, and again we emphasize that we do not want to argue in favor of multiple CSD loci. We just want to introduce it as one possible explanation for the absence of diploid males outside of clonal line A.

      The 26 sequenced diploid females from clonal line B are all heterozygous at the mapped locus, and the revised manuscript will clarify this in what will be lines 199-201. Previously, only six of those diploid females were included in Supplementary Table S2, and that will be modified accordingly.

      (3) Line 281: The description of the two tra-containing CSD loci as "conserved" between Vollenhovia and the honey bee may be misleading. It suggests shared ancestry, whereas the honey bee csd gene is known to have arisen via a relatively recent gene duplication from fem/tra (10.1038/nature07052). It would be more accurate to refer to this similarity as a case of convergent evolution rather than conservation.

      In the sentence that Reviewer 2 refers to, we are representing the assertion made in the (Miyakawa and Mikheyev 2015) paper in which, regarding their mapping of a candidate CSD locus that contains two linked tra homologs, they write in the abstract: “these data support the prediction that the same CSD mechanism has indeed been conserved for over 100 million years.” In that same paper, Miyakawa and Mikheyev write in the discussion section: “As ants and bees diverged more than 100 million years ago, sex determination in honey bees and V. emeryi is probably homologous and has been conserved for at least this long.”

      As noted by Reviewer 2, this appears to conflict with a previously advanced hypothesis: because fem and csd were found in Apis mellifera, Apis cerana, and Apis dorsata, but only fem was found in Melipona compressipes, Bombus terrestris, and Nasonia vitripennis, the csd gene evolved after the honeybee (Apis) lineage diverged from other bees (Hasselmann et al. 2008). However, it remains possible that the csd gene evolved after ants and bees diverged from N. vitripennis, but before the divergence of ants and bees, and then was subsequently lost in B. terrestris and M. compressipes. This view was previously put forward based on bioinformatic identification of putative orthologs of csd and fem in bumblebees and in ants [(Schmieder et al. 2012), see also (Privman et al. 2013)]. However, subsequent work disagreed and argued that the duplications of tra found in ants and in bumblebees represented convergent evolution rather than homology (Koch et al. 2014). Distinguishing between these possibilities will be aided by additional sex determination locus mapping studies and functional dissection of the underlying molecular mechanisms in diverse Aculeata.

      Distinguishing between these competing hypotheses is beyond the scope of our paper, but the revised manuscript will include additional text to incorporate some of this nuance. We will include these modified lines below (in what will be lines 287-295), with the additions highlighted:

      “A second QTL region identified in V. emeryi (V.emeryiCsdQTL1) contains two closely linked tra homologs, similar to the closely linked honeybee tra homologs, csd and fem (Miyakawa and Mikheyev 2015). This, along with the discovery of duplicated tra homologs that undergo concerted evolution in bumblebees and ants (Schmieder et al. 2012; Privman et al. 2013) has led to the hypothesis that the function of tra homologs as CSD loci is conserved with the csd-containing region of honeybees (Schmieder et al. 2012; Miyakawa and Mikheyev 2015). However, other work has suggested that tra duplications occurred independently in honeybees, bumblebees, and ants (Hasselmann et al. 2008; Koch et al. 2014), and it remains to be demonstrated that either of these tra homologs acts as a primary CSD signal in V. emeryi.”

      (4) Finally, since the authors successfully identified multiple alleles of the first CSD locus using previously sequenced haploid males, I wonder whether they also observed comparable allelic diversity at the candidate second CSD locus. This would provide useful supporting evidence for its functional relevance.

      As is already addressed in the final paragraph of the results and in Supplementary Fig. S4, there is no peak of nucleotide diversity in any of the regions homologous to V.emeryiQTL1, which is the tra-containing candidate sex determination locus (Miyakawa and Mikheyev 2015). In the revised manuscript, the relevant lines will be 307-310. We want to restate that we do not propose that there is a second candidate CSD locus in O. biroi, but we simply raise the possibility that multi-locus CSD *might* explain the absence of diploid males from clonal lines other than clonal line A (as one of several alternative possibilities).

      Overall, these are relatively minor points in the context of a strong manuscript, but I believe addressing them would improve the clarity and robustness of the authors' conclusions.

      Reviewer #3 (Public review):

      Summary:

      The sex determination mechanism governed by the complementary sex determination (CSD) locus is one of the mechanisms that support the haplodiploid sex determination system evolved in hymenopteran insects. While many ant species are believed to possess a CSD locus, it has only been specifically identified in two species. The authors analyzed diploid females and the rarely occurring diploid males of the clonal ant Ooceraea biroi and identified a 46 kb CSD candidate region that is consistently heterozygous in females and predominantly homozygous in males. This region was found to be homologous to the CSD locus reported in distantly related ants. In the Argentine ant, Linepithema humile, the CSD locus overlaps with an lncRNA (ANTSR) that is essential for female development and is associated with the heterozygous region (Pan et al. 2024). Similarly, an lncRNA is encoded near the heterozygous region within the CSD candidate region of O. biroi. Although this lncRNA shares low sequence similarity with ANTSR, its potential functional involvement in sex determination is suggested. Based on these findings, the authors propose that the heterozygous region and the adjacent lncRNA in O. biroi may trigger female development via a mechanism similar to that of L. humile. They further suggest that the molecular mechanisms of sex determination involving the CSD locus in ants have been highly conserved for approximately 112 million years. This study is one of the few to identify a CSD candidate region in ants and is particularly noteworthy as the first to do so in a parthenogenetic species.

      Strengths:

      (1) The CSD candidate region was found to be homologous to the CSD locus reported in distantly related ant species, enhancing the significance of the findings.

      (2) Identifying the CSD candidate region in a parthenogenetic species like O. biroi is a notable achievement and adds novelty to the research.

      Weaknesses:

      (1) Functional validation of the lncRNA's role is lacking, and further investigation through knockout or knockdown experiments is necessary to confirm its involvement in sex determination.

      See response below.

      (2) The claim that the lncRNA is essential for female development appears to reiterate findings already proposed by Pan et al. (2024), which may reduce the novelty of the study.

      We do not claim that the lncRNA is essential for female development in O. biroi, but simply mention the possibility that, as in L. humile, it is somehow involved in sex determination. We do not have any functional evidence for this, so this is purely based on its genomic position immediately adjacent to our mapped candidate region. We agree with the reviewer that the study by Pan et al. (2024) decreases the novelty of our findings. Another way of looking at this is that our study supports and bolsters previous findings by partially replicating the results in a different species.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      L307-308 should state homozygous for either allele in THE MAJORITY of diploid males.

      This will be fixed in the revised manuscript, in what will be line 321.

      Reviewer #3 (Recommendations for the authors):

      The association between heterozygosity in the CSD candidate region and female development in O. biroi, along with the high sequence homology of this region to CSD loci identified in two distantly related ant species, is not sufficient to fully address the evolution of the CSD locus and the mechanisms of sex determination.

      Given that functional genetic tools, such as genome editing, have already been established in O. biroi, I strongly recommend that the authors investigate the role of the lncRNA through knockout or knockdown experiments and assess its impact on the sex-specific splicing pattern of the downstream tra gene.

      Although knockout experiments of the lncRNA would be illuminating, the primary signal of complementary sex determination is heterozygosity. As is clearly stated in our manuscript and that of (Pan et al. 2024), it does not appear to be heterozygosity within the lncRNA that induces female development, but rather heterozygosity in non-transcribed regions linked to the lncRNA. Therefore, future mechanistic studies of sex determination in O. biroi, L. humile, and other ants should explore how homozygosity or heterozygosity of this region impacts the sex determination cascade, rather than focusing (exclusively) on the lncRNA.

      With this in mind, we developed three sets of guide RNAs that cut only one allele within the mapped CSD locus, with the goal of producing deletions within the highly variable region within the mapped locus. This would lead to functional hemizygosity or homozygosity within this region, depending on how the cuts were repaired. We also developed several sets of PCR primers to assess the heterozygosity of the resultant animals. After injecting 1,162 eggs over several weeks and genotyping the hundreds of resultant animals with PCR, we confirmed that we could induce hemizygosity or homozygosity within this region, at least in ~1/20 of the injected embryos. Although it is possible to assess the sex-specificity of the splice isoform of tra as a proxy for sex determination phenotypes (as done by (Pan et al. 2024)), the ideal experiment would assess male phenotypic development at the pupal stage. Therefore, over several more weeks, we injected hundreds more eggs with these reagents and reared the injected embryos to the pupal stage. However, substantial mortality was observed, with only 12 injected eggs developing to the pupal stage. All of these were female, and none of them had been successfully mutated.

      In conclusion, we agree with the reviewer that functional experiments would be useful, and we made extensive attempts to conduct such experiments. However, these experiments turned out to be extremely challenging with the currently available protocols. Ultimately, we therefore decided to abandon these attempts.

      We opted not to include these experiments in the paper itself because we cannot meaningfully interpret their results. However, we are pleased that, in this response letter, we can include a brief description for readers interested in attempting similar experiments.

      Since O. biroi reproduces parthenogenetically and most offspring develop into females, observing a shift from female- to male-specific splicing of tra upon early embryonic knockout of the lncRNA would provide much stronger evidence that this lncRNA is essential for female development. Without such functional validation, the authors' claim (lines 36-38) seems to reiterate findings already proposed by Pan et al. (2024) and, as such, lacks sufficient novelty.

      We have responded to the issue of “lack of novelty” above. But again, the actual CSD locus in both O. biroi and L. humile appears to be distinct from (but genetically linked to) the lncRNA, and we have no experimental evidence that the putative lncRNA in O. biroi is involved in sex determination at all. Because of this, and given the experimental challenges described above, we do not currently intend to pursue functional studies of the lncRNA.

      References

      Hasselmann M, Gempe T, Schiøtt M, Nunes-Silva CG, Otte M, Beye M. 2008. Evidence for the evolutionary nascence of a novel sex determination pathway in honeybees. Nature 454:519–522.

      Koch V, Nissen I, Schmitt BD, Beye M. 2014. Independent Evolutionary Origin of fem Paralogous Genes and Complementary Sex Determination in Hymenopteran Insects. PLOS ONE 9:e91883.

      Matthey-Doret C, van der Kooi CJ, Jeffries DL, Bast J, Dennis AB, Vorburger C, Schwander T. 2019. Mapping of multiple complementary sex determination loci in a parasitoid wasp. Genome Biology and Evolution 11:2954–2962.

      Miyakawa MO, Mikheyev AS. 2015. QTL mapping of sex determination loci supports an ancient pathway in ants and honey bees. PLOS Genetics 11:e1005656.

      Pan Q, Darras H, Keller L. 2024. LncRNA gene ANTSR coordinates complementary sex determination in the Argentine ant. Science Advances 10:eadp1532.

      Privman E, Wurm Y, Keller L. 2013. Duplication and concerted evolution in a master sex determiner under balancing selection. Proceedings of the Royal Society B: Biological Sciences 280:20122968.

      Rabeling C, Kronauer DJC. 2012. Thelytokous parthenogenesis in eusocial Hymenoptera. Annual Review of Entomology 58:273–292.

      Schmieder S, Colinet D, Poirié M. 2012. Tracing back the nascence of a new sex-determination pathway to the ancestor of bees and ants. Nature Communications 3:1–7.

      Vorburger C. 2013. Thelytoky and Sex Determination in the Hymenoptera: Mutual Constraints. Sexual Development 8:50–58.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this paper, Behruznia and colleagues use long-read sequencing data for 339 strains of the Mycobacterium tuberculosis complex to study genome evolution in this clonal bacterial pathogen. They use both a "classical" pangenome approach that looks at the presence and absence of genes, and a pangenome graph based on whole genomes in order to investigate structural variants in non-coding regions. The comparison of the two approaches is informative and shows that much is missed when focussing only on genes. The two main biological results of the study are that 1) the MTBC has a small pangenome with few accessory genes, and that 2) pangenome evolution is driven by genome reduction. In the revised article, the description of the data set and the methods is much improved, and the comparison of the two pangenome approaches is more consistent. I still think, however, that the discussion of genome reduction suffers from a basic flaw, namely the failure to distinguish clearly between orthologs and homologs/paralogs.

      Strengths:

      The authors put together the largest data set so far of long-read assemblies representing most lineages of the Mycobacterium tuberculosis complex and covering a large geographic area. They sequenced and assembled genomes for strains of M. pinnipedii, L9, and La2, for which no high-quality assemblies were available previously. State-of-the-art methods are used to analyze gene presence-absence polymorphisms (Panaroo) and to construct a pangenome graph (PanGraph). Additional analysis steps are performed to address known problems with misannotated or misassembled genes.

      Weaknesses:

      The revised manuscript has gained much clarity and consistency. One previous criticism, however, has in my opinion not been properly addressed. I think the problem boils down to not clearly distinguishing between orthologs and paralogs/homologs. As this problem affects a main conclusion - the prevalence of deletions over insertions in the MTBC - it should be addressed, if not through additional analyses, then at least in the discussion.

      Insertions and deletions are now distinguished in the following way: "Accessory regions were further classified as a deletion if present in over 50% of the 192 sub-lineages or an insertion/duplication if present in less than 50% of sub-lineages." The outcome of this classification is suspicious: not a single accessory region was classified as an insertion/duplication. As a sanity check, I'd expect at least some insertions of IS6110, which has produced lineage- or sublineage-specific insertions (Roychowdhury et al. 2015, Shitikov et al. 2019), to show up. Why, for example, wouldn't IS6110 insertions in the single L8 strain show up here?
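
      The quoted classification rule is simple enough to write down directly. The sketch below is an illustration only: the function name classify_accessory_region is hypothetical, and because the quoted rule does not say what happens at exactly 50%, the "ambiguous" label for that case is an assumption.

```python
def classify_accessory_region(n_present, n_sublineages=192):
    """Apply the quoted >50% / <50% presence rule to one accessory region."""
    if not 0 < n_present < n_sublineages:
        raise ValueError(
            "an accessory region is present in some but not all sub-lineages"
        )
    fraction = n_present / n_sublineages
    if fraction > 0.5:
        # Present in most sub-lineages, so its absence is read as a loss.
        return "deletion"
    if fraction < 0.5:
        # Present in few sub-lineages, so its presence is read as a gain.
        return "insertion/duplication"
    # The quoted rule does not cover exactly 50%; this label is an assumption.
    return "ambiguous"
```

      Under this rule, a region found in a single sub-lineage would be called an insertion/duplication, which is why the complete absence of such calls is surprising.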

      In a fully clonal organism, any insertion/duplication will be an insertion/duplication of an existing sequence, and thus produce a paralog. If I'm correctly understanding your methods section, paralogs are systematically excluded in the pangraph analysis. Genomic blocks are summarized at the sublineage levels as follows (l. 184): "The DNA sequences from genomic blocks present in at least one sub-lineage but completely absent in others were extracted to look for long-term evolution patterns in the pangenome." I presume this is done using blastn, as in other steps of the analysis.

      So a sublineage-specific copy of IS6110 would be excluded here, because IS6110 is present somewhere in the genome in all sublineages. However, the appropriate category of comparison, at least for the discussion of genome reduction, is orthology rather than homology: is the same, orthologous copy of IS6110, at the same position in the genome, present or absent in other sublineages? The same considerations apply to potential sublineage-specific duplicates of PE, PPE, and Esx genes. These gene families play important roles in host-pathogen interactions, so I'd argue that the neglect of paralogs is not a finicky detail, but could be of broader biological relevance.
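
      The distinction drawn above can be made concrete with a toy example. Everything in this sketch is hypothetical (the sub-lineage names, the locus labels, and both function names); it only contrasts a position-agnostic presence call with a position-aware one, and does not reproduce how Pangraph or the authors' scripts work.

```python
# Hypothetical toy data: each sub-lineage maps to a set of (element, locus) pairs.
genomes = {
    "sublineage_1": {("IS6110", "locus_A"), ("PE_gene", "locus_C")},
    "sublineage_2": {("IS6110", "locus_A"), ("PE_gene", "locus_C")},
    "sublineage_3": {
        ("IS6110", "locus_A"),
        ("IS6110", "locus_B"),  # sub-lineage-specific extra copy
        ("PE_gene", "locus_C"),
    },
}


def accessory_by_homology(genomes):
    """Elements missing from at least one sub-lineage, ignoring genomic position."""
    per_genome = [{element for element, _ in copies} for copies in genomes.values()]
    return set.union(*per_genome) - set.intersection(*per_genome)


def accessory_by_orthology(genomes):
    """(element, locus) copies missing from at least one sub-lineage."""
    per_genome = list(genomes.values())
    return set.union(*per_genome) - set.intersection(*per_genome)


print(accessory_by_homology(genomes))   # no accessory elements are reported
print(accessory_by_orthology(genomes))  # the extra IS6110 copy is reported
```

      With presence scored per element, the extra IS6110 copy in sublineage_3 is invisible because IS6110 occurs somewhere in every genome; scoring presence per element and locus recovers it as a sub-lineage-specific insertion.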

      Reviewer #2 (Public review):

      Summary:

      The authors attempted to investigate the pangenome of MTBC by using a selection of state-of-the-art bioinformatic tools to analyse 324 complete and 11 new genomes representing all known lineages and sublineages. The aim of their work was to describe the total diversity of the MTBC and to investigate the driving evolutionary force. By using long-read and hybrid approaches for genome assembly, an important attempt was made to understand why previous reports gave varying estimates of the MTBC pangenome size. This study provides strong evidence that the MTBC pangenome is closed and that genome reduction is the main driver of this species' evolution.

      Strengths:

      A stand-out feature of this work is the inclusion of non-coding regions, as opposed to only the coding regions that were the focus of previous papers and analyses of the MTBC pangenome. A unique feature of this work is that it highlights sublineage-specific regions of difference (RDs) that were previously unknown. Another major strength is the utilisation of long-read whole-genome sequences, in combination with short-read sequences when available. It is known that using only short reads for genome assembly has several pitfalls. The parallel approach of utilizing both Panaroo and Pangraph for pangenomic reconstruction illuminated limitations of both tools while highlighting genomic features identified by both. This is important for any future work and perhaps alludes to the need for more MTBC-specific tools to be developed. Lastly, there is ample statistical support, in the form of Heaps' law and genome fluidity calculations for each pangenome, to demonstrate that they are indeed closed.

      Weaknesses:

      There are no major weaknesses in the revised version of this manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      l. 27: "lineage-specific and -independent deletions": it is still not clear to me what a lineage-independent, or convergent, deletion is supposed to be. TBD1, for instance, is not lineage-specific, but it is also not convergent: it occurred once in the common ancestor of lineages 1, 2, and 3, while convergence implies multiple parallel occurrences.

      We have changed this and in other places to more evolutionary terms, such as divergent (single event) and convergent (multiple events), or explain exactly what is meant where needed.

      l. 118: "where relevant", what does that mean?

      This was superfluous to the description and so is now removed.

      l. 178ff.: It is not clear to me what issue is addressed by this correction of the pangenome graph. Also here there seems to be some confusion regarding orthologs and paralogs. A gene or IS copy can be present at one locus but absent at another, which is not a mistake of Pangraph that would require correction. It's rather the notion of "truly absent region" which is ambiguous.

      We have changed the text to be more specific on the utility of this step. Since it is known that Panaroo mislabels some genes as being absent due to over splitting (see Ceres et al 2022 and our reclassification earlier in the paper), we wanted to see if the same occurred in Pangraph. We have modified the methods text to be more specific (line 181) and in the results included the percentage of total genes/regions affected by this correction.

      In relation to copy number, Pangraph is not syntenic in its approach; if a region is present anywhere it is labelled as present in the genome. Pangraph will look for multiple copies of that region (e.g. an IS element) but indeed we did not look for specific syntenic changes across the genomes. This would be a great analysis and something we will consider in the future; we have indicated such in the discussion (line 454).

      l. 305: "mislabelled as absent": see above, is this really 'mislabelled'?

      See answer to question above

      l. 372: "using the approach": something missing here.

      This was superfluous to the description and so is now removed.

      l. 381: the "additional analysis of paralogous blocks" (l. 381) seems to suffer from the same confusion of ortho- and paralogy described above: no new sub-lineage-specific accessory regions are found presumably because the analysis did consider any copy rather than orthologous copies.

      Paralogous copies were looked for by Pangraph, and we did not find any sub-lineage where all members had additional copies compared to other sub-lineages. Indeed, single genomes could have these, and shorter timescales could see a lot of such insertions, but we looked at longer-scale (all genomes within a sub-lineage) patterns and did not find these. These limitations are already outlined in the discussion.

      l. 415: see above. There is no diagnosis of a problem that would motivate a "correction". That's different from the correction of the Panaroo results, where fragmented annotations have been shown to be a problem.

      Of interest, the refining of regions re-labelled multiple regions as core where Pangraph had labelled them as absent from some genomes, at about the same rate as the correction to Pangraph (2% of genes/regions). This indicates there is a stringency issue with Pangraph where blocks are mislabelled as absent. The underlying reason for this is not clear but the correction is evidently required in this version of Pangraph.

      l. 430ff.: The issue of paralogy and that the "same" gene or region is defined in terms of homology rather than orthology should be addressed here. For me the given evidence does not support the claim that deletion is driving molecular evolution in the MTBC.

      As outlined above, indeed paralogy may be driving some elements of the overall evolutionary patterns; our analysis just did not find this. Panaroo without merged paralogs did not find paralogous genes as a main differentiating factor for any sub-lineage. Pangraph also did not find multiple copies of blocks present in all genomes in a sub-lineage. As outlined above, indeed single genomes show such patterns but we did not include single genome analyses here, and outline that as a next step in the discussion. We have also linked to a recent pangenome paper that showed duplication is present in the pangenome of MTBC, although not related to any specific lineage (Discussion line 485).

      l. 443 ff: "lineage-independent deletions (convergent evolution)": see above, I still think this terminology is unclear

      This has now been made clearer to be specifically about convergent and divergent evolutionary patterns.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this detailed study, Cohen and Ben-Shaul characterized the AOB cell responses to various conspecific urine samples in female mice across the estrous cycle. The authors found that AOB cell responses vary with the strains and sexes of the samples. Between estrous and non-estrous females, no clear or consistent difference in responses was found. The cell response patterns, as measured by the distance between pairs of stimuli, are largely stable. When some changes do occur, they are not consistent across strains or male status. The authors concluded that AOB detects the signals without interpreting them. Overall, this study will provide useful information for scientists in the field of olfaction.

      Strengths:

      The study uses electrophysiological recording to characterize the responses of AOB cells to various urines in female mice. AOB recording is not trivial as it requires activation of VNO pump. The team uses a unique preparation to activate the VNO pump with electric stimulation, allowing them to record AOB cell responses to urines in anesthetized animals. The study comprehensively described the AOB cell responses to social stimuli and how the responses vary (or not) with features of the urine source and the reproductive state of the recording females. The dataset could be a valuable resource for scientists in the field of olfaction.

      Weaknesses:

      (1) The figures could be better labeled.

      We revised all figures (except the model figure, Fig. 8), and among other improvements (many of which were suggested by the reviewers in other comments), added more labelling and annotation within the figures.

      (2) For Figure 2E, please plot the error bar. Are there any statistics performed to compare the mean responses?

      We added error bars (standard errors of the mean). We had not originally performed statistical comparisons between the stimuli, but now we have. The analysis of response strength now appears in a new table (Table 1).

      (3) For Figure 2D, it will be more informative to plot the percentage of responsive units.

      Done.

      (4) Could the similarity in response be explained by the similarity in urine composition? The study will be significantly strengthened by understanding the "distance" of chemical composition in different urine.

      We agree. As we wrote in the Discussion: “Ultimately, lacking knowledge of the chemical space associated with each of the stimuli, this and all the other ideas developed here remain speculative.” We note however, that chemical distance (which in itself is hard to define) will provide only part of the picture. The other part is the “projection” of chemical space on the receptor array. This is an idea that we develop in the Discussion and in Figure 8. Specifically, that it is the combination of stimulus composition, and receptor tuning properties that will determine stimulus distances in neuronal space.

      That said, a better understanding of the chemical distance is an important aspect that we are working to include in our future studies. For this dataset unfortunately, we have no such data.

      (5) If it is not possible for the authors to obtain these data first-hand, published data on MUPs and chemicals found in these urines may provide some clues.

      This comment is directly related to the previous one. Measurements about some classes of molecules may be found for some of the stimuli that we used here, but not for all. We are not aware of any single dataset that contains this information for any type of molecule across the entire stimulus set that we have used and pooling results from different studies has limited validity because of the biological and technical variability across studies. In order to reliably interpret our current recordings, it would be necessary to measure the urinary content of the very same samples that were used for stimulation. Unfortunately, we are not able to conduct this analysis at this stage.

      (6) It is not very clear to me whether the female overrepresentation is because there are truly more AOB cells that respond to females than males or because there are only two female samples but 9 male samples.

      The definitive answer to this comment is given in our response to the next one.

      Nevertheless, we agree that this is an important point. It is true that the number of neurons fulfilling each of the patterns depends on the number of individual stimuli that define it (and on the frequency of neurons that respond to those stimuli). However, our measure of “over representation” was designed to overcome this bias, by using bootstrapping to reveal if the observed number of patterns is larger than expected by chance.  The higher frequency of responses to female, as compared to male stimuli, is observed in other studies by others and by us, also when the number of male and female stimuli is matched (e.g., Bansal et al BMC Biol 2021, Ben-Shaul et al, PNAS 2010, Hendrickson et al, JNS, 2008). However, here, by overrepresentation, we do not refer to the higher frequency of female responding neurons, but rather that given the number of responding neurons, the female pattern is more common than expected by chance.

      (7) If the authors only select two male samples, let's say ICR Naïve and ICR DOM, combine them with responses to two female samples, and do the same analysis as in Figure 3, will the female response still be overrepresented?

      Following this suggestion, we have performed this analysis, and we were glad to see that the result is the one we had anticipated. Below, we provide an image of the results, following the same approach that we applied before, and showed in Figure 3C. Here, we defined a female pattern (using the two female samples) and compared it to a male pattern (using the ICR naïve and ICR DOM as suggested). It is as if we had only four stimuli in our set. As in the article, we calculated the expected distribution with 100,000 shuffles. We denoted this pattern as F/M ICR. The results are shown below.

      Under the present conditions, the distribution of the number of female selective patterns is larger (i.e., shifted to the right; compare to the female category in Figure 3C). This is expected, since now the criterion is more permissive. Specifically, now to qualify as a “female pattern”, the two responses to female urine must be stronger only than the responses to the two male stimuli included in this analysis (and to all other responses). Notably, although the null distribution shifted to the right, the actual number of neurons fulfilling this pattern is also larger, so that the actual number remains significantly larger than expected by chance. This is also true for the reverse category (as is the case in the ~female category in Figure 3C). Thus, we conclude that overrepresentation of the female pattern is not a trivial consequence of the number of male and female stimuli.
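      The shuffle-based comparison described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' actual procedure: the response values are invented toy data, the names `fits_female_pattern`, `count_pattern` and `shuffle_test` are hypothetical, the null distribution is built by permuting each neuron's responses across the four stimuli (the exact shuffling scheme is not described here), and 1,000 shuffles are used instead of the 100,000 reported in the response.

```python
import random


def fits_female_pattern(responses, female_idx):
    """True if every response to a female stimulus exceeds every other response."""
    female = [responses[i] for i in female_idx]
    others = [r for i, r in enumerate(responses) if i not in female_idx]
    return min(female) > max(others)


def count_pattern(population, female_idx):
    """Number of neurons in the population that fulfil the female pattern."""
    return sum(fits_female_pattern(r, female_idx) for r in population)


def shuffle_test(population, female_idx, n_shuffles=1000, seed=0):
    """Compare the observed pattern count with counts from shuffled responses.

    Assumption: the null is generated by permuting each neuron's responses
    across stimuli, which breaks any link between response and stimulus identity.
    """
    rng = random.Random(seed)
    observed = count_pattern(population, female_idx)
    null_counts = []
    for _ in range(n_shuffles):
        shuffled = [rng.sample(r, len(r)) for r in population]
        null_counts.append(count_pattern(shuffled, female_idx))
    # One-sided p-value with the usual +1 correction for permutation tests.
    exceed = sum(c >= observed for c in null_counts)
    p_value = (exceed + 1) / (n_shuffles + 1)
    return observed, null_counts, p_value


# Toy data (invented): 4 stimuli per neuron, indices 0 and 1 are "female".
data_rng = random.Random(1)
female_idx = (0, 1)
population = []
for neuron in range(50):
    if neuron < 20:  # neurons built to prefer the female stimuli
        responses = [data_rng.uniform(2, 3), data_rng.uniform(2, 3),
                     data_rng.uniform(0, 1), data_rng.uniform(0, 1)]
    else:  # neurons with no stimulus preference
        responses = [data_rng.uniform(0, 1) for _ in range(4)]
    population.append(responses)

observed, null_counts, p_value = shuffle_test(population, female_idx)
print(observed, sum(null_counts) / len(null_counts), p_value)
```

      With only four stimuli, a neuron with no preference fulfils the pattern by chance in roughly one of six permutations, which is why the null distribution sits further to the right than it would with eleven stimuli, while a genuinely overrepresented pattern still exceeds it.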

      Author response image 1.

      (8) In Figure 4B and 4C, the pairwise distance during non-estrus is generally higher than that during estrus, although they are highly correlated. Does it mean that the cells respond to different urines more distinctively during diestrus than in estrus?

      This is an important observation (!) and we had originally overlooked it. It is true that higher distances (as they are in non-estrus) imply more distinct population level responses and hence better discrimination among stimuli. However, this is inconsistent with all our other analyses that do not point to enhanced selectivity or discrimination in either state. If anything, we find somewhat higher sparseness in estrus. Yet, there may be technical explanations for the differences.

      For Euclidean distances, the explanation may be trivial. The distance depends on the number of dimensions (i.e., units), and since our sample contains more neurons recorded during non-estrus, the larger distance is expected.

      In fact, there is a similar dependence on sample size for the correlation distance. Smaller samples are associated with higher (spurious) correlations, and hence larger samples are associated with larger distances. To demonstrate this, we conducted a simple simulation, where we calculated the absolute correlation coefficients of random samples from standard normal distributions (using the MATLAB function randn), changing the size of the population. For each sample size, we conducted 1000 tests. We considered sample sizes from 10 to 100000, including 200 and 300 (which are similar to our sample sizes). The results are shown in the figure below. Note that the absolute value of the correlation coefficient decreases with sample size, while the p-value for the observed correlation is stable at ~0.5.

      While this is not a rigorous analysis of this issue, and while it does not exactly reflect the scenario in our data, where correlations are generally positive, it shows that the observed correlation (and hence correlation distance) is also affected by sample size.

      For these reasons, we focus on comparison of these distances, rather than the absolute values of the correlation distances.
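      The two sample-size effects described above can be illustrated with a small simulation. This is a minimal sketch in Python rather than the MATLAB code used for the response: the helper names `pearson_r` and `simulate` are hypothetical, and it uses fewer repetitions (200 instead of 1000) and a smaller range of sample sizes so that it runs quickly with the standard library alone.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_units, n_tests, rng):
    """Mean |r| and mean Euclidean distance for pairs of independent
    standard-normal vectors with n_units entries each."""
    abs_r_total = 0.0
    dist_total = 0.0
    for _ in range(n_tests):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_units)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n_units)]
        abs_r_total += abs(pearson_r(x, y))
        dist_total += math.dist(x, y)
    return abs_r_total / n_tests, dist_total / n_tests


rng = random.Random(0)
results = {n: simulate(n, 200, rng) for n in (10, 200, 300, 1000)}
for n, (mean_abs_r, mean_dist) in results.items():
    print(f"n={n:5d}  mean |r|={mean_abs_r:.3f}  mean Euclidean distance={mean_dist:.2f}")
```

      In this sketch the mean absolute correlation shrinks as the number of units grows, while the mean Euclidean distance grows, even though the vectors are pure noise. Like the original simulation, it does not reproduce the generally positive correlations in the recorded data.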

      Author response image 2.

      Following this comment, we now write in the manuscript:

      “We first note that distances are generally larger during non-estrus, suggesting enhanced discrimination during this stage. However, further analyses of sparseness and selectivity do not support this idea (see below). Furthermore, we note that both Euclidean and correlation distances generally depend on sample size. In both cases, distances are expected to increase as a function of sample size, which in our dataset, is larger for the non-estrus (n = 305) as compared to the estrus (n = 241) neurons. Because of this factor, we focus here on the similarity of the relative within-state distances across the states (and not on their absolute magnitudes). Specifically, we find a positive and significant correlation among pairwise population distances under the two states. Thus, at the population level, representational space remains broadly stable across the estrus cycle. Nevertheless, several points in Fig. 4D, E clearly diverge from a linear relationship, implying that representational space differs under the two states. We next examine such state-dependent changes in more detail.”

      (9) The correlation analysis is not entirely intuitive when just looking at the figures. Some sample heatmaps showing the response differences between estrous states will be helpful.

      If we understand correctly, the idea is to show the correlation matrices from which the values in 4B and 4C are taken. The relevant images are now included in figure 4B, C and are referenced within the main text.

      Reviewer #2 (Public review):

      Summary:

      Many aspects of the study are carefully done, and in the grand scheme this is a solid contribution. I have no "big-picture" concerns about the approach or methodology. However, in numerous places the manuscript is unnecessarily vague, ambiguous, or confusing. Tightening up the presentation will magnify their impact.

      We have reviewed the text and made substantial editing changes. Along with other specific comments made by both reviewers, we hope that these changes improve the presentation.

      Strengths:

      (1) The study includes urine donors from males of three strains each with three social states, as well as females in two states. This diversity significantly enhances their ability to interpret their results.

      (2) Several distinct analyses are used to explore the question of whether AOB MCs are biased towards specific states or different between estrus and non-estrus females. The results of these different analyses are self-reinforcing about the main conclusions of the study.

      (3) The presentation maintains a neutral perspective throughout while touching on topics of widespread interest.

      Weaknesses:

      (1) Introduction:

      The discussion of the role of the VNS and preferences for different male stimuli should perhaps include Wysocki and Lepri 1991

      We assume that the reviewer is referring to “Consequences of removing the vomeronasal organ” by Wysocki CJ, Lepri JJ, a review article in J Steroid Biochem from 1991. We were not familiar with this specific article and have now read it. The article discusses various male behaviors, and some effects on female behavior and physiology (e.g., puberty acceleration, maternal behaviors, ovulation) but we could not find any mention of the preference of female mice in this article. We also expanded our search to all PubMed articles authored by Wysocki and Lepri and then all articles by Wysocki (with the keyword Vomeronasal). Despite our best intentions to give due credit, we found nothing that seems directly related to this statement. Please correct us if we have missed anything.

      (2) Results:

      a) Given the 20s gap between them, the distinction between sample application and sympathetic nerve trunk stimulation needs to be made crystal clear; in many places, "stimulus application" is used in places where this reviewer suspects they actually mean sympathetic nerve trunk stimulation.

      We realize that this is confusing, and we also agree that at least in one place, we have not been sufficiently clear about the distinction. To clarify, we distinguish between stimulus application (physical application of stimulus to the nostril), and stimulation (which refers to SNT stimulation, which typically induces VNO suction). The general term stimulus presentation refers to the entire process. As explained in the text, in our analysis, we consider the entire window starting at application and ending 40s after stimulation. This is because we sometimes observe immediate responses following application. One such response is seen in Figure 2D, and this is directly related to a detailed comment made below (on Figure 1D, part c). Indeed, for this figure time 0 indicates stimulus application. This was indicated previously, but we have now rearranged the order of the panels to make the distinction between this response and the others clearer. We have also revised the figure caption and the text to clarify this issue.

      b) There appears to be a mismatch between the discussion of Figure 3 and its contents. Specifically, there is an example of an "adjusted" pattern in 3A, not 3B.

      True. We have revised the text to correctly refer to the figure. Thanks.

      c) The discussion of patterns neglects to mention whether it's possible for a neuron to belong to more than one pattern. For example, it would seem possible for a neuron to simultaneously fit the "ICR pattern" and the "dominant adjusted pattern" if, e.g., all ICR responses are stronger than all others, but if simultaneously within each strain the dominant male causes the largest response.

      This is true. In the legend to Figure 3B, we actually wrote: “A neuron may fulfill more than one pattern and thus may appear in more than one row.”, but we now also write in the main text:

      “We note that criteria for adjusted patterns are less stringent than for the standard patterns defined above. Furthermore, some patterns are not mutually exclusive, and thus, a neuron may fulfil more than a single pattern.”

      (3) Discussion:

      a) The discussion of chemical specificity in urine focuses on volatiles and MUPs (citation #47), but many important molecules for the VNS are small, nonvolatile ligands. For such molecules, the corresponding study is Fu et al 2015.

      Agreed. We now cite this work and several others that were not included before in the context of chemical and electrophysiological analyses.

      b) "Following our line of reasoning, this scarcity may represent an optimal allocation of resources to separate dominant from naïve males": 1 unit out of 215 is roughly consistent with a single receptor. Surely little would be lost if there could be more computational capacity devoted to this important axis than that? It seems more likely that dominance is computed from multiple neuronal types with mixed encoding.

      We fully agree, and we are not claiming that dominance, nor any other feature, is derived using dedicated feature selective neurons. Our discussion of resource allocation is inevitably speculative. Our main point in this context is that a lack of overrepresentation does not imply that a feature is not important. As a note, we do not think that there is good reason to suppose that AOB neurons reflect the activity of single receptors.

      To prevent this potential confusion, we now added the following sentences in the Discussion subsection titled “Response patterns of AOB-MCs”:

      “We stress that we do not suggest that features such as physiological state are encoded by the activity of single neurons. In fact, we believe that most ethologically relevant features are encoded by the activity of multiple neurons. Nevertheless, such population level representations ultimately depend on the response properties of individual neurons, and we thus ask: what can we learn from our analysis of response pattern frequency?”

      (4) Methods:

      a) Male status, "were unambiguous in most cases": is it possible to put numerical estimates on this? 55% and 99% are both "most," yet they differ substantially in interpretive uncertainty.

      Upon reexamination, we realized that this sentence is incorrect. Ambiguous cases were not considered as dominant for urine collection. We only classified mice as dominant if they “won” the tube test and exhibited dominant behavior in the subsequent observation period in the cage. The phrasing has now been corrected in the manuscript (Methods section).

      b) Surgical procedures and electrode positioning: important details of probes are missing (electrode recording area, spacing, etc).

      This information has been added to the Methods subsection “Surgical procedures and electrode positioning”.

      c) Stimulus presentation procedure: Are stimuli manually pipetted or delivered by apparatus with precise timing?

      They are delivered manually. This has now been clarified in the text.

      d) Data analysis, "we applied more permissive criteria involving response magnitude": it's not clear whether this is what's spelled out in the next paragraph, or whether that's left unspecified. In either case, the next paragraph appears to be about establishing a noise floor on pattern membership, not a "permissive criterion."

      True, the next paragraph is not the explanation for the more permissive criteria. The more permissive criteria involving response magnitude are actually those described in Figure 3A and 3B. The sentence that was quoted above merely states that before applying those criteria, we had also searched for patterns defined by binary designation of neurons as responsive, or not responsive, to each of the stimuli (this is directly related to the next comment below). Using those binary definitions, we obtained a very small number of neurons for each pattern and thus decided to apply the approach actually used and described in the manuscript.

      To clarify this confusion, we thoroughly revised the description in this paragraph, and the beginning of the next one in the Methods section.

      e) Data analysis, method for assessing significance: there's a lot to like about the use of pooling to estimate the baseline and the use of an ANOVA-like test to assess unit responsiveness.

      But:

      i) for a specific stimulus, at 4 trials (the minimum specified in "Stimulus presentation procedure") kruskalwallis is questionable. They state that most trials use 5, however, and that should be okay.

      The exact values are now given in the text. The mean number of repeated presentations per stimulus: 5.1 ± 0.9, mean ± sd. In 72% of the cases, stimuli were given 5 or more times. Otherwise, they were presented 4 times. In the context of the statistical test, we note that we are not comparing 5 (or 4) values with another set of 5 (or 4) values, but with a much larger sample (~44-55 baseline trials, given 11 stimuli and 4-5 repeats of each). Under this scenario, we think that the statistical approach is sound. However, the more important consideration, in our opinion, is given below.
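      As a rough illustration of the comparison described above (a handful of stimulus trials against a much larger pooled baseline), the following sketch implements a two-group Kruskal-Wallis test with the standard library. It is not the authors' MATLAB code: the function names are hypothetical, the firing-rate values are invented, and the p-value relies on the chi-square approximation with one degree of freedom, which is only approximate for groups this small.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_two_groups(a, b):
    """Kruskal-Wallis H and chi-square (df = 1) p-value for two groups."""
    values = list(a) + list(b)
    n = len(values)
    ranks = average_ranks(values)
    rank_sum_a = sum(ranks[:len(a)])
    rank_sum_b = sum(ranks[len(a):])
    h = 12.0 / (n * (n + 1)) * (rank_sum_a ** 2 / len(a) + rank_sum_b ** 2 / len(b)) - 3 * (n + 1)
    # Tie correction: divide H by 1 - sum(t^3 - t) / (n^3 - n).
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    tie_factor = 1.0 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    if tie_factor == 0:  # all values identical, no evidence of a difference
        return 0.0, 1.0
    h = max(h / tie_factor, 0.0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(h / 2.0))
    return h, p_value


# Invented example: 50 pooled baseline trials versus 5 trials of one stimulus.
baseline = [0.1 * i for i in range(50)]
stimulus = [6.0, 6.5, 7.0, 7.5, 8.0]
h_stat, p_value = kruskal_wallis_two_groups(stimulus, baseline)
print(round(h_stat, 2), p_value)
```

      In this toy case all five stimulus trials rank above every baseline trial, so even the small group yields a small p-value. Whether four trials suffice in practice depends on how consistently the responses separate from baseline.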

      ii) the methods statement suggests they are running kruskalwallis individually for each neuron/stimulus, rather than once per neuron across all stimuli. With 11 stimuli, there is a substantial chance of a false-positive if they used p < 0.05 to assess significance. (The actual threshold was unstated.) Were there any multiple comparison corrections performed? Or did they run kruskalwallis on the neuron, and then if significant assess individual stimuli? (Which is a form of multiple-comparisons correction.)

      First, we indeed failed to mention that our criterion was 0.05. This has been corrected, by adding the information to the results and the Methods sections. No, we did not apply any multiple comparison measures. We consider each neuron-stimulus pair as an independent entity, and we are aware that this leads to a higher false positive rate. On the other hand, applying multiple comparison corrections would be problematic, as the number of stimuli used in different studies varies. Application of multiple comparison corrections would thus lead to different response criteria across different studies, which would be very problematic. This raises the almost philosophical question regarding the use of multiple comparisons (as well as one and two tailed tests), but practically, most, if not all of our conclusions involve comparisons across conditions. For this purpose, we think that our procedure is valid. More generally, while selection of responses according to significance has some obvious advantages, the decision to use any particular criterion is entirely arbitrary. Therefore, we do not attach any special meaning to the significance threshold used here. Rather, we think of it as a simple criterion that allows us to exclude weakly responding or non-responsive neurons, and to compare frequencies of neurons that fulfill this criterion, under different conditions and contexts.
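      The size of the effect under discussion can be made concrete with a short calculation. This is only a sketch under a simplifying assumption that the 11 per-stimulus tests for a neuron are independent and that the neuron responds to none of the stimuli. The function names are hypothetical, and the Bonferroni threshold is shown for comparison only, since no such correction was applied to the responsiveness criterion.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across n_tests
    independent tests, each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


fwer = familywise_error_rate(0.05, 11)
per_test = bonferroni_threshold(0.05, 11)
print(f"family-wise rate: {fwer:.3f}, Bonferroni per-test threshold: {per_test:.4f}")
```

      Under these assumptions, a completely unresponsive neuron would be flagged for at least one of 11 stimuli in roughly 43% of cases. The corrected per-test threshold changes with the number of stimuli, which is the cross-study comparability concern raised in the response.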

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      Results:

      "are represented more than represented by chance" seems to have a misplaced word

      True. Thanks. Corrected.

      Figure 1D:

      a) Indicate the meaning of the number that appears in the top left for each unit (10, 5, 40, 5, 5) (I'm guessing it's the vertical scale for the PSTH, but best to spell it out explicitly.)

      This information has been added.

      b) "The red vertical line indicates stimulus application": is it the application of the chemical stimulus or SNT shock?

      Please see our answer to c

      c) "For unit 2, time 0 indicate stimulus application, as in this case, responses began after stimulus application, prior to stimulation." First, the meaning of time 0 for the other units is not clearly specified (we infer that unit 2 is an exception, but we don't know what most of them mean). Second, it seems as if the response (?) to ICR naive begins even before stimulus application.

      This issue was also mentioned above as the 2nd weakness raised by this reviewer. To explain the meaning of the red lines, and resolve this confusion, we revised the figure caption text to indicate that for all units (except the former unit 2) time 0 indicates SNT stimulation. We also changed the order of the unit examples, placing the former unit 2 in the rightmost position. It is true that for this unit, there is a firing rate change prior to stimulus application, which actually appears as rate attenuation following stimulus application. In this specific case, we consider this activity as “noise”, and note that this neuron-stimulus combination would not be classified as a response (since there is no consistent change across stimulus presentation).

      As a note, while reviewing this figure, we noted an error. We have previously written that the ITI was 10 s, whereas it was actually 18 s long. This has been corrected in the Figure and in the text.

      Figure 2B:

      "The mean error due to the reduced 2-D representation is 0.29 (arbitrary units)." This is unclear. MDS is often described in terms of % of variance explained, is that what this means? If so, the units are not arbitrary; otherwise, it's unclear whether specifying a value with arbitrary units adds any value.

      This is a very good point, and we thank the reviewer for identifying this mistake. The units are not arbitrary! They are units of correlation distance. We now added a scale bar (a square) to panel 2B to indicate a distance of 0.1. Following this comment, we also calculated the mean error in the original data, and noted the ratio between the mean absolute error (due to considering only two dimensions) and the mean original distances. We also now report the value of the first two eigenvalues. Specifically, we now write:

      “Note that like all dimensionally reduced representations, the representation in Fig. 2B is an approximation. Here, the first two eigenvalues account for 44.6% of the variance of the original distances (30.4% and 14.2%, respectively for the first and second dimension). Another way to evaluate the representation is via the mean error due to the reduced 2-D representation. Here, it is 0.29, whereas the mean of the original distances is 0.73.”
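      The figures quoted above can be combined in a few lines of arithmetic. The sketch below only restates the reported numbers; the variable names are hypothetical, and the relative error is a derived quantity computed here for illustration, not a value quoted from the manuscript.

```python
# Percent of variance attributed to the first two MDS dimensions (reported values).
first_dim_pct = 30.4
second_dim_pct = 14.2
combined_pct = first_dim_pct + second_dim_pct

# Mean absolute error of the 2-D representation relative to the mean
# original correlation distance (both reported values).
mean_error = 0.29
mean_distance = 0.73
relative_error = mean_error / mean_distance

print(f"combined variance: {combined_pct:.1f}%, relative error: {relative_error:.2f}")
```

      Read this way, the two-dimensional map distorts a typical pairwise distance by roughly 40% of its original magnitude, consistent with less than half of the variance being captured by the first two dimensions.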

      Figure 3A:

      a) There is a truncated label (or something) above the panel letter.

      Thanks. Corrected. This was part of the “Figure” label.

      b) The graphic for the "adjusted pattern" also fits the criterion of the "pattern": for example, in the top row the activity for ICR is still higher than for any other stimulus, thus fulfilling the criterion of a "pattern" and not just an "adjusted pattern."

      That was not our intention. An adjusted pattern does not necessarily fulfill the (non-adjusted) “pattern” (while the opposite is true). We have now revised the rightmost panel in figure 3A, adding both “&s” to indicate that all three conditions must be fulfilled, and, in an attempt at a more intuitive representation, applied a different background denoting stimuli with irrelevant responses. We also changed the terms in the legend within the panel, making them more accurate (thus, “strong activity” was changed to “stronger responses”). In addition, we revised the text and figure legends in an attempt to better clarify these definitions.

      Figure 3B:

      I'm assuming that the columns of the heatmap correspond to different urine stimuli, and that the color is normalized firing rate. But readers should not have to guess.

      True, and agreed. We added legends to clarify this.

      Figure 4B:

      The caption should mention that the pairwise measures are between the stimulus columns of panel A.

      We revised the caption to indicate this. Note that we also added two additional panels to this figure.

      Figure 5A&B:

      Instead of a multiple-comparisons correction, it seems likely to be better to use a 2-way ANOVA. At a minimum, the nature of the multiple-comparisons correction needs to be specified (many are conservative, but they differ in the extent of how conservative they are).

      We now write in the text that we used a Bonferroni correction (this information previously appeared only in the caption). We also found an error in the caption. We previously wrote that we used a binomial exact test for both panels A and B. However, only the data in panel A was calculated with a binomial exact test. The data in panel B was calculated with a one-way ANOVA.

      We now also applied a 2-way ANOVA to response magnitudes (i.e., panel B). We find a main effect of stimulus, but not of state, and no effect of interaction between the two. This is consistent with our previous analyses. This analysis is now included in the text. We thank the reviewer for this suggestion.
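
      For readers unfamiliar with the correction named above, the following is a minimal sketch of a Bonferroni adjustment. The p-values are hypothetical and the function name `bonferroni` is an assumption; this is not the authors' analysis code.

```python
def bonferroni(p_values, alpha=0.05):
    """Multiply each p-value by the number of tests, capping at 1.0."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    # A test is declared significant if its adjusted p-value is below alpha.
    reject = [p_adj < alpha for p_adj in adjusted]
    return adjusted, reject


# Hypothetical p-values from four comparisons.
raw = [0.001, 0.02, 0.04, 0.3]
adjusted, reject = bonferroni(raw)
print(adjusted, reject)
```

      Only the first comparison survives the correction in this toy case, which illustrates why the reviewer describes such corrections as conservative.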

      Editor's note:

      Should you choose to revise your manuscript, if you have not already done so, please include full statistical reporting including exact p-values wherever possible alongside the summary statistics (test statistic and df) and, where appropriate, 95% confidence intervals. These should be reported for all key questions and not only when the p-value is less than 0.05 in the main manuscript.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review): 

      This paper describes technically-impressive measurements of calcium signals near synaptic ribbons in goldfish bipolar cells. The data presented provides high spatial and temporal resolution information about calcium concentrations along the ribbon at various distances from the site of entry at the plasma membrane. This is important information. Important gaps in the data presented mean that the evidence for the main conclusions is currently inadequate. 

      Strengths 

      The technical aspects of the measurements are impressive. The authors use calcium indicators bound to the ribbon and high speed line scans to resolve changes with a spatial resolution of ~250 nm and temporal resolution of less than 10 ms. These spatial and temporal scales are much closer to those relevant for vesicle release than previous measurements. 

      The use of calcium indicators with very different affinities and of different intracellular calcium buffers helps provide confirmation of key results. 

      Thank you very much for this positive evaluation of our work.

      Weaknesses 

      Multiple key points of the paper lack a statistical test or summary data from populations of cells. For example, the text states that the proximal and distal calcium kinetics in Figure 2A differ. This is not clear from the inset to Figure 2A - where the traces look like scaled versions of each other. Values for time to half-maximal peak fluorescence are given for one example cell but no statistics or summary are provided. Figure 8 shows examples from one cell with no summary data. This issue comes up in other places as well. 

      Thank you for this fair and valuable feedback. Following also the suggestion by the Editor, we have now removed the rise-time kinetic fitting results from the manuscript and only retain the bi-exponential decay time constant values. Further, we explicitly detail the issues with kinetic fitting, and state that precise quantitative conclusions should not be drawn from the differences in kinetic parameters (pages 7 and 27-28).

      We have included the results of paired-t-tests to compare the amplitudes of proximal vs. distal calcium signals shown in Fig. 2A & B, Fig. 3C & D, Fig. 4C & D, Fig. 5A-D, and Fig. 8E&F. Because proximal and distal calcium signals were obtained from the same ribbons within 500-nm distances, as the Reviewer pointed out, “the traces look like scaled versions of each other”. For experiments where we make comparisons across cells or different calcium indicators, as shown in Fig. 3E & F, Fig.5E, and Fig. 8B&C, we have included the results of an unpaired t-test. We have also included the t-test statistics information in the respective figure legends in the revised version.
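
      The paired comparison described above can be sketched as follows. The amplitude values are hypothetical, the helper name `paired_t` is an assumption, and only the test statistic and degrees of freedom are computed (a p-value would additionally require the t distribution, for example from a statistics library).

```python
import math
from statistics import mean, stdev


def paired_t(proximal, distal):
    """Return the paired t statistic and degrees of freedom."""
    if len(proximal) != len(distal):
        raise ValueError("paired samples must have equal length")
    diffs = [p - d for p, d in zip(proximal, distal)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    # Standard error of the mean difference uses the sample standard deviation.
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t_stat, n - 1


# Hypothetical amplitudes measured at the same four ribbons.
proximal = [2.0, 3.0, 4.0, 5.0]
distal = [1.0, 1.5, 2.5, 3.0]
t_stat, df = paired_t(proximal, distal)
print(round(t_stat, 3), df)
```

      Pairing is appropriate here because, as the response notes, both signals come from the same ribbon, so between-ribbon variability drops out of the differences.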

      In Figure 8, we have shown example fluorescence traces from two different cells at the bottom of panel A, and example traces from different ribbons of RBC-a in panel D; the summary data are described in panels B-C and E-F, with statistics provided in the figure legends.

      The rise time measurements in Figure 2 are very different for low and high affinity indicators, but no explanation is given for this difference. Similarly, the measurements of peak calcium concentration in Figure 4 are very different with the two indicators. That might suggest that the high affinity indicator is strongly saturated, which raises concerns about whether that is impacting the kinetic measurements. 

      Yes, we do believe that the high-affinity indicator is partially saturated, and therefore, the measurement with the low-affinity indicator dye is a more accurate reflection of the measured Ca2+ signal. We now state this more explicitly in the text. Further, we note that the rise time values are no longer listed due to lack of statistical significance for such comparisons, as noted above.

      Reviewer #2 (Public review): 

      Summary: 

      The study introduces new tools for measuring intracellular Ca2+ concentration gradients around retinal rod bipolar cell (rbc) synaptic ribbons. This is done by comparing the Ca2+ profiles measured with mobile Ca2+ indicator dyes versus ribbon-tethered (immobile) Ca2+ indicator dyes. The Ca2+ imaging results provide a straightforward demonstration of Ca2+ gradients around the ribbon and validate their experimental strategy. This experimental work is complemented by a coherent, open-source, computational model that successfully describes changes in Ca2+ domains as a function of Ca2+ buffering. In addition, the authors try to demonstrate that there is heterogeneity among synaptic ribbons within an individual rbc terminal. 

      Strengths: 

      The study introduces a new set of tools for estimating Ca2+ concentration gradients at ribbon AZs, and the experimental results are accompanied by an open-source, computational model that nicely describes Ca2+ buffering at the rbc synaptic ribbon. In addition, the dissociated retinal preparation remains a valuable approach for studying ribbon synapses. Lastly, excellent EM. 

      Thank you very much for this positive evaluation of our work.

      Comments on revisions: 

      Specific minor comments: 

      (1) Rewrite the final sentence of the Abstract. It is difficult to understand. 

      Thank you for pointing that out. We have updated the final sentence of the Abstract.

      (2) Add a definition in the Introduction (and revisit in the Discussion) that delineates between micro- and nano-domain. A practical approach would be to round up and round down. If you round up from 0.6 um, then it is microdomain which means ~ 1 um or higher. Likewise, round down from 0.3 um to nanodomain? If you are using confocal, or even STED, the resolution for Ca imaging will be in the 100 to 300 nm range. The point of your study is that your new immobile Ca2-ribbon indicator may actually be operating on a tens of nm scale: nanophysiology. The Results are clearly written in a way that acknowledges this point but maybe make such a "definition" comment in the intro/discussion in order to: 1) demonstrate the power of the new Ca2+ indicator to resolve signals at the base of the ribbon (effectively nano), and 2) (Discussion) to acknowledge that some are achieving nanoscopic resolution (50 to 100nm?) with light microscopy (as you ref'd Neef et al., 2018 Nat Comm).  

      Thank you for the valuable comments. We have now provided this information in the introduction and discussion.  

      (3) Suggested reference: Grabner et al. 2022 (Sci Adv, Supp video 13, and Fig S5). Here rod Cav channels are shown to be expressed on both sides the ribbon, at its base, and they are within nanometers from other AZ proteins. This agrees with the conclusions from your imaging work.  

      Thank you for the valuable suggestion. We have now provided this information in the introduction and discussion.

      (4) In the Discussion, add a little more context to what is known about synaptic transmission in the outer and inner retina. First, state that the postsynaptic receptors (for example: mGluR6-OnBCs vs KARs-OffBCs, vs. AMPAR-HCs), and possibly the synaptic cleft (ground squirrel), are known to have a significant impact on signaling in the outer retina. In the inner retina, there are many more unknowns. For example, when I think of the pioneering Palmer JPhysio study, which you cite, I think of NMDAR vs AMPAR, and uncertainty in what type of postsynaptic cell was patched (GC or AC....). Once you have informed the reader that the postsynapse is known to have a significant impact on signaling, then promote your experimental work that addresses presynaptic processes: "...the new tool and results allow us to explore release heterogeneity, ribbon by ribbon in dissociated preps, which we eventually plan to use at ribbon synapses within slices......to better understand how the presynapse shapes signaling......".

      Thank you for the valuable comments. We have now provided this information in the introduction and discussion.

      Reviewer #3 (Public review): 

      Summary: 

      In this study, the authors have developed a new Ca indicator conjugated to the peptide, which likely recognizes synaptic ribbons and have measured microdomain Ca near synaptic ribbons at retinal bipolar cells. This interesting approach allows one to measure Ca close to transmitter release sites, which may be relevant for synaptic vesicle fusion and replenishment. Though microdomain Ca at the active zone of ribbon synapses has been measured by Hudspeth and Moser, the new study uses the peptide recognizing synaptic ribbons, potentially measuring the Ca concentration relatively proximal to the release sites. 

      Strengths: 

      The study is, in principle, technically well done, and the peptide approach is technically interesting, which allows one to image Ca near the particular protein complexes. The approach is potentially applicable to other types of imaging. 

      Thank you very much for this appreciation.

      Weaknesses: 

      Peptides may not be entirely specific, and genetic approach tagging particular active zone proteins with fluorescent Ca indicator proteins may well be more specific. Although the authors are aware of this and the peptide approach is generally used for ribbon synapses, the authors should be aware of this, when interpreting the results. 

      We acknowledge the reviewer’s point and believe the peptides and genetic approaches to measure local calcium signals have their merits, each with separate advantages and disadvantages.  

      Reviewer #1 (Recommendations for the authors): 

      The revisions helped with some concerns about the original paper, but some issues were not adequately addressed. I have left two primary concerns in my public review. To summarize those: 

      The difference in kinetics of proximal and distal locations is emphasized and quantified in the paper, but the quantification consists of a fit to the average responses. This does not give an idea of whether the difference observed is significant or not. Without an estimate of the error across measurements, the difference in kinetics quoted is not interpretable.

      Thank you for this feedback. Since the kinetics information is a minor part of the manuscript, we have followed the Editor’s advice to significantly tone down the comparison of kinetic fit parameters (completely removing the rise-time comparisons), in order to put more focus on the better-documented conclusions. We also note that we did establish statistical significance of the differences in fluorescence signal amplitudes. 

      Somewhat relatedly, the difference in amplitude and kinetics of the calcium signals measured with low and high affinity indicators is quite concerning. The authors added one sentence stating that the high affinity indicator might be saturated. This is not adequate. Should we distrust the measurements using the high affinity indicator? The differences between the results using the low and high affinity indicators is in some cases large - e.g. larger than the differences cited as a key result between distal and proximal locations. This issue needs to be dealt with directly in the paper. 

      Thank you for this feedback. Yes, the measurements from high-affinity indicators cannot report the Ca2+ as accurately as low-affinity indicators. However, the value of HA indicators is in their ability to detect low-amplitude signals that lower-affinity indicators may miss due to lower signal-to-noise resolution. We added a sentence on page 12 to further stress this point.

      Related to the point about statistics, it is not clear how to relate the horizontal lines in Figure 8 to the actual measurements. It is critical for the evaluation of the conclusions from that figure to understand what is plotted and what the error bars are on the plotted data.

      We apologize for the earlier ambiguity in Fig. 8. In this figure, we first compare proximal (panel B) and distal (panel C) calcium signals across several RBCs, labeled RBC-a through RBC-d. Each RBC contains multiple ribbons, and for each cell, we present the average calcium signals from multiple ribbons using box plots in panels B and C. In these box plots, the horizontal lines represent the average calcium signal for each cell, while the size of the error bars reflects the variability in proximal and distal calcium signals among the ribbons within that RBC.

      For example, RBC-a had five identifiable ribbons. In panels D–F, we use RBC-a to illustrate the variability in calcium signals across individual ribbons. Specifically, we distinguished proximal and distal calcium signals from five ribbons (ribbons 1–5) within RBC-a. When feasible, we acquired multiple x–t line scans at a single ribbon, shown now as individual data points, to assess variability in calcium signals recorded from the same ribbon.

      The box plots in panels E and F display the average calcium signal (horizontal lines) for each ribbon, based on multiple recordings. These plots demonstrate considerable variability between ribbons of RBC-a. Importantly, the lack of or minimal error bars for repeated measurements at the same ribbon indicates that the proximal and distal calcium signals are consistent within a ribbon. These findings emphasize that the observed variability among ribbons and among cells reflects true biological heterogeneity in local calcium domains, rather than experimental noise.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) Changes in blood volume due to brain activity are indirectly related to neuronal responses. The exact relationship is not clear; however, we do know two things for certain: (a) each measurable unit of blood volume change depends on the response of hundreds or thousands of neurons, and (b) the time course of the volume changes is slow compared to the potential time course of the underlying neuronal responses. Both of these mean that important variability in neuronal responses will be averaged out when measuring blood changes. For example, if two neighbouring neurons have opposite responses to a given stimulus, this will produce opposite changes in blood volume, which will cancel each other out in the blood volume measurement due to (a). This is important in the present study because blood volume changes are implicitly being used as a measure of coding in the underlying neuronal population. The authors need to acknowledge that this is a coarse measure of neuronal responses and that important aspects of neuronal responses may be missing from the blood volume measure.

      The reviewer is correct: we do not measure neuronal firing but use blood volume as a proxy for bulk local neuronal activity, which does not capture the richness of single neuron responses. This is why the paper focuses on large-scale spatial representations as well as cross-species comparison. For this latter purpose, fMRI responses are on par with our fUSI data, with both neuroimaging techniques showing the same weakness. We have now added this point to the discussion: 

      “Second, we used blood volume as a proxy for local neuronal activity. Thus, our signal ignores any heterogeneity that might exist at the level of local neuronal populations. However, our main findings are related to the large-scale organization of cortical responses and how they relate to those of humans. For this purpose, the functional spatial resolution of our signal, driven by the spatial resolution of neurovascular coupling, should be adapted. In addition, using hemodynamic signals provides a much better comparison with human fMRI data, where the same limitations are present.”

      (2) More importantly for the present study, however, the effect of (b) is that any rapid changes in the response of a single neuron will be cancelled out by temporal averaging. Imagine a neuron whose response is transient, consisting of rapid excitation followed by rapid inhibition. Temporal averaging of these two responses will tend to cancel out both of them. As a result, blood volume measurements will tend to smooth out any fast, dynamic responses in the underlying neuronal population. In the present study, this temporal averaging is likely to be particularly important because the authors are comparing responses to dynamic (nonstationary) stimuli with responses to more constant stimuli. To a first approximation, neuronal responses to dynamic stimuli are themselves dynamic, and responses to constant stimuli are themselves constant. Therefore, the averaging will mean that the responses to dynamic stimuli are suppressed relative to the real responses in the underlying neurons, whereas the responses to constant stimuli are more veridical. On top of this, temporal following rates tend to decrease as one ascends the auditory hierarchy, meaning that the comparison between dynamic and stationary responses will be differently affected in different brain areas. As a result, the dynamic/stationary balance is expected to change as you ascend the hierarchy, and I would expect this to directly affect the results observed in this study.
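
      The temporal-averaging argument in the paragraph above can be illustrated with a toy calculation. The response shapes, sample counts, and the helper name `window_mean` are hypothetical; the sketch only shows that a biphasic transient averages toward zero within a window while a sustained response does not.

```python
def window_mean(signal, start, stop):
    """Average the samples signal[start:stop]."""
    window = signal[start:stop]
    return sum(window) / len(window)


# Hypothetical responses in arbitrary units, one value per time step.
# Transient: rapid excitation immediately followed by rapid inhibition.
transient = [0.0] * 2 + [1.0] * 3 + [-1.0] * 3 + [0.0] * 2
# Sustained: constant excitation over the same period.
sustained = [0.0] * 2 + [1.0] * 6 + [0.0] * 2

# Average over the six samples spanning the response.
avg_transient = window_mean(transient, 2, 8)
avg_sustained = window_mean(sustained, 2, 8)
print(avg_transient, avg_sustained)
```

      In this toy case the transient response averages to 0.0 and the sustained response to 1.0, even though both have the same peak amplitude.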

      It is not trivial to extrapolate from what we know about temporal following in the cortex to know exactly what the expected effect would be on the authors' results. As a first-pass control, I would strongly suggest incorporating into the authors' filterbank model a range of realistic temporal following rates (decreasing at higher levels), and spatially and temporally average these responses to get modelled cerebral blood flow measurements. I would want to know whether this model showed similar effects as in Figure 2. From my guess about what this model would show, I think it would not predict the effects shown by the authors in Figure 2. Nevertheless, this is an important issue to address and to provide control for.

      We understand the reviewer’s concern about potential differences in response dynamics in stationary vs non-stationary sounds. It seems that the reviewer is concerned that responses to foregrounds may be suppressed in non-primary fields because foregrounds are not stationary, and non-primary regions could struggle to track and respond to these sounds. Nevertheless, we observed the contrary, with non-primary regions overrepresenting non-stationary (dynamic) sounds, over stationary ones. For this reason, we are inclined to think that this explanation cannot falsify our findings. 

      We understand the comment that temporal following rates might differ across regions in the auditory hierarchy and agree. In fact, we do show that tuning to temporal rates differs across regions and partly explains the differences in background invariance we observe. In this regard, we think the reviewer’s suggestion is already implemented by our spectrotemporal model, which incorporates the full range of realistic temporal following rates (up to 128 Hz). The temporal averaging is done as we take the output of the model (which varies continuously through time) and average it in the same window as we used for fUSI data. When we fit this model to the ferret data, we find that voxels in non-primary regions, especially VP (tertiary auditory cortex), tend to be more tuned to low temporal rates (Figure 2F, G), and that background invariance is stronger in voxels tuned to low rates. This is, however, not true in humans, suggesting that background invariance in humans relies on different computational mechanisms. We have added a sentence to clarify this: “The model included a range of realistic temporal rates and this axis was the most informative to discriminate foregrounds from backgrounds.”

      (3) I do not agree with the equivalence that the authors draw between the statistical stationarity of sounds and their classification as foreground or background sounds. It is true that, in a common foreground/background situation - speech against a background of white noise - the foreground is non-stationary and the background is stationary. However, it is easy to come up with examples where this relationship is reversed. For example, a continuous pure tone is perfectly stationary, but will be perceived as a foreground sound if played loudly. Background music may be very non-stationary but still easily ignored as a background sound when listening to overlaid speech. Ultimately, the foreground/background distinction is a perceptual one that is not exclusively determined by physical characteristics of the sounds, and certainly not by a simple measure of stationarity. I understand that the use of foreground/background in the present study increases the likely reach of the paper, but I don't think it is appropriate to use this subjective/imprecise terminology in the results section of the paper.

      We appreciate the reviewer’s comment that the classification of our sounds into foregrounds and backgrounds is not verified by any perceptual experiments. We use those terms to be consistent with the literature (McWalter and McDermott, 2018; McWalter and McDermott, 2019), including the paper we derived this definition from (Kell et al., 2019). These terms are widely used in studies where no perceptual or behavioral experiments are included, and even when animals are anesthetized. We have clarified and justified this choice in the beginning of the Results section:

      “We used three types of stimuli: foregrounds, backgrounds, and combinations of those. We use those terms to refer to sounds differing in their stationarity, under the assumption that stationary sounds carry less information than non-stationary sounds, and are thus typically ignored.”

      We have also added a paragraph in the discussion to emphasize the limits of this definition:

      “First, this study defined foregrounds and backgrounds solely based on their acoustic stationarity, rather than perceptual judgments. This choice allowed us to isolate the contribution of acoustic factors in a simplified setting. Within this controlled framework, we show that acoustic features of foreground and background sounds drive their separation in the brain and the hierarchical extraction of foreground sound features.”

      (4) Related to the above, I think further caveats need to be acknowledged in the study. We do not know what sounds are perceived as foreground or background sounds by ferrets, or indeed whether they make this distinction reliably to the degree that humans do. Furthermore, the individual sounds used here have not been tested for their foreground/background-ness. Thus, the analysis relies on two logical jumps - first, that the stationarity of these sounds predicts their foreground/background perception in humans, and second, that this perceptual distinction is similar in ferrets and humans. I don't think it is known to what degree these jumps are justified. These issues do not directly affect the results, but I think it is essential to address these issues in the Discussion, because they are potentially major caveats to our understanding of the work.

      We agree with the reviewer that the foreground-background distinction might be different in ferrets. In anticipation of that issue, we had enriched the sound set with more ecologically relevant sounds, such as ferret and other animal vocalizations. Nevertheless, we have emphasized this limitation in addition to the limitation of our definition of foregrounds and backgrounds in the discussion: 

      “In addition, most of the sounds included in our study likely have more relevance for humans compared to ferrets (see Table 1). Despite including ferret vocalizations and environmental sounds that are more ecologically relevant for ferrets, it is not clear whether ferrets would behaviorally categorize foregrounds and backgrounds as humans do. Examining how ferrets naturally orient or respond to foreground and background sounds under more ecologically valid conditions, potentially with free exploration or spontaneous listening paradigms, could help address this issue.”

      Reviewer #2 (Public review):

      (1) Interpretation of the cerebral blood volume signal: While the results are compelling, more caution should be exercised by the authors in framing their results, given that they are using an indirect measure of neural activity. This is the difference between stating "CBV in area MEG was less background invariant than in higher areas" vs. saying "MEG was less background invariant than other areas". Beyond framing, the basic properties of the CBV signal should be better explored:

      a) Cortical vasculature is highly structured (e.g. Kirst et al. (2020) Cell). One potential explanation for the results is simply differences in vasculature and blood flow between primary and secondary areas of auditory cortex, even if fUS is sensitive to changes in blood flow, changes in capillary beds, etc. (Mace et al. (2011) Nat. Methods). This concern could be addressed by either analyzing spontaneous fluctuations in the CBV signal during silent periods or computing a signal-to-noise ratio of voxels across areas across all sound types. This is especially important given the complex 3D geometry of gyri and sulci in the ferret brain.

      We agree with the reviewers that there could be differences in vasculature across subregions of the auditory cortex and note that this point would also be valid for the published human fMRI data. Nevertheless, even if small differences in vasculature were present, it is unlikely that they would affect our analyses and results, which are designed to be independent of local vascular density. First, we normalize the signal in each voxel using the silent periods, so that the absolute strength of the raw signal, or baseline blood volume in each voxel, is factored in our analysis. Second, we only focus on reliably responsive voxels in each region and do see comparable sound-evoked responses in all regions (Figure S2). Third, our analysis mostly relies on voxel-based correlation across sounds, which is independent of the mean and variance of the voxel responses. Differences in noise, measured through test-retest reliability, can affect values of correlation, which is why we used a noise-correction procedure. After this procedure, invariance does not depend on test-retest, and differences across regions are still seen when matching for test-retest (new  Figure S7). Thus, we believe that differences in vascular architecture across regions are unlikely to affect our results. We added this point in the Methods section when discussing the noise-correction:

      “After this correction, the differences we observed between brain regions were present regardless of voxels' test-retest reliability, or noise level (Figure S7). Thus, potential differences in vasculature across regions are unlikely to affect our results.”
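
      The claim that a correlation across sounds is independent of the mean and variance of a voxel's responses can be checked with a short sketch. The response values, the gain and offset, and the helper name `pearson` are hypothetical; the noise-correction procedure itself is not reproduced here.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# Hypothetical responses of one voxel to five sounds in two conditions.
responses = [0.2, 1.1, 0.7, 1.9, 0.4]
reference = [0.1, 0.9, 0.8, 1.5, 0.6]

# Simulate a voxel with a different baseline and gain (e.g. denser vasculature).
scaled = [3.0 * r + 5.0 for r in responses]

r_original = pearson(responses, reference)
r_scaled = pearson(scaled, reference)
print(round(r_original, 6), round(r_scaled, 6))
```

      The two correlations agree up to floating-point error, since Pearson correlation is unchanged by a positive gain and an additive offset. Differences in noise level are a separate matter, which is what the noise correction mentioned above is meant to address.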

      b) Figure 1 leaves the reader uncertain what exactly is being encoded by the CBV signal, as temporal responses to different stimuli look very similar in the examples shown. One possibility is that the CBV is an acoustic change signal. In that case, sounds that are farther apart in acoustic space from previous sounds would elicit larger responses, which is straightforward to test. Another possibility is that the fUS signal reflects time-varying features in the acoustic signal (e.g. the low-frequency envelope). This could be addressed by cross-correlating the stimulus envelope with fUS waveform. The third possibility, which the authors argue, is that the magnitude of the fUS signal encodes the stimulus ID. A better understanding of the justification for only looking at the fUS magnitude in a short time window (2-4.8 s re: stimulus onset) would increase my confidence in the results.

      We thank the reviewer for raising that point as it highlights that the layout of Figure 1 is misleading. While Figure 1B shows an example snippet of our sound streams, Figure 1D shows the average timecourse of CBV time-locked to a change in sound (foreground or background, isolated or in a mixture). This is the average across all voxels and sounds, aiming at illustrating the dynamics for the three broad categories. In Figure 1E however, we show the cross-validated cross-correlation of CBV across sounds (and different time lags). To obtain this, we compute for each voxel the response to each sound at each time lag, thus obtaining two vectors (size: number of sounds) per lag, one per repeat. Then, we correlate all these vectors across the two repeats, obtaining one cross-correlation matrix per voxel. We finally average these matrices across all voxels. The presence of red squares with high correlations demonstrates that the signal encodes sound identity, since CBV is more similar across two repeats of the same sound (e.g., in the foreground only matrix, 0-5 s vs 0-5 s), than two different sounds (0-5 s vs. 7-12 s). We modified the figure layout as well as the legend to improve clarity.
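
      The cross-validated cross-correlation procedure described above can be sketched as follows, assuming a simple nested-list layout in which each voxel holds two repeats and each repeat holds one response vector (one value per sound) per time lag. The layout, the toy numbers, and the function names are assumptions for illustration, not the authors' code.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def crossval_lag_correlation(data):
    """data is a list of (repeat_1, repeat_2) pairs, one per voxel.

    Each repeat is a list indexed by lag; each entry is a vector of
    responses across sounds. Returns the lag-by-lag correlation matrix
    between the two repeats, averaged over voxels.
    """
    n_lags = len(data[0][0])
    total = [[0.0] * n_lags for _ in range(n_lags)]
    for repeat_1, repeat_2 in data:
        for i in range(n_lags):
            for j in range(n_lags):
                total[i][j] += pearson(repeat_1[i], repeat_2[j])
    return [[value / len(data) for value in row] for row in total]


# Hypothetical data: 2 voxels, 2 lags, 3 sounds.
toy_data = [
    ([[1.0, 2.0, 3.0], [3.0, 1.0, 2.0]], [[1.0, 2.0, 3.5], [3.0, 1.2, 2.0]]),
    ([[0.5, 1.5, 1.0], [2.0, 0.5, 1.0]], [[0.6, 1.4, 1.1], [2.1, 0.4, 1.2]]),
]
matrix = crossval_lag_correlation(toy_data)
for row in matrix:
    print([round(value, 2) for value in row])
```

      In this toy example the diagonal entries (same lag in both repeats) are larger than the off-diagonal ones, which is the pattern the response interprets as evidence that the signal carries sound identity. A response vector that is constant across sounds would make the correlation undefined, so a real implementation would need to handle that case.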

      (2) Interpretation of the human data: The authors acknowledge in the discussion that there are several differences between fMRI and fUS. The results would be more compelling if they performed a control analysis where they downsampled the ferret fUS data spatially and temporally to match the resolution of fMRI and demonstrated that their ferret results hold with lower spatiotemporal resolution.

      We agree with the reviewer that the use of different techniques might get in the way of cross-species comparison. We already control for the temporal aspect by using the average of stimulus-evoked activity across time (note that due to scanner noise, sounds are presented cut into small pieces in the fMRI experiments). Regarding the spatial aspect, there are several things to consider. First, the two species have brains of very different sizes, a factor that is conveniently compensated for by the higher spatial resolution of fUSI compared to fMRI (0.1 vs 2 mm). Downsampling to fMRI resolution would lead to having one voxel per region per slice, which is not feasible. We also summarize results with one value per region, which is a form of downsampling that is fairer across species. Furthermore, we believe that we already established in a previous study (Landemard et al., 2021, eLife) that fUSI and fMRI data are comparable signals. Indeed, we could predict human fMRI responses to most sounds from ferret fUSI responses to the same sounds. We clarified these points in the discussion:

      “In addition, fMRI has a worse spatial resolution than fUSI (here, 2 vs. 0.1 mm voxels). However, this difference in resolution compensates for the difference in brain size between humans and ferrets. In our previous work, we showed that a large fraction of cortical responses to natural sounds could be predicted from one species to the other using these methods (Landemard et al., 2021).”

      Reviewer #3 (Public review):

      As mentioned above, interpretation of the invariance analyses using predictions from the spectrotemporal modulation encoding model hinges on the model's ability to accurately predict neural responses. Although Figure S5 suggests the encoding model was generally able to predict voxel responses accurately, the authors note in the introduction that, in human auditory cortex, this kind of tuning can explain responses in primary areas but not in non-primary areas (Norman-Haignere & McDermott, PLOS Biol. 2018). Indeed, the prediction accuracy histograms in Figure  S5C suggest a slight difference in the model's ability to predict responses in primary versus non-primary voxels. Additional analyses should be done to a) determine whether the prediction accuracies are meaningfully different across regions and b) examine whether controlling for prediction accuracy across regions (i.e., subselecting voxels across regions with matched prediction accuracy) affects the outcomes of the invariance analyses.

      The reviewer is correct: the spectrotemporal model tends to perform less well in human non-primary cortex. We believe this does not contradict our results but goes in the same direction: while there is a gradient in invariance in both ferrets and humans, this gradient is predicted by the spectrotemporal model in ferrets, but not in humans (possibly indeed because predictions are less good in human non-primary auditory cortex). Regardless of the mechanism, this result points to a difference across species. In ferrets, we found a significantly better prediction accuracy in VP (p=0.001, permutation test) and no differences between MEG and dPEG (p=0.89). In humans, prediction accuracy was slightly higher in primary compared to non-primary auditory cortex, but this effect was not significant (p=0.076). In both species, when matching prediction accuracy between regions, the gradients in invariance were preserved. We have added these analyses to the manuscript (Figure S5).
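
      The response does not describe how prediction accuracy was matched between regions. One possible way to do it, shown purely as an illustration, is to bin voxels by prediction accuracy and keep the same number of voxels per bin from each region. The function name, the bin width, and the example accuracies below are hypothetical.

```python
import random
from collections import defaultdict


def match_by_accuracy(acc_a, acc_b, bin_width=0.1, seed=0):
    """Return indices of voxels from regions A and B whose prediction
    accuracies have matched histograms (same count per accuracy bin).

    acc_a, acc_b: per-voxel prediction accuracies for the two regions.
    This is an assumed procedure, not the one used in the manuscript.
    """
    rng = random.Random(seed)

    def bin_indices(acc):
        bins = defaultdict(list)
        for idx, value in enumerate(acc):
            bins[int(value // bin_width)].append(idx)
        return bins

    bins_a, bins_b = bin_indices(acc_a), bin_indices(acc_b)
    keep_a, keep_b = [], []
    # Only bins populated in both regions can contribute matched voxels.
    for key in sorted(set(bins_a) & set(bins_b)):
        n = min(len(bins_a[key]), len(bins_b[key]))
        keep_a += rng.sample(bins_a[key], n)
        keep_b += rng.sample(bins_b[key], n)
    return sorted(keep_a), sorted(keep_b)


# Hypothetical accuracies for two regions.
keep_a, keep_b = match_by_accuracy([0.15, 0.25, 0.26, 0.55],
                                   [0.12, 0.27, 0.85])
```

      Invariance would then be recomputed on the retained voxels only, so that any remaining difference between regions cannot be attributed to a difference in model fit.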

      A related concern is the procedure used to train the encoding model. From the methods, it appears that the model may have been fit using responses to both isolated and mixture sounds. If so, this raises questions about the interpretability of the invariance analyses. In particular, fitting the model to all stimuli, including mixtures, may inflate the apparent ability of the model to "explain" invariance, since it is effectively trained on the phenomenon it is later evaluated on. Put another way, if a voxel exhibits invariance, and the model is trained to predict the voxel's responses to all types of stimuli (both isolated sounds and mixtures), then the model must also show invariance to the extent it can accurately predict voxel responses, making the result somewhat circular. A more informative approach would be to train the encoding model only on responses to isolated sounds (or even better, a completely independent set of sounds), as this would help clarify whether any observed invariance is emergent from the model (i.e., truly a result of low-level tuning to spectrotemporal features) or simply reflects what it was trained to reproduce.

      We thank the reviewer for this suggestion. We have run an additional prediction using only the sounds presented in isolation, which replicates our main results (new Figure S6). We have added this control to the manuscript:

      “Results were similar if the model was fit solely on isolated sounds, excluding mixtures from the training set (Figure S6).”

      Finally, the interpretation of the foreground invariance results remains somewhat unclear. In ferrets (Figure 2I), the authors report relatively little foreground invariance, whereas in humans (Figure 5G), most participants appear to show relatively high levels of foreground invariance in primary auditory cortex (around 0.6 or greater). However, the paper does not explicitly address these apparent cross-species differences. Moreover, the findings in ferrets seem at odds with other recent work in ferrets (Hamersky et al. 2025 J. Neurosci.), which shows that background sounds tend to dominate responses to mixtures, suggesting a prevalence of foreground invariance at the neuronal level. Although this comparison comes with the caveat that the methods differ substantially from those used in the current study, given the contrast with the findings of this paper, further discussion would nonetheless be valuable to help contextualize the current findings and clarify how they relate to prior work.

      We thank the reviewer for this point. While we found a trend for higher background invariance than foreground invariance in ferret primary auditory cortex, this difference was not significant and many voxels exhibit similar levels of background and foreground invariance (for example in Figure 2D, G). Thus, we do not think our results are inconsistent with Hamersky et al., 2025, though we agree the bias towards background sounds is not as strong in our data. This might indeed reflect differences in methodology, both in the signal that is measured (blood volume vs spikes), and the sound presentation paradigm. Our timescales are much slower and likely reflect responses post-adaptation, which might not be as true for Hamersky et al. We have added this point to the discussion, as well as a comment on the difference between ferrets and humans in foreground invariance in primary auditory cortex:

      “In ferrets, primary auditory cortex has been found to over-represent backgrounds in mixtures compared to foregrounds (Hamersky et al., 2025). In contrast, we found a slight, non-significant bias towards foregrounds in primary regions. This difference could be driven by a difference in timescales, as we looked at slower timescales in which adaptation might be more present, reducing the strength of background encoding. In humans, we found a much smaller gap between background and foreground invariance in primary auditory cortex, which was not predicted by the spectrotemporal model. Additional, more closely controlled experiments would be needed to confirm and understand this species difference.”

      Reviewer #1 (Recommendations for the authors):

      (1) In the introduction, explain the relationship between background/foreground and stationarity/non-stationarity, and thus why stationary/nonstationary stimuli could be used to probe differences in background/foreground processing.

      We have added a sentence at the beginning of the results section to justify our choice (see public review).  

      (2) Avoid use of the background/foreground terminology in Results (and probably Methods).

      For consistency with previous literature, we decided to keep this terminology, though imperfect. We further justified our choice in the beginning of the Results section (see previous point).

      (3) In the Discussion, explain what the implications of the results are for background/foreground processing, and, importantly, highlight any caveats that result from stationarity not being a direct measure of background/foreground.

      We added a paragraph in the Discussion to highlight this point (see public review).

      Reviewer #2 (Recommendations for the authors):

      (1) Figure 1: Showing a silent period in the examples would help in understanding the fUS signal.

      In Figure 1D, we show the average timecourse of CBV time-locked to a change in sound (foreground or background, isolated or in a mixture). This is the average across all voxels and sounds. Thus, it would not be very informative to show an equivalent plot for a silent period, as it would look flat by definition. However, we updated the layout and legend of Figure 1 to make it clearer and avoid confusion.

      (2) "Responses were not homogenous" - would make more sense to say something like "responses were not spatially distributed".

      We removed these words, which were indeed not necessary: “We found that reliable sound-evoked responses were confined to the central part of ventral gyrus of the auditory cortex.”

      (3) Figure 2D: The maps shown in Figure 2D are difficult to understand for the noninitiated in fUS. At a minimum, labels should be added to indicate A-P, M-L, D-V. I cannot see the white square in the primary figure. An additional graphic would be helpful here to understand the geometry of the measurement.

      We thank the reviewer for pointing out that reading these images is indeed an acquired skill. We added an annotated image of anatomy with indications of main features to guide the reader in Figure 1. We also added missing white squares. 

      (4) Figure 2F: Can the authors better justify why the summary statistic is shown for all three areas, but the individual data only compares primary vs. higher order?

      We now show individual data for all three areas.

      (5) More methods information is needed to understand how recordings were stitched across days. Was any statistical modeling used to factor out the influence of day on overall response levels?

      We simply concatenated voxels recorded across different sessions and days. The slices were sampled randomly to avoid any systematic effect. Because different slices were sampled in different sessions, any spatial structure spanning several slices is unlikely to be artefactual. For instance, the map of average responses in Figure 2A shows a high level of continuity of spatial patterns across slices. This indicates that this pattern reflects a true underlying organization rather than session-specific noise. It also shows that the overall response levels are not affected by the day or recording session. We added a section in the Methods (“Combining different recordings”) to clarify this point:

      “The whole dataset consisted of multiple slices, each recorded in a different recording session. Slices to image on a given day were chosen at random to avoid any systematic bias. Responses were consistent across neighboring slices recorded on different sessions, as shown by the maps of average responses (Figure 2A, Figure S2) where any spatial continuity across different slices must reflect a true underlying signal in the absence of common noise.”

      Reviewer #3 (Recommendations for the authors):

      (1) Figures:

      The figures are generally very well done and visually appealing. However, I have a few suggestions and questions.

      a)  In Figure 1G, the delta CBV ranges from 0.5 to 1.5, although in subsequent figures (e.g., Figure 2D), the range is much larger (-15 to 45). Is it possible that the first figure is a proportion rather than a percentage, or is there some other explanation for the massive difference in scale? Not being very familiar with this measure, it was confusing.

      The same scale is used in both figures, the major difference being that in Figure 1D, we take the average over all voxels and sounds (for each category), which includes many non-responsive voxels and, for responsive voxels, sounds to which they respond only weakly. On the other hand, Figure 2D shows the response of a single, responsive voxel. Thus, the values it reaches for its preferred sounds (45%) are an extreme, which carries little weight in Figure 1D. We have changed the legend of Figure 1D to make this more explicit.

      b)  Similar to the first point, the strength of the correlations in the matrices of Figure 1E is very small (~ 0.05) compared to the test-retest reliabilities plotted in Figure 2B (~0.5). Again, I was confused by this large difference in scale.

      Two main factors explain the difference in values between Figure 1E and Figure 2B. First, in Figure 1E, each correlation is computed on the average activity in a window of 0.3 s, as opposed to 2.4 s in Figure 2B. More averaging leads to better SNR, which inevitably leads to higher test-retest correlations. Second, in Figure 1E, the cross-correlation matrices are averaged across all responsive voxels without any criterion for reliability. On the other hand, Figure 2B shows example voxels with good test-retest reliability.

      c)  In Figure 2D, the example voxels are supposed to be shown in white. It appears that this example voxel is only shown for the non-primary voxel. Please be sure to add these voxels throughout the other panels and figures as well. 

      We fixed this mistake and added the example voxel in all panels.

      d)  Why do the invariance results (e.g., Figure 2F) for individual animals combine across dPEG and VP, while the overall results (across all animals) split things across all three regions? The results in Table 2 do, in fact, provide this data. Upon further examination of the data in Table 2, it seems like there is only a significant difference between background invariance between dPEG and VP for one of the two animals, and that this might be what drives the effect when pooling across all animals. This seems important to both show visually in the figure and to potentially discuss. There is still very clearly a difference between primary and non-primary, but whether there is a real difference between dPEG and VP seems more unclear.

      We added the values for single animals in the plot and highlighted this limitation in the text:

      “While background invariance was overall highest in VP, the differences within non-primary areas were more variable across animals (see table 2).”

      e)  Again, as in Figure 2F, the cross symbols seem like a bad choice as markers since the vertical components of the cross are suggestive of the error of the measurement. However, no error is actually plotted in these figures. I recommend using a different marker and including some measure of error in the invariance plots.

      We replaced the crosses with circles to avoid confusion. The measure of error is provided by the representation of values for single animals.

      f) The caption for Figure 4C states that each line corresponds to one animal, but does not precisely state what this line represents. Is this the median or something?

      Each line indeed represents the median across voxels for one animal. We added this information to the legend.

      g)  In Figure 5, the captions for panels D and E are swapped.

      This has now been corrected.

      (2) Discussion:

      (a) In the paragraph on methodological differences, it mentions that the fMRI voxel size is around 2 mm. This may be true in general, but given the comparison to Kell & McDermott 2019, the voxel size should reflect that used in their study (1 mm).

      The reviewer might refer to this sentence from the methods of Kell et al., 2019: “T1-weighted anatomical images were collected in each participant (1-mm isotropic voxels) for alignment and cortical surface reconstruction.” However, this does not correspond to the resolution of the functional data, which is 2 mm, as mentioned a bit further in the Methods: “In-plane resolution was 2 × 2 mm (96 × 96 matrix), and slice thickness was 2.8 mm with a 10% gap, yielding an effective voxel size of 2 × 2 × 3.08 mm.”
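
      The effective voxel size in the quoted passage follows from simple arithmetic, sketched below. The sketch assumes that the 10% gap is expressed as a fraction of the slice thickness, which is consistent with the quoted value of 3.08 mm; the function name is made up for the example.

```python
def effective_slice_thickness(thickness_mm, gap_fraction):
    """Slice thickness plus the inter-slice gap, with the gap given as a
    fraction of the slice thickness (assumed convention)."""
    return thickness_mm * (1.0 + gap_fraction)


in_plane_mm = 2.0
through_plane_mm = effective_slice_thickness(2.8, 0.10)

# Effective functional voxel size in mm (in-plane x in-plane x through-plane).
voxel_mm = (in_plane_mm, in_plane_mm, round(through_plane_mm, 2))
print(voxel_mm)
```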

      (b) In the next paragraph on the control of attention, it mentions that attentional differences could play a role. However, in Kell & McDermott 2019, they manipulated attention (attend visual versus attend auditory) and found that it did not substantially affect the observed pattern invariance. I suppose it could potentially affect the degree to which an encoding model could explain the invariance. This seems important, and given that the data was already collected, it could be worth it to analyze that data.

      As the reviewer points out, Kell et al. 2019 ran an additional experiment in which they manipulated auditory vs. visual attention. However, the auditory task was just based on loudness and ensured that the participants were awake and paying attention to the stimuli, but not specifically to the foreground or background. This type of attention did not lead to changes in the observed patterns of invariance, which might have been the case for selective attention to backgrounds or foregrounds in the mixture. Given that these manipulations were not done in the ferret experiments, we chose to not include the analysis of this dataset in the scope of this paper. However, future work investigating that topic further would indeed be of interest.

      (c) The mention of "a convolutional neural network trained to recognize digits in noise" should make more obvious that this is visual recognition rather than auditory recognition.

      We clarified this sentence to make clear that the recognition is visual and not auditory: “For instance, in a convolutional neural network trained to visually recognize digits in different types of noise, when local feedback is implemented, early layers encode noise properties, while later layers represent clean signal.”

      (d) Finally, one explanation of the results in the discussion is that "primary auditory areas could be recruited to maintain background representations, enabling downstream cortical regions to use these representations to specifically suppress background information and enhance foreground representations." This "background-related information" being used to "facilitate further extraction of foregrounds" is similar to what is argued in Hicks & McDermott PNAS 2024.

      We thank the reviewer for suggesting this relevant reference and added it in this paragraph of the discussion.

      (3) Methods:

      In the "Cross-correlation matrices" section, it mentions that time-averaged responses from 2.4 to 4.8 s were used. It would be helpful to provide an explanation of why this particular time window was used. Additionally, I wondered whether one could look at adaptation type effects (e.g., that of Khalighinejad et al., 2019) or whether fUSI does not offer this kind of temporal precision?

      The effects shown in Khalighinejad et al., 2019, are indeed likely too fast to be observed with our methods. However, there are still dynamics in the fUSI signal and in its invariance (Figure S1). Each individual combination of foreground and background is presented for 4.8 s (Figure 1B). Therefore, we chose the range 2.4-4.8 s as the biggest window we could use (to improve SNR) while minimizing contamination from the previous or next sound (indeed, blood volume typically lags neuronal activity by 1.5-2 s). We added this clarification to the methods.

      In the "Human analyses" section, it is very unclear which set of data was used from Kell & McDermott 2019. For example, that paper contains 4 different experiments, none of which has 7 subjects. Upon closer reading, it seems that only 7 of the 11 participants from Experiment 1 also heard the background sounds in isolation (thus enabling the foreground invariance analyses). However, they stated that there were only 3 female participants in that experiment, while you state that you used data from 7 females. It would be helpful to double-check this and to more clearly state exactly which participants (i.e., from which experiment) were used and why (e.g., why not use data from Experiment 4 in the visual task/attention condition?).

      We added a sentence to clarify which datasets were used: “Specifically, we used data from Experiment 1 which provided the closest match to our experimental conditions, and only considered the last 7 subjects that heard both the foregrounds and the backgrounds in isolation, in addition to the mixtures.” 

      It was a mistake to mention that it was all female, as the original dataset has 3 females and 8 males, of which we used 7 without any indication of their sex. Thus, we removed this mention from the text.

      In the "Statistical testing" section, why were some tests done with 1000 permutations/shuffles while others were done with 2000?

      We homogenized and used 1000 permutations/shuffles for all statistical tests.

      (4) Miscellany:

      (a) The Hamersky et al. 2023 preprint has recently been published (referenced in the public review), and so you could consider updating the reference.

      This reference has now been updated.

      (b) There are a few borderline statistical tests that could use a bit more nuance. For example (on page 4), "In primary auditory cortex (MEG), there was no significant difference between values of foreground invariance and background invariance (p = 0.063, obtained by randomly permuting the sounds' background and foreground labels, 1000 times)." This test is quite close to being significant, and this might be acknowledged.

      We emphasized the trend to nuance the interpretation of these results: “In primary auditory cortex (MEG), foreground invariance was slightly lower than background invariance, although this difference was not significant (p=0.063, obtained by randomly permuting the sounds' background and foreground labels, 1000 times).”
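
      A label-permutation test of the kind mentioned in the quoted sentence can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' exact procedure: it assumes one invariance value per item for each condition, uses the difference of means as the test statistic, and reports a two-sided p-value. The function name and example values are hypothetical.

```python
import random
from statistics import mean


def label_permutation_test(fg_values, bg_values, n_perm=1000, seed=0):
    """Two-sided permutation test on the difference of means.

    Foreground/background labels are shuffled n_perm times, and the
    p-value is the fraction of shuffles whose absolute difference is at
    least as large as the observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = mean(fg_values) - mean(bg_values)
    pooled = list(fg_values) + list(bg_values)
    n_fg = len(fg_values)
    n_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_fg]) - mean(pooled[n_fg:])
        if abs(diff) >= abs(observed):
            n_extreme += 1
    return observed, (n_extreme + 1) / (n_perm + 1)


# Hypothetical, clearly separated groups.
fg = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
bg = [v + 100.0 for v in fg]
observed_diff, p_value = label_permutation_test(fg, bg)
```

      With 1000 permutations, the smallest attainable p-value under this convention is 1/1001, so a reported value such as p = 0.063 corresponds to roughly 6% of shuffles producing a difference at least as large as the observed one.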

      (5) Potential typos:

      (a)   Should the title be "natural sound mixtures" instead of "natural sounds mixtures"?

      (b) The caption for Figure 1 says "We imaged the whole auditory through successive slices across several days." I believe this should be "the whole auditory [cortex]."

      (c) In the first paragraph of the discussion, there is a sentence ending in "...are segregated in hemody-namic signal." I believe this should be "hemodynamic signal."

      These errors are now all corrected.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Authors showed the presence of Mtb in human liver biopsy samples of TB patients and reported that chronic infection of Mtb causes immune-metabolic dysregulation. Authors showed that Mtb replicates in hepatocytes in a lipid rich environment created by up regulating transcription factor PPARγ. Authors also reported that Mtb protects itself from anti-TB drugs by inducing drug metabolising enzymes.

      Strengths:

      It has been shown that Mtb induces storage of triacylglycerol in macrophages by induction of WNT6/ACC2 which helps in its replication and intracellular survival, however, creation of favorable replicative niche in hepatocytes by Mtb is not reported. It is known that Mtb infects macrophages and induces formation of lipid-laden foamy macrophages which eventually causes tissue destruction in TB patients. In a recent article it has been reported that "A terpene nucleoside from M. tuberculosis induces lysosomal lipid storage in foamy macrophages" that shows how Mtb manipulates host defense mechanisms for its survival. In this manuscript, authors reported the enhancement of lipid droplets in Mtb infected hepatocytes and convincingly showed that fatty acid synthesis and triacylglycerol formation is important for growth of Mtb in hepatocytes. Authors also showed the molecular mechanism for accumulation of lipid and showed that the transcription factor associated with lipid biogenesis, PPARγ and adipogenic genes were upregulated in Mtb infected cells.

      The comparison of gene expression data between macrophages and hepatocytes by authors is important which indicates that Mtb modulates different pathways in different cell type as in macrophages it is related to immune response whereas, in hepatocytes it is related to metabolic pathways.

      Authors also reported that Mtb residing in hepatocytes showed drug tolerance phenotype due to up regulation of enzymes involved in drug metabolism and showed that cytochrome P450 monooxygenase that metabolize rifampicin and NAT2 gene responsible for N-acetylation of isoniazid were up regulated in Mtb infected cells.

      Weaknesses:

      There are reports of hepatic tuberculosis in pulmonary TB patients especially in immune-compromised patients, therefore finding granuloma in human liver biopsy samples is not surprising.

      Mtb infected hepatic cells showed induced DME and NAT and this could lead to enhanced metabolism of drug by hepatic cells; as a result, Mtb inside HepG2 cells gets exposed to reduced drug concentration and shows higher tolerance to drug. Authors mentioned that "hepatocyte resident Mtb may display higher tolerance to rifampicin". In my opinion higher tolerance to drug is possible only when DME of Mtb inside is up regulated or target is modified. Although, in the end authors mentioned that drug tolerance phenotype can be better attributed to host intrinsic factors rather than Mtb efflux pumps. It may be better if Drug tolerant phenotype section can be rewritten to clarify the facts.

      In the revised manuscript, by immune-staining authors convincingly showed that hepatocytes are a favourable niche for replication of Mtb.

      Authors have rewritten the drug tolerant phenotype section which reads better.

      Overall, this paper has new and important information on how Mtb establishes a favourable niche for growth in hepatocytes and creates a drug tolerant environment.

      We thank the reviewer for the thorough and insightful review.

      Reviewer #2 (Public review):

      The manuscript by Sarkar et al has demonstrated the infection of liver cells/hepatocytes with Mtb and the significance of liver cells in the replication of Mtb by reprogramming lipid metabolism during tuberculosis. Besides, the present study shows that similar to Mtb infection of macrophages (reviewed in Chen et al., 2024; Toobian et al., 2021), Mtb infects liver cells but with a greater multiplication owing to consumption of enhanced lipid resources mediated by PPARg that could be cleared by its inhibitors. The strength of the study lies in clinical evaluation of the presence of Mtb in human autopsied liver samples from individuals with miliary tuberculosis and presence of a clear granuloma-like structure. The interesting observation is of granuloma-like structure in liver which prompts further investigations in the field.

      The modulation of lipid synthesis during Mtb infection, such as PPARg upregulation, appears generic to different cell types including both liver cells and macrophage cells. It is also known that infection affects PPARγ expression and activity in hepatocytes. It is also known that this can lead to lipid droplet accumulation in the liver and the development of fatty liver disease (as shown for HCV). This study is along similar lines for M.tb infection. As the liver is the main site for lipid regulation, the availability of lipid resources is greater and the replication rate is higher. In short, the observations from the study confirm the earlier studies with these additional cell types. It is known that the higher the lipid content, the greater the number of lipid droplet-positive Mtb and the higher the drug resistance (Mekonnen et al., 2021). The DMEs of liver cells add further to the phenotype.

      Comments on revised version:

      The authors noted that even in experiments where mice were infected with lower CFUs, the presence of Mtb colonies could still be detected in the liver. It would be beneficial to include some experimental data related to this in the supplementary information, as it could provide valuable insights for the research field.

      We thank the reviewer for the in-depth evaluation of our manuscript and, as suggested, we will include the data where Mtb was detected in the liver at low CFUs.

      Reviewer #3 (Public review):

      In this revised manuscript, the authors explore how Mtb can infect hepatocytes and create a favorable niche associated with upregulation of the transcription factor PPARγ which presumably allows the bacteria to scavenge lipids from lipid droplets in host cells and upregulate drug-metabolizing enzymes to protect against its elimination. In response to the review, the authors have performed some additional immunostaining of hepatocytes, added more detail to figure legends, added experiments somewhat showing improved colocalization and staining, clarified several points and paragraphs, and updated the referenced literature and discussion.

      The current manuscript provides evidence that human miliary TB patients have infection of hepatocytes with Mtb, with evidence that the bacteria survive at least partially through upregulation of PPARγ, which significantly changes the lipid milieu of the cells. There is also an examination of transcriptomics and lipid metabolism in response to Mtb infection, as well as drug tolerance of Mtb inside hepatocytes. The current manuscript is an improvement over the previous one.

      However, although the manuscript is improved, tissue immunophenotyping of the various cells in the liver remains weak and unconvincing. This is truly a missed opportunity and lessens the rigor of the central findings and conclusions. As pointed out by another reviewer, literature has described different fates of Mtb in the liver. Given the tissue available to the authors, carefully dissecting the various cells that the bacteria are in (esp. hepatocytes versus Kupffer cells) is critical. The authors use only 2 generic markers and do not distinguish among cell types within the tissue slices. A review of the literature shows a variety of both human and mouse antibody markers. In fact, a liver atlas based on immunophenotyping has been published. Likewise, the authors comment on liver granulomas, but this is not justified without immunophenotyping.

      We would like to thank the reviewer for the in-depth and detailed suggestions. We would like to clarify that the primary aim of our study was to determine the localization of Mtb within hepatocytes and the downstream biological consequences. To this end, we employed two well-established and widely validated markers (ASGPR1 and albumin) that are consistently used to identify hepatocytes in both human and murine liver tissue. While we acknowledge the broader potential of comprehensive immunophenotyping, our focused approach was designed to specifically address the question of hepatocyte involvement, which the selected markers effectively support, as also noted by Reviewer #1.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      In my opinion this paper contains important information and no further information is required for this manuscript.

      We thank the reviewer for the insightful comments.

      Reviewer #2 (Recommendations for the authors):

      The authors noted that even in experiments where mice were infected with lower CFUs, the presence of Mtb colonies could still be detected in the liver. It would be beneficial to include some experimental data related to this in the supplementary information, as it could provide valuable insights for the research field.

      As suggested, we will include the data with the low CFUs in the updated manuscript.

      Reviewer #3 (Recommendations for the authors):

      • Line 340, the fact that PPARγ inhibition decreases bacterial load should not be surprising, as the authors cite several papers where this is already shown.

      • Line 379, the increased tolerance of Mtb to drugs in hepatocytes is only significant at the lower 2 concentrations, not at 5 ug/mL.

      • Fig S4F-H, the y axis is inappropriately not set to zero on the lower limit.

      • Fig S9B, the Y-axis states "relative" CFU, but there is no indication what the bars are normalized to, and the numbers are much more typical of standard CFU values. Was the "Relative" part left in by mistake?

      • Double check the ending of the figure legend for Figure S10 and S11.

      • Line 352, phenomenom [sic] is misspelled.

      • On re-read, several sentences throughout this manuscript need improvement regarding structure and grammar. I suggest careful editorial review.

      We thank the reviewer for pointing out these issues, which will be carefully corrected in the next version.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors showed the presence of Mtb in human liver biopsy samples of TB patients and reported that chronic infection of Mtb causes immune-metabolic dysregulation. Authors showed that Mtb replicates in hepatocytes in a lipid rich environment created by up regulating transcription factor PPARγ. Authors also reported that Mtb protects itself from anti-TB drugs by inducing drug metabolising enzymes.

      Strengths:

      It has been shown that Mtb induces storage of triacylglycerol in macrophages by induction of WNT6/ACC2 which helps in its replication and intracellular survival, however, creation of favorable replicative niche in hepatocytes by Mtb is not reported. It is known that Mtb infects macrophages and induces formation of lipid-laden foamy macrophages which eventually causes tissue destruction in TB patients. In a recent article it has been reported that "A terpene nucleoside from M. tuberculosis induces lysosomal lipid storage in foamy macrophages" that shows how Mtb manipulates host defense mechanisms for its survival. In this manuscript, authors reported the enhancement of lipid droplets in Mtb infected hepatocytes and convincingly showed that fatty acid synthesis and triacylglycerol formation is important for growth of Mtb in hepatocytes. The authors also showed the molecular mechanism for accumulation of lipid and showed that the transcription factor associated with lipid biogenesis, PPARγ and adipogenic genes were upregulated in Mtb infected cells.

      The comparison of gene expression data between macrophages and hepatocytes by authors is important which indicates that Mtb modulates different pathways in different cell type as in macrophages it is related to immune response whereas, in hepatocytes it is related to metabolic pathways.

      Authors also reported that Mtb residing in hepatocytes showed drug tolerance phenotype due to up regulation of enzymes involved in drug metabolism and showed that cytochrome P450 monooxygenase that metabolize rifampicin and NAT2 gene responsible for N-acetylation of isoniazid were up regulated in Mtb infected cells.

      We thank the reviewer for the positive feedback and for highlighting the strengths of our study.

      Weaknesses:

      There are reports of hepatic tuberculosis in pulmonary TB patients especially in immune-compromised patients, therefore finding granuloma in human liver biopsy samples is not surprising.

      Mtb-infected hepatic cells showed induced DME and NAT, and this could lead to enhanced metabolism of the drug by hepatic cells; as a result, Mtb inside HepG2 cells gets exposed to a reduced drug concentration and shows higher tolerance to the drug. The authors mentioned that "hepatocyte resident Mtb may display higher tolerance to rifampicin". In my opinion, higher tolerance to drugs is possible only when the DME of the Mtb inside is upregulated or the target is modified. In the end, however, the authors mentioned that the drug tolerance phenotype can be better attributed to host intrinsic factors rather than Mtb efflux pumps. It may be better if the drug tolerant phenotype section is rewritten to clarify the facts.

      We agree that several case studies regarding liver infection in pulmonary TB patients have been reported in the literature, however this report is the first comprehensive study that establishes hepatocytes to be a favourable niche for Mtb survival and growth.

      Drug tolerance is a phenomenon exhibited by the bacteria and, during host-pathogen interactions, can be influenced by both intrinsic (bacterial) and extrinsic (host-mediated) factors. Multiple examples of tolerance being attributed to host-driven factors can be found in the literature (PMID: 32546788, PMID: 28659799, PMID: 32846197). Our studies demonstrate that Mtb-infected hepatocytes create a drug tolerant environment by modulating the expression of drug modifying enzymes (DMEs) in the hepatocytes.

      As suggested by the reviewer we will rewrite the drug tolerant phenotype section.

      Reviewer #2 (Public review):

      The manuscript by Sarkar et al has demonstrated the infection of liver cells/hepatocytes with Mtb and the significance of liver cells in the replication of Mtb by reprogramming lipid metabolism during tuberculosis. Besides, the present study shows that similar to Mtb infection of macrophages (reviewed in Chen et al., 2024; Toobian et al., 2021), Mtb infects liver cells but with a greater multiplication owing to consumption of enhanced lipid resources mediated by PPARg that could be cleared by its inhibitors. The strength of the study lies in the clinical evaluation of the presence of Mtb in human autopsied liver samples from individuals with miliary tuberculosis and the presence of a clear granuloma-like structure. The interesting observation is of granuloma-like structure in liver which prompts further investigations in the field.

      The modulation of lipid synthesis during Mtb infection, such as PPARg upregulation, appears generic to different cell types including both liver cells and macrophage cells. It is also known that infection affect PPARγ expression and activity in hepatocytes. It is also known that this can lead to lipid droplet accumulation in the liver and the development of fatty liver disease (as shown for HCV). This study is in a similar line for M.tb infection. As the liver is the main site for lipid regulation, the availability of lipid resources is greater and higher is the replication rate. In short, the observations from the study confirm the earlier studies with these additional cell types. It is known that higher the lipid content, the greater are Lipid Droplet-positive Mtb and higher is the drug resistance (Mekonnen et al., 2021). The DMEs of liver cells add further to the phenotype.

      We thank the reviewer for emphasizing the strengths of our study and how it can lead to further investigations in the field.

      Reviewer #3 (Public review):

      This manuscript by Sarkar et al. examines the infection of the liver and hepatocytes during M. tuberculosis infection. They demonstrate that aerosol infection of mice and guinea pigs leads to appreciable infection of the liver as well as the lung. Transcriptomic analysis of HepG2 cells showed differential regulation of metabolic pathways including fatty acid metabolic processing. Hepatocyte infection is assisted by fatty acid synthesis in the liver and inhibiting this caused reduced Mtb growth. The nuclear receptor PPARg was upregulated by Mtb infection and inhibition or agonism of its activity caused a reduction or increase in Mtb growth, respectively, supporting data published elsewhere about the role of PPARg in lung macrophage Mtb infection. Finally, the authors show that Mtb infection of hepatocytes can cause upregulation of enzymes that metabolize antibiotics, resulting in increased tolerance of these drugs by Mtb in the liver.

      Overall, this is an interesting paper on an area of TB research where we lack understanding. However, some additions to the experiments and figures are needed to improve the rigor of the paper and further support the findings. Most importantly, although the authors show that Mtb can infect hepatocytes in vitro, they fail to describe how bacteria get from the lungs to the liver in an aerosolized infection. They also claim that "PPARg activation resulting in lipid droplets formation by Mtb might be a mechanism of prolonging survival within hepatocytes" but do not show a direct interaction between PPARg activation and lipid droplet formation and lipid metabolism, only that PPARg promotes Mtb growth. Thus, the correlations with PPARg appear to be there but causation, implied in the abstract and discussion, is not proven.

      The human photomicrographs are important and overall, well done (lung and liver from the same individuals is excellent). However, in lines 120-121, the authors comment on the absence of studies on the precise involvement of different cells in the liver. In this study there is no attempt to immunophenotype the nature of the cells harboring Mtb in these samples (esp. hepatocytes). Proving that hepatocytes specifically harbor the bacteria in these human samples would add significant rigor to the conclusions made.

      We thank the reviewer for nicely summarizing our manuscript.

      Our study establishes the involvement of liver and hepatocytes in pulmonary TB infection in mice. Understanding the mechanism of bacterial dissemination from the lung to the liver in aerosol infections demands a detailed separate study.

      Figures 6E and 6F show how the PPARγ agonist and antagonist modulate (increase and decrease, respectively) bacterial growth in hepatocytes (further supported by the CFU data in Supplementary Figure 9B). Likewise, the number of lipid droplets in hepatocytes increases and decreases with PPARγ agonist and antagonist treatment, respectively, as shown in Figures 6G and 6H. Collectively, these studies provide strong evidence that PPARγ activation leads to more lipid droplets that support better Mtb growth.

      We thank the reviewer for finding our human photomicrographs convincing. In the manuscript, we provide evidence for the direct involvement of the hepatocytes (and liver) in Mtb infection. We have performed detailed immunophenotyping of hepatocyte cells in the mice model with ASPGR1 (asialoglycoprotein receptor 1) and in the revised version of record, we have further stained the infected hepatocytes with anti-albumin antibody.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      In my opinion drug tolerant phenotype section should be rewritten for better clarification. The manuscript contains important information about hepatic tuberculosis which are not reported yet.

      We have rewritten the drug tolerant phenotype section for better clarity.

      We appreciate the reviewer’s comments regarding important information about hepatic tuberculosis.

      Reviewer #2 (Recommendations for the authors):

      The following are some observations and comments on the manuscript.

      (1) The study delves into the mechanisms related to hepatic TB/miliary TB; however, the introduction and discussion only describe and discuss the data in the context of pulmonary TB giving a sense that the mandate of the MS is the exploration of the role of liver cells in pulmonary TB. There appears a gap in the connection of findings from the Miliary TB to the pulmonary TB. A discussion of the conversion of pulmonary TB to extrapulmonary /hepatic TB in the light of the findings may be helpful.

      We have modified the discussion section to include possible mechanisms that convert pulmonary TB to hepatic TB in the light of our findings. Briefly, pulmonary tuberculosis (TB) can lead to miliary TB, probably through hematogenous dissemination, where Mtb spreads from the infected lungs into blood vessels either from a primary lung focus, reactivated TB or caseous necrosis. Once in blood vessels, the bacteria seed multiple organs, forming tiny granulomas characteristic of miliary TB. The liver involvement could be either through direct hematogenous spread or extrusion from nearby infected lymph nodes, leading to hepatic TB, which presents with granulomas and liver dysfunction. This spread underscores the severity of untreated pulmonary TB and the need for early intervention. Our in vivo infection data clearly show that pulmonary infection with Mtb in mice and guinea pigs can steadily lead to significant infection of the liver and metabolic abnormalities in the liver. The study further highlights the need for systemic studies to better understand the route and mode of dissemination from lungs to liver, both for a better pathophysiological understanding of the disease and for creating new therapeutic targets.

      (2) The authors show the presence of Mtb in the liver autopsies of miliary tuberculosis patients. It is well known that Mtb disseminates during the late stages to several organs and liver is a major site (Sharma et al. 2005; 10.1016/S1473-3099(05)70163-8). Other clinical observations also point to the fact that although Mtb infects liver cells, it is cleared (Thandi et al., 2018, https://doi.org/10.4049/jimmunol.200.Supp.173.20). As the samples are from miliary TB, it is expected that the bacterial load must have been very high before spreading to blood. It is known that once in blood, M.tb is expected to spread to various organs, especially highly vascular ones. Were any other tissues (especially with high vasculature) stained and verified? If yes, add to the supplementary data or discuss.

      Other tissues were not collected and stained during this study. Studies are currently underway to understand whether other vascularized organs also harbour Mtb. Besides, several studies have shown that Mtb can infect a wide range of organs, such as brain, kidney and bone marrow (PMID: 33142108, PMID: 28046053, PMID: 34269789), during miliary conditions.

      (3) It is not evident from this paper if hepatic infiltration occurs in pulmonary TB patients? It may therefore be important to discuss the status of liver infections in the primary pulmonary infection.

      Based on the available data from human biopsied liver samples, there is an indication of liver involvement in systemic tuberculosis (TB). However, to gain a more comprehensive understanding of hepatic infiltration in pulmonary TB patients, it is essential to conduct well-organized clinical studies. These studies should specifically target pulmonary TB patients and explore the extent and nature of liver involvement in these individuals. As suggested by the reviewer, this point is now included in the discussion.

      (4) Similarly, in the mice model, M.tb was shown to localize to liver when aerosolic infection was given. Were any other tissues, such as kidney, bone marrow etc, checked? Is it because of the high dose of M.tb against the standard challenge dose of 50-100 CFU? Further, since the study in the mouse model is to mimic a miliary tuberculosis of liver, did the dissemination occur via bloodstream and if mycobacteremia could be observed in infected mice.

      Currently studies are underway to understand the involvement of other organs like kidney, brain, bone marrow, in aerosol infection mice model and how dissemination occurs in those distant organs.

      The focus of the current study was to understand the role of liver in systemic tuberculosis with emphasis on hepatocytes as a key cell type to be infected. We have also conducted the experiments with lower CFUs and could detect the presence of Mtb colonies in liver, so we do not think that the infection of liver is dependent on the dose of infection.

      (5) There are studies in mouse model which infer that liver carried the lowest bacterial burden, was cleared the fastest, and it is established that as compared to sites persistently seeded by M. tuberculosis, in the liver the bacteria rarely infect cell types other than professional phagocytes. As the observations in this study are contrasting, the discussion section should include a critical comparative analysis to justify why in the conditions used in the study, the hepatocytes and not Kupffer cells are infected. Other than the morphological description to indicate M.tb infection of hepatocytes in the liver section (fig 1E), it will be good to show localization of M.tb specifically to hepatocytes by using hepatocyte specific marker. Unlike as reported, why was a clearance of M.tb not observed even after 10 weeks (figure 2B).

      While some studies show that Mtb is cleared quickly from the liver, several other studies report that the liver harbours Mtb even after 10 weeks post infection (PMID: 22359543, PMID: 21533158, PMID: 29242198). We have consistently observed Mtb infection of the liver after week 10 in our infection model.

      We have performed detailed immunophenotyping of hepatocyte cells in the mice model with ASPGR1 (asialoglycoprotein receptor 1) and, in the revised version of record, we have further stained the isolated hepatocytes with anti-albumin antibody (albumin is a robust marker of hepatocyte identity) and have shown the presence of Mtb in them. The data have been included in the revised manuscript (Fig 2J).

      (6) While the result section mentions 'individuals with miliary tuberculosis' (line 107), the legend of Figure 1 writes 'Presence of Mtb in human pulmonary tuberculosis patients'. This is confusing. Clarify.

      We thank the reviewer for pointing it out. We have changed the figure legends to miliary tuberculosis, as most of the liver biopsy samples were obtained from miliary tuberculosis patients.

      (7) Supplementary Figure 2D: Corresponding control panel (uninfected) should be added, which will also verify the specificity of Ag85b. As it is known that Ag85B is secreted out from the bacteria and hence the detected signals may not confirm that Mtb is in hepatocytes. Ag85B per bacterium decreases by almost 10,000-fold at later stages of infection because of secretion (Ernst JD, Cornelius A, et al 2019 mBio). In Supl figure 2D, Ag85b signal seems to be present everywhere inside the cells. Hence, it is important that the control panel be added.

      We have included a control image below, which shows no Ag85B staining in the uninfected sample. While we acknowledge the reviewer’s comment, Ag85B has been consistently used as a marker for Mtb presence in multiple studies. Nargan et al. use Ag85B-based staining to characterize infection in both pulmonary and EPTB samples (PMID: 38880068). Jain et al. use Ag85B to characterize Mtb infection of mesenchymal stem cells in lung biopsy samples of pulmonary TB patients (PMID: 32546788).

      Author response image 1.

      Ag85B staining in uninfected mice shows no signals

      (8) The kinetics experiments in Figure 3D-3G should have used time-lapse microscopy of a few of the infected cells, or the data should be represented in CFU. If we consider that the doubling time of H37Rv is about 22h to 24h, the data showing that MFI increases dramatically from 5 HPI to 120 HPI give the impression that the bacterial number inside the cells increased faster than its doubling time would allow.

      We have added the modified plot. As suggested, the CFU of Mtb within HepG2, PHCs, THP-1, RAW 264.7 and BMDMs have been included in the revised version (Supplementary Figure 4 D-H)
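
      The arithmetic behind comment (8) can be sketched as follows. This is a minimal illustration, assuming unrestricted exponential growth with the 22h to 24h doubling time quoted by the reviewer and the 5 HPI to 120 HPI window discussed for Figure 3D-3G; the function name max_fold_increase is hypothetical and does not come from the manuscript. Under these assumptions, bacterial numbers could rise by at most roughly 28- to 37-fold over the window, which gives a reference against which an MFI increase can be compared.

```python
def max_fold_increase(start_hpi: float, end_hpi: float, doubling_time_h: float) -> float:
    """Upper bound on the fold increase in bacterial number.

    Assumes unrestricted exponential growth (no lag, no killing),
    which is an idealisation and not a claim about the actual data.
    """
    if doubling_time_h <= 0:
        raise ValueError("doubling time must be positive")
    elapsed_h = end_hpi - start_hpi
    # Number of doublings that fit in the window, then 2 to that power.
    return 2 ** (elapsed_h / doubling_time_h)


# Doubling times quoted in the reviewer's comment (hours).
for doubling_time in (22.0, 24.0):
    bound = max_fold_increase(5, 120, doubling_time)
    print(f"doubling time {doubling_time:.0f} h: at most {bound:.1f}-fold increase")
```

      A fold change in MFI well above this bound would point to something other than replication alone, for example continued uptake or signal saturation effects, which is why CFU data are a useful cross-check.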

      (9) What is the effect of C45 and T863 on Mtb growth in vitro? A direct effect of C45 and T863 on Mtb growth in vitro should be shown to be ruled out. Is the representative image in Figure 5F from the DMSO or the C45 treated cells panel? Please specify it.

      As per the reviewer’s suggestion, we have tested the effect of C45 (30 µM) and T863 (25 µM) on Mtb growth in vitro and did not find any difference in the growth kinetics. The representative image in Figure 5F shows DMSO treated cells.

      Author response image 2.

      Growth kinetics of Mtb in 7H9 medium with DMSO, C75 and T863

      (10) Supplementary Figure 6B: Correct the Y-axis label from mRNA levels to Fold change (normalised to control). Please do similar changes wherever required.

      We have made the necessary changes as per the suggestion of the reviewer.

      (11) Figure 7B and 7C: How was the normalization performed? Is the data normalized to the number of bacteria that entered the specific cell type or was normalized at 48hrs with respect to DMSO? DMSO alone data should be shown.

      In the drug tolerance assays, we have calculated the ratio of the bacterial burden in hepatocytes treated with drugs to that in hepatocytes treated with DMSO. The infection was allowed to proceed for 48 hours, after which the infected cells were treated with the mentioned concentrations of isoniazid and rifampicin for 24 hours. CFU enumeration was conducted after this 24-hour treatment. Figure 7A gives a schematic of the experimental set up.

      % Tolerant bacterial population = (A/B) × 100, where A is the CFU of Mtb from infected hepatocytes treated with drug and B is the CFU of Mtb from infected hepatocytes treated with DMSO. Thus the effect of MOI is negated.

      To provide further credence to the CFU data, we have also analysed these experiments by microscopy, where no cell death was observed under the conditions. Mouse BMDMs were used as a macrophage control. We have calculated the % tolerance as the ratio of the mean fluorescent intensity (MFI) of GFP-Mtb per hepatocyte treated with drug to the MFI of GFP-Mtb per hepatocyte treated with DMSO (control). More than 20 fields, each consisting of more than 4 infected cells, have been used for analysis, providing additional evidence of less killing of Mtb in hepatocytes compared to BMDMs with anti-TB drugs. All these details are included in the manuscript.
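
      The normalization described in this response can be written out as a short sketch. It is only an illustration of the stated ratio, not the authors' analysis code: the function name percent_tolerant and the CFU values are hypothetical and are not taken from the manuscript. The same ratio applies to the microscopy readout if MFI per hepatocyte is substituted for CFU.

```python
def percent_tolerant(readout_drug: float, readout_dmso: float) -> float:
    """Percent tolerant bacterial population: (A / B) x 100.

    readout_drug: CFU (or MFI per cell) from infected cells treated with drug (A).
    readout_dmso: CFU (or MFI per cell) from infected cells treated with DMSO (B).
    """
    if readout_dmso <= 0:
        raise ValueError("the DMSO control readout must be positive")
    return readout_drug / readout_dmso * 100.0


# Hypothetical CFU counts, chosen only to illustrate the calculation.
print(percent_tolerant(3.0e4, 1.2e5))  # 25.0

# If the initial bacterial uptake doubles in both arms, the ratio is unchanged,
# which is the sense in which the effect of MOI is negated.
print(percent_tolerant(6.0e4, 2.4e5))  # 25.0
```

      Because each cell type is divided by its own DMSO control, the value compares killing within a cell type rather than absolute burden across cell types.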

      (12) While authors have shown the changes in mRNA levels of CYP3A4, CYP3A43, NAT2, the protein or activities of some of these should be measured to verify the effect.

      Currently studies are underway to understand the activities of the key proteins involved in isoniazid and rifampicin metabolism and will be published as a separate manuscript.

      Reviewer #3 (Recommendations for the authors):

      Additional comments are:

      • Figure 2D, the 20X and 40X magnifications do not look appreciably different in size. Please double-check that the correct images were used.

      We thank the reviewer for pointing it out. We have corrected it in the revised version.

      • Lines 162-164: The authors state almost 100% purity. However, the contour plot in 2F appears to show 2 cell populations. Figure 2G is missing a legend of which colors correspond to which staining (and again there appears to be highly variable staining).

      We agree with the reviewer that there are two contours observed in Figure 2F. Although both contours are positive for ASPGR1 protein, the level of expression of the ASPGR1 protein is variable. The corresponding confocal image (nucleus stained by DAPI and ASPGR1 stained with ASPGR1 antibody and Alexa Fluor 555 conjugated secondary antibody) also indicates variable staining of isolated primary hepatocytes, where some cells give a stronger intensity signal than others, further visually confirming our statement. Moreover, several studies show differential expression of ASPGR1 protein in hepatocyte-like cells (PMID: 27143754).

      To further clarify and be more specific with respect to the identity of the hepatocytes, we have stained primary hepatocytes from infected mouse livers with albumin antibody (a stable marker for hepatocytes) and Ag85B (Figure 2J).

      Multiple figures throughout the manuscript, including this one, would benefit from the use of arrows to depict what is described in the legend and text more clearly, and the use of higher power insets to better define cell architecture. Finally, some images appear blurry to the eye. Improvements are needed throughout.

      As per the suggestion, we have modified the figures and figure legends for better clarity.

      • Lines 153-155. Albumin, AST and GGT appear to be significantly up at week 8, contradicting the statement that there is no change until week 10.

      We thank the reviewer for pointing it out and have made suitable changes in the write-up.

      • Lines 203-205: The authors state earlier that bacteria survive in macrophage phagosomes. Do the authors know the niche for bacteria in hepatocytes that enable them to continue to grow? Transcriptome data from HepG2 cells suggest perhaps a phagosomal pathway?

      We thank the reviewer for this insightful question. As rightly pointed out by the reviewer, the transcriptomic data indeed suggest changes in several important pathways, such as macroautophagy, Golgi vesicular transport and vacuolar transport, which can affect the subcellular localisation of Mtb within hepatocytes. High-resolution microscopic studies of the subcellular localisation of labelled Mtb within primary hepatocytes (PHCs), HepG2 and THP-1 cells have been conducted, and the % colocalization within different intracellular compartments has been measured. An image of the colocalization of labelled Mtb within PHCs is shown below, along with the % colocalization within various compartments in PHCs, HepG2 and THP-1.

      Author response image 3.

      Colocalisation of Mtb-GFP with various intra-cellular markers within PHCs.

      Author response image 4.

      Percentage Colocalisation of Mtb-GFP with various intra-cellular markers within PHCs, HepG2 and THP-1.

      • Validation of some critical genes found in the HepG2 cells should be done by qRT-PCR in primary hepatocytes.

      Some of the key genes identified in HepG2 have been validated by qRT-PCR in primary hepatocytes at 24 hours post infection. The majority of the genes show a similar trend.

      Author response image 5.

      Gene expression analysis of the mentioned genes in Mtb infected PHCs as compared to the uninfected control.

      • Lines 259-260: The authors state a high degree of co-localization. The photomicrograph of a single cell in Fig. 5D is not convincing. I'm not even sure that they are really in the same subcellular compartment. Co-localization stated in Fig. S8B is also not convincing as shown.

      The image currently shown in figure 3D is a maximum intensity projection image of multiple z-stacks encompassing the entire cell.

      We agree with the reviewer with respect to Fig S8B and will modify the text and the figure legend accordingly.

      Copywriting edits:

      • It is difficult to see individual gene names in Figures 4D and 4E. A higher resolution or larger font would be appreciated for the reader.

      An Excel file with the top differentially regulated genes at both 0 hours post infection and 48 hours post infection has been added.

      • Figure 5A has a shadow on the top right image.

      We have changed the image in the revised manuscript.

      • Figure 5E is difficult to read the labels on the axes; it would be better in general to make the labels separately instead of relying on the graphing software, since these labels can get stretched when the size of the graph is modified.

      We agree with the reviewer and have made necessary changes.

      • Line 163: should be "percent" and not "perfect."

      We thank the reviewer for pointing it out and have corrected it

      • Line 190: is missing a period at the end of the sentence "...for further experiments"

      We thank the reviewer for pointing it out and have corrected it

      • Line 332: should be "hepatocytes" instead of "hepatoctyte" [sic]

      We thank the reviewer for pointing it out and have corrected it

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      Summary: 

      In the retina, parallel processing of cone photoreceptor output under bright light conditions dissects critical features of our visual environment and is fundamental to visual function. Cone photoreceptor signals are sampled by several types of bipolar cells and passed onto the ganglion cells. At the output of retinal processing, retinal ganglion cells send about 40 different codes of the visual scene to the brain for further processing. In this study, the authors focus on whether subtype-specific differences in the size of synaptic ribbon-associated vesicle pools of bipolar cells contribute to different retinal ganglion cell (RGC) responses. Specifically, inputs to ON alpha RGCs producing transient versus sustained kinetics (ON-T vs. ON-S, respectively) are compared. The authors first demonstrate that ON-S vs. ON-T RGCs are readily identifiable in a whole mount preparation and respond differently to both a static and a spatially uniform, randomly fluctuating (Gaussian noise) light stimulus. Linear-nonlinear (LN) models were used to estimate the transformation between visual input and excitatory synaptic input for each RGC; these models suggested the presence of transient versus sustained kinetics already in the excitatory inputs to ON-T and ON-S RGCs. Indeed, the authors show that (glutamatergic) excitatory inputs to ON-S vs. ON-T RGCs are of distinct kinetics. The subtypes of bipolar cells providing input to ON-S are known (i.e., type 6 and 7), but the source of excitatory bipolar inputs to ON-T RGCs needed to be determined. In a tedious process, it is elegantly shown here that ON-T RGCs receive most of their excitatory inputs from type 5 and 6 bipolars.
Interestingly, the temporal properties of light-evoked responses of type 5, 6, and 7 bipolars recorded from the somas were indistinguishable and rather sustained, suggesting that the origin of transient kinetics of excitatory inputs to ON-T RGCs suggested by the LN model might be found in the processing of visual signals at the bipolar cell axon terminal. Blocking GABA- or glycinergic inhibitory inputs did not alter the light-evoked excitatory input kinetics to ON-T and ON-S RGCs. Two-photon glutamate sensor imaging revealed significantly faster kinetics of light-evoked glutamate signals at ON-T versus ON-S RGCs. Detailed EM analysis of bipolar cell ribbon synapses onto ON-T and ON-S RGCs revealed fewer ribbon-associated vesicles at ON-T synapses, which is consistent with stronger paired-flash depression of light-evoked excitatory currents in ON-T RGCs versus ON-S RGCs. This study suggests that bipolar subtype-specific differences in the size of synaptic ribbon-associated vesicle pools contribute to transient versus sustained kinetics in RGCs.

      Strengths: 

      The use of multiple, state-of-the-art tools and approaches to address the kinetics of bipolar to ganglion cell synapse in an identified circuit. 

      Weaknesses: 

      For the most part, the data in the paper support the conclusions, and the authors were careful to try to address questions in multiple ways. A two-photon glutamate sensor imaging experiment showing that blocking GABA- and glycinergic inhibition does not change the kinetics of light-evoked glutamate signals at ON-T RGCs would strengthen the conclusion that bipolar subtype-specific differences in the size of synaptic ribbon-associated vesicle pools contribute to transient versus sustained kinetics in RGCs.

      Thank you for this suggestion. We have revised the text throughout to be careful not to imply that amacrine cells have no role in shaping EPSCs and spike output, but instead that the transience of the On-T responses persists without amacrine cells (see for example lines 91, 450-453, 514-518, 696-714). We have also added additional iGluSnFR experiments to the paper to further test this conclusion (new Figure 7). The new data shows that the transience of glutamate release from the On-T cells is retained when 1) spiking amacrine cell activity is suppressed by blocking voltage-gated Na<sup>+</sup> channels with TTX or 2) all amacrine cell activity is suppressed by blocking AMPA receptors with NBQX. This does provide nice additional evidence that amacrine cells are not necessary for the sustained/transient distinction.

      Reviewer #2 (Public Review): 

      Summary: 

      Goal of the study. The authors tried to pinpoint the origins of transient and sustained responses measured at retinal ganglion cells (rgcs), which is the output layer of the retina. Response characteristics of rgcs are used to group them into different types. The diversity of rgc types represents the ability of the retina to transform visual inputs into distinct output channels. They find that the physical dimensions of bipolar cell's synaptic ribbons (specialized release sites/active zones) vary across the different types of cone on-bpcs, in ways that they argue could facilitate transient or sustained release. This diversity of release output is what they argue underlies the differences in on-rgcs response characteristics, and ultimately represents a mechanism for creating parallel cone-driven channels. 

      Strengths: 

      The major strengths of the study are the anatomical approaches employed and the use of the "glutamate sniffer" to assay synaptic glutamate levels. The outline of the study is elegant and reflects the strengths of the authors. 

      Weaknesses: 

      The major weakness is that the ambitious outline is not matched with a complete set of results, and the set of physiological protocols is disjointed, not sufficient to bridge the systems-level question with the presynaptic release question. 

      Thank you for this comment as it provides an opportunity (here and in the paper) for us to clarify our main goal. We wanted to link the well-established distinction between transient and sustained retinal responses to anatomy. This required locating where this difference arises within the circuitry – which we show to be at least largely the bipolar output synapse – and then examining the structure of this synapse in detail. While we would certainly be interested in connecting our results to a biophysical description of the synapse, that was not the primary focus of our study and was not something we could add without substantial additional work.  

      Major comments on the results and suggestions. 

      The ribbon model of release has been explored for decades and needs to be further adapted to systems-level work. The study under consideration by Kuo et al. takes on this task. Unfortunately, the experimental design does not permit a level of control over presynaptic/bpc behavior that is comparable to earlier studies, nor do they manipulate release in ways that test the ribbon model (i.e., paired recordings or Ribeye-ko). Furthermore, the data needs additional evaluation, and the presentation and interpretations should draw on published biophysical and molecular studies. 

      As described above, our goal was to test several possible explanations for the difference between transient and sustained responses in OnT and OnS ganglion cells: (1) differences in the light responses of the bipolar cells that convey photoreceptor signals to the relevant ganglion cells; (2) shaping of bipolar transmitter release by presynaptic inhibition; (3) shaping of ganglion cell responses by postsynaptic inhibition or spike generation; (4) differences in feedforward bipolar synapses. We were surprised to find that the feedforward bipolar synapses play a central role in this difference, and your comment nicely prompts us to relate this to the large literature on biophysical studies of release from ribbon synapses. We have made substantial revisions in the text to do this. This includes anticipating the importance of feedforward synaptic properties in the abstract and introduction (lines 36-37 and 61-64), pointers in the results (lines 539-548), and several new paragraphs in the discussion (starting on lines 751, 773 and 787). By showing that the transient/sustained difference originates largely at feedforward bipolar synapses, we set the stage for future work that shows how biophysical properties of the synapse shape physiological signals that traverse it.

      To build a ribbon-centric context, consider recent literature that supports the assertion that ribbons play a role in forming AZ release sites and facilitating exocytosis. Reference Ribeye-ko studies. For example, ribbonless bpcs show an 80% reduction in release (Maxeiner et al EMBO J 2016), the ribbonless retina exhibits signaling deficits at the output layer (Okawa et al ...Rieke, ..Wong Nat Comm 2019), and ribbonless rods show an 80% reduction in the readily releasable pool (RRP) of SVs (Grabner Moser, eLife 2021). In addition, the authors could refer to whole-cell membrane capacitance studies on mammalian rods, cones, and bpcs, because the size of the RRP of SVs scales with the dimensions and numbers of ribbons (total ribbon footprint). For bipolar cells, see the review by Wan and Heidelberger 2011. For a comparison of mammalian rods and cones, see, rods: Grabner and Moser (2021 eLife), Mueller.. Regus Leidig et al. (2019; J Neurosci) and cones Grabner ...DeVries (Nat Comm 2023). A comparison of cell types shows that (1) the extent of release is proportional to the total size of the ribbon footprint, and (2) less release is witnessed when ribbons are deleted (also see photo ablation studies by Snellman.... And Mehta..Zenisek, Nat Neurosci and Neuron).

      Thank you for these pointers into the literature.  We have included much of this work in the revised Discussion (see three paragraphs starting on line 751). The revised text focuses on the evidence that larger and more numerous ribbons lead to increased release. The direct evidence from previous work for this relationship supports our (indirect) conclusions in the current paper about the role of ribbon size and associated vesicle pools in transient vs sustained responses.  

      Ribbon morphology may change in an activity-dependent manner. The rod ribbon AZ has been reported to lengthen in the dark (Dembla et al 2020), and deletion of the ribbon shortens the length of the AZ (defined by Cav1.4 or RIM2); in addition, the Ribeye-ko AZs fail to change in size with light and dark conditioning. Furthermore, EM studies on rod and cone AZs in light and dark argue that the number of SVs at the base of the ribbon increases in the dark, when PRs are depolarized (see Figure 10, Babai et al 2016 JNeurosci). Lastly, using goldfish Mb1 on-bipolars, Hull et al (2006, J Neurophysio) correlated an increase in release efficiency with an increase in ribbon numbers, which accompanied daylight. >> When release activity is high, ribbon AZ length increases (Dembla, rods), the number of docked SVs increases (Babai, rods cones), and the number of ribbons increases (Hull, diurnal Mb1s). 

      We have extensively revised the discussion section to include more discussion of ribbons, particularly emphasizing evidence supporting the general argument that larger ribbons support higher release rates. We focused on studies that provided direct links between release rates and ribbon size or number of ribbon-associated vesicles. This includes studies that pair electrophysiology and anatomy and those that measure the consequences of ablating ribbons.

      The results under review, Kuo et al., were attained with SBF-SEM, which has the benefit of addressing large-volume questions as required here, yet it achieves lower spatial resolution than what is attained with TEM tomography and FIB-EM. Ideally, the EM description would include SV size, and the density of ribbon-tethered SVs that are docked at the plasma membrane, because this is where the SVs fuse (additional non-ribbon release sites may also exist? Mehta ... Singer 2014 J Neurosci). Studies by Graydon et al 2011 and 2014 (both in J Neurosci), and Jean ... Moser et al 2018 (eLife) are good examples of quantitative estimates of SV docking sites at ribbons. SBF-SEM does not allow for an assessment of SVs within 5 nm of the PM, but if the authors can identify the number of SVs that appear within the limit of resolution (10 to 15 nm) from the PM, then this data would be useful. Also, what dimension(s) of the large ribbons make them larger? Typically, ribbons are fixed in height (at least in the outer retina, 200 to 250 nm), but their length varies and the number of ribbons per terminal varies. Is the larger ribbon size observed in type 6 bpcs due to longer ribbons, or taller ribbons? A longer ribbon likely has more docked SVs. An additional possibility is that more SVs are about the ribbon-PM footprint, either more densely packed and/or expanding laterally (see definitions in Jean....Moser, eLife 2018). 

      We have included an additional analysis of ribbon surface area from our 3D SBFSEM reconstructions. As with the volume measurements included in the original submission, ribbon surface areas are distinct between type 5i and type 6 bipolar cells (Fig. S10A), ON-T RGCs on average receive input from ribbons with smaller surface area than ON-S RGCs (Fig. S10B), and ribbon surface area predicts the number of adjacent vesicles across bipolar cell types (Fig. S10C).  We agree that a higher resolution view of presynaptic structures would be very helpful, but the resolution of our SBF-SEM data is limited (e.g. each pixel is 40 nm on a side).  This resolution does not allow us to distinguish between vesicles at vs near the membrane. 

      In our observations, both the length and the height of the ribbons varied across individual bipolar cells, and ribbons in type 6 bipolar cells tended to be longer and/or taller than those in type 5 cells. We agree that a longer ribbon may accommodate more docked SVs. A more definitive analysis would benefit from higher-resolution, isotropic 3D reconstructions of ribbons, which would allow more precise shape analysis together with a detailed assessment of docked SVs at the ribbons.

      The ribbon literature given above makes the argument that ribbons increase exocytotic output, and morphological studies suggest that release activity enhances 1) ribbon length (Dembla) and 2) the density of SVs near the PM (Babai). These findings could lead one to propose that type 6 bpcs (inputs to On-sustained) are more active than type 5i (feed into On-transient). Here Kuo et al. show that the bpcs have similar Vm (measured from the soma) in response to light stimulation. Does Vm predict release? Not entirely as the authors acknowledge, because: Cav channel properties, SV availability, and negative feedback are all downstream of bpc Vm. The only experiment performed to test downstream factors focused on negative feedback from amacrines. The data presented in Figures 5C-F led me to conclude the opposite of what the authors concluded. My impression is that the T-ON rgc exhibits strong disinhibition when GABA-blockers are applied (the initial phase is greatly increased in amplitude and broadened with the drug), which contrasts with the S-On rgc responses that show a change in the amplitude of the initial phase but not its width (taus would be nice). Here and in many places the authors refer to changes in release kinetics, without implementing a useful description of kinetics. For instance, take the cumulative current (charge) in Figure 5C and fit the control and drug traces to arrive at taus, and their respective amplitudes, and use these values to describe kinetic phases. One final point, the summary in Figure 5D has a p: 0.06, very close to the cutoff for significance, which begs for more than an n = 5. Given that previous studies have shown that bpc output is shaped by immediate msec GABA feedback, in ways that influence kinetic phases of release (..Mb1 bipolars, see Vigh et al 2005 Neuron), more attention to this matter is needed before the authors rule out feedback inhibition in favor of ribbon size. If by chance, type 5i bpcs are under uniquely strong feedback inhibition, then ribbon size may result from less activity, not less output resulting from smaller ribbons.

      The text surrounding Figure 5 led to some confusion, and we have revised that text and the figure for clarity.  First, the data in that figure is entirely from On-T cells (the upper and lower panels show block of GABA and glycine receptors separately).  Second, the observation that we make there is that block of inhibitory receptors increases the transience of the On-T excitatory input, rather than decreasing it as would be expected if the transience is created by presynaptic inhibition. We have added additional data and that increase in transience is now significant. Inhibitory block does substantially increase the amplitude of the postsynaptic response, and a likely origin of this change in response is inhibitory feedback to the bipolar synaptic terminal. We now indicate this in the text on page 13, lines 438-453. 

      The key result of this figure for our purposes here is that the transience of the excitatory input to the On-T cell remains with inhibitory input blocked. We have clarified throughout the text that our results indicate that inhibitory feedback is not necessary for the difference between transient release onto On-T and sustained release onto On-S. This does not mean that inhibitory feedback does not shape the responses in other ways or contribute to the transient/sustained difference - just that for the specific stimuli we use that difference is retained without presynaptic inhibition. We have also added citations to past work showing that activity of amacrine cells can modulate bipolar transmitter release. 

      Whether strong feedback inhibition limits activity and therefore limits ribbon size in an activity-dependent way is an intriguing possibility. Indeed, addressing why ribbons are larger in type 6 bipolar cells vs. other bipolar types will be an interesting avenue of further study. However, it would be surprising if ribbon sizes changed during the acute pharmacological block conditions (~10-15 minutes) we employed in our study. Our point here is that there is an interesting correlation between presynaptic ribbon size and the kinetics of glutamate release. We do not think that the two possibilities stated in the last sentence (“…ribbon size may result from less activity, not less output resulting from smaller ribbons”) are mutually exclusive.

      We have not further quantified the response kinetics in the experiments of Figure 5 as the large changes induced by the pharmacology (especially GABA receptor block) make it unclear how to interpret quantitative differences.  In other places we have quantified kinetics through the STA or specified that our focus was more qualitative (i.e. transient vs sustained kinetics). 

      As mentioned above, the behavior of Cav channels is important here. This is difficult to address with voltage clamps from the soma, especially in the Vm range relevant to this study. Given that it has previously been modeled that the rod bpc to AII pathway adapts to prolonged depolarization of rbcs through downregulating Cav channel-mediated Ca<sup>2+</sup> influx (Grimes ....Rieke 2014 Neuron), it seems important for Kuo et al to test if there is a difference in Cav regulation between type 6 and 5i bpcs. Ca<sup>2+</sup> imaging with a GCaMP strategy (Baden....Lagnado Current Biology, 2011) or filling the presynapse with Ca dyes (see inner hair cells: Ozcete and Moser, EMBO J 2020) would allow for the correlation of [Ca]intra with GluSnf signals (both local readouts).

      This is a good suggestion but is outside the scope of our current paper. Our focus was on the circuit origin of the difference in response of the OnT and OnS responses rather than the specific biophysical mechanism.  We are of course interested in the mechanism, but the additional experiments needed to pin that down would need to be a part of future experiments. The work here represents an important step in that direction as it greatly reduces the number of possible locations and mechanisms for the sustained/transient difference and hence serves to focus any future mechanistic investigations.

      Stimulation protocol and presentation of Glutamate Sniffer data in Figure 6. In all of your figures where you state steady state as a % of pk amplitude, please indicate in the figure where you estimate steady state. Alternatively, if you take the cumulative dF/F signal, then you can fit the different kinetic phases. From the appearance of the data, the Sustained Glu signals look like square waves (Figure 6B ROI1-4), without a transient at onset, which is not predicted in your ribbon model that assumes different kinetic phases (1. depletion of docked SVs, and 2. refilling and repriming). The Transient responses (Figure 6B ROI5-8) are transient and more compatible with a depressing ribbon scheme. If you take the cumulative, for all of the On-S and compare it to all of the On-T responses, my guess is the cumulative dF/F will be 10- to 20-fold larger for the S-On. Would you conclude that bpc inputs to On-S (type 6) release 20-fold more SVs per 4 seconds on a per ribbon basis, and does the surface area of the type 6 bpcs account for this difference? From Figures 8B and D, the volume of the ribbon is ~2 fold greater for type 6 vs 5i, but the Surface Area (both faces of ribbon) is more relevant to your model that claims ribbon size is the pivotal factor. If making cumulative traces, and comparisons on an absolute scale is unfounded, then we need to know how to compare different observations. The classic ribbon models always have a conversion factor such as the capacitance of an SV or q size that is used to derive SV numbers from total dCm or Qcontent. See Kim ....et al von Gersdorff, 2023, Cell Reports. Why not use the Gaussian noise stimulus in Fig 6 as in Figure 1 and 2? 

      For iGluSnFR recordings, steady-state responses were measured from the mean fluorescence over the last 1 sec of the light step (2 sec duration) response. We have included this information in the figure caption and in the Methods. 
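
      The measurement described above (mean over the last 1 sec of a 2 sec step, expressed relative to the peak) can be sketched as follows. This is only an illustration of the arithmetic, not the analysis code used for the paper: the function name, the sampling interval, the baseline handling and the two synthetic traces are all assumptions made for the example.

```python
def steady_state_fraction(trace, dt, step_onset, step_duration=2.0,
                          window=1.0, baseline=0.0):
    """Steady-state response as a fraction of the peak response.

    trace         -- fluorescence samples (e.g. dF/F), one per time bin
    dt            -- sampling interval in seconds (assumed uniform)
    step_onset    -- time of light-step onset in seconds
    step_duration -- light-step length in seconds (2 s in the text)
    window        -- trailing window used for the steady state (1 s in the text)
    baseline      -- value subtracted before measuring (assumption)
    """
    start = int(round(step_onset / dt))
    stop = int(round((step_onset + step_duration) / dt))
    step = [v - baseline for v in trace[start:stop]]
    n_win = int(round(window / dt))
    peak = max(step)
    if peak == 0:
        raise ValueError("no response during the step")
    # Mean over the final `window` seconds of the step.
    steady = sum(step[-n_win:]) / n_win
    return steady / peak


# Synthetic traces (invented for illustration), sampled every 100 ms:
# 1 s of baseline, a 2 s light step, then 1 s of baseline.
dt = 0.1
pre, post = [0.0] * 10, [0.0] * 10
sustained = pre + [1.0] * 20 + post
transient = pre + [1.0, 0.6, 0.4, 0.3] + [0.2] * 16 + post

sustained_fraction = steady_state_fraction(sustained, dt, step_onset=1.0)
transient_fraction = steady_state_fraction(transient, dt, step_onset=1.0)
```

      With these invented traces the square-wave response gives a fraction near 1 and the decaying response a fraction near 0.2, which is the sense in which a lower steady-state/peak value indicates a more transient response.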

      There is a good deal of variability in the iGluSnFR responses from one ROI to another, and the ROIs shown in the original submission had a less prominent transient component than many other ROIs. We have replaced this figure with another that is more representative of the average behavior across ROIs. The full range of behavior is captured in Figure 6C; it is clear across ROIs that glutamate release near ON-S dendrites shows both sustained and transient components. The new experiments in which we block amacrine cell activity also include a few more example ROIs from ON-S cells, and those also show both transient and sustained components.

      Your suggestion to integrate the iGluSnFR signals to compare to our structural analysis of ribbons is interesting. However, we are hesitant to make a quantitative comparison between the two without further experiments to validate how the iGluSnFR signals we measure relate to release of single vesicles. For example, a quantitative measure of release based on the iGluSnFR experiments would require accounting for possible differences in the expression of the indicator - which could differ both in overall level and/or location relative to release sites. 

      This comment and one above highlight the importance of measures of ribbon surface area, which we now provide (Figure S10).

      Figure 7. What is the recovery time for mammalian cones derived from ribbon-based models? There are estimates from membrane capacitance studies. Ground squirrel cones take 0.7 to 1 sec to recover the ultrafast, primed pool of SVs when probed with a paired-pulse protocol (Grabner ...DeVries 2016, Neuron). Their off-bpcs take anywhere from under 0.2 sec to a second to recover, which is a combination of many synaptic factors (Grabner ...DeVries Nat Comm 2023). Rod On bpcs take over a second (Singer Diamond 2006, reviewed Wan and Heidelberger 2011). In Figure 7B, the recovery time is ~150 ms for the responses measured at rgcs. This brief recovery time is incompatible with existing ribbon models of release. Whole-cell membrane capacitance measurements would be helpful here.

      Thanks for drawing our attention to this issue. Indeed, we see a relatively rapid recovery in the paired-flash experiments. We now discuss this recovery time in the context of past measurements of recovery of responses in cones and bipolar cells (paragraph starting on line 773). There are many factors that could contribute to the relatively rapid recovery we observe - including synaptic factors such as those highlighted by Grabner et al., (2016) either at the cone-to-bipolar synapses or the bipolar-to-RGC synapses. We are certainly interested in a more detailed understanding of this issue, but the additional experiments are outside the scope of this paper.  

      Experimental Suggestion: Add GABA blockers and see if type 5i bpc responds with more release (GluSniff) and prolonged [Ca2+] intra (GCaMP). Compare this to type 6 bpc behavior with GABA/gly blockers. This will rule in or out whether feedback inhibition is involved. 

      Figure 7 in the revised manuscript includes two new experiments examining glutamate release (without the simultaneous measurement of bipolar cell intracellular calcium) while blocking (1) all/most amacrine cell-mediated inhibition via inclusion of NBQX in the bath solution, and (2) blocking spiking amacrine cells via inclusion of TTX in the bath solution. The transient vs sustained difference in light-evoked glutamate release around ON-T and ON-S RGC dendrites remained with amacrine activity suppressed. These new results are consistent with the anatomical and pharmacological data that were included in the initial submission of the manuscript (Fig. 5) that indicate presynaptic inhibition does not have a major role in shaping release kinetics at these synapses. 

      Reviewer #3 (Public Review): 

      Summary: 

      Different types of retinal ganglion cell (RGC) have different temporal properties - most prominently a distinction between sustained vs. transient responses to contrast. This has been well established in multiple species, including mice. In general, RGCs with dendrites that stratify close to the ganglion cell layer (GCL) are sustained; whereas those that stratify near the middle of the inner plexiform layer (IPL) are transient. This difference in RGC spiking responses aligns with similar differences in excitatory synaptic currents as well as with differences in glutamate release in the respective layers - shown previously and here, with a glutamate sensor (iGluSnFR) expressed in the RGCs of interest. Differences in glutamate release were not explained by differences in the distinct presynaptic bipolar cells' voltage responses, which were quite similar to one another. Rather, the difference in transient vs. sustained responses seems to emerge at the bipolar cell axon terminals in the form of glutamate release. This difference in the temporal pattern of glutamate release was correlated with differences in the size of synaptic ribbons (larger in the bipolar cells with more sustained responses), which also correlated with a greater number of vesicles in the vicinity of the larger ribbons. 

      The main conclusion of the study relates to a correlation (because it is difficult to manipulate ribbon size or vesicle density experimentally): the bipolar cells with increased ribbon size/vesicle number would have a greater possibility of sustained release, which would be reflected in the postsynaptic RGC synaptic currents and RGC firing rates. This model proposes a mechanism for temporal channels that is independent of synaptic inhibition. Indeed, some experiments in the paper suggest that inhibition cannot explain the transient nature of glutamate release onto one of the RGC types. Still, it is surprising that such a diverse set of inhibitory interneurons in the retina would not play some role in diversifying the temporal properties of RGC responses. 

      Strengths: 

      (1) The study uses a systematic approach to evaluating temporal properties of retinal ganglion cell (RGC) spiking outputs, excitatory synaptic inputs, presynaptic voltage responses, and presynaptic glutamate release. The combination of these experiments demonstrates an important step in the conversion from voltage to glutamate release in shaping response dynamics in RGCs. 

      (2) The study uses a combination of electrophysiology, two-photon imaging, and scanning block-face EM to build a quantitative and coherent story about specific retinal circuits and their functional properties. 

      Weaknesses: 

      (1) There were some interesting aspects of the study that were not completely resolved, and resolving some of these issues may go beyond the current study. For example, it was interesting that different extracellular media (Ames medium vs. ACSF) generated different degrees of transient vs. sustained responses in RGCs, but it was unclear how these media might have impacted ion channels at different levels of the circuit that could explain the effects on temporal tuning.

      We do not have an explanation for the quantitative differences in response kinetics we observed in Ames’ medium vs. ACSF. There are modest differences in calcium and magnesium concentration and a larger difference in potassium (2.5 mM in ACSF vs 3.6 mM in Ames). It would be interesting to test which of these (or other) differences accounts for the difference in response kinetics.
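
      As a rough guide to the size of the potassium difference alone, the Nernst equation gives the shift in the potassium equilibrium potential when external potassium changes from 2.5 to 3.6 mM. The sketch below is a back-of-the-envelope illustration, not an analysis from the paper: the 32 °C bath temperature is an assumption, and the calculation treats the membrane as ideally potassium-selective, which real neurons are not.

```python
import math

R = 8.314462618   # gas constant, J / (mol K)
F = 96485.33212   # Faraday constant, C / mol


def nernst_shift_mV(c_out_a, c_out_b, temp_c=32.0, valence=1):
    """Change in Nernst potential (mV) when the external concentration
    goes from c_out_a to c_out_b. The internal concentration cancels,
    so it does not need to be known. temp_c is an assumed bath
    temperature, not a value taken from the text."""
    t_kelvin = temp_c + 273.15
    return 1000.0 * R * t_kelvin / (valence * F) * math.log(c_out_b / c_out_a)


# External potassium: 2.5 mM in ACSF versus 3.6 mM in Ames' medium.
shift_mV = nernst_shift_mV(2.5, 3.6)
```

      Under these assumptions the potassium equilibrium potential is roughly 10 mV more depolarized in Ames' medium than in ACSF. The actual effect on resting potential would be smaller, since other conductances also contribute.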

      (2) It was surprising that inhibition played such a small role in generating temporal tuning. At the same time, there were some gaps in the investigation of inhibition (e.g., IPSCs were not measured in either of the RGC types; pharmacology was used to investigate responses only in the transient RGCs).

      We were also surprised at this result. We have included additional data on inhibition in the revised manuscript. Figure S3 shows light-evoked IPSC data from both RGC types, and Fig. 7 shows additional iGluSnFR measurements around both ON-T and ON-S RGC dendrites with inhibition blocked via bath application of NBQX and, separately, with inhibition from spiking amacrine cells blocked with TTX. These experiments provide additional evidence for the small role of inhibition. We attempted to measure the kinetics of excitatory input to ON-S cells with inhibition blocked, but we found that the excitatory input showed strong spontaneous oscillations under these conditions and the light responses were changed so drastically that we did not feel we could make a clear comparison with control conditions.

      (3) There could be additional discussion and references to the literature describing several topics, including: temporal dynamics of glutamate release at different levels of the IPL; previous evidence that release sites from a single presynaptic neuron can differ in their temporal properties depending on the postsynaptic target; previous investigations of the role of inhibition in temporal tuning within retinal circuitry. 

      Thanks, we have included more discussion and references to the relevant literature as you have suggested in the recommendations to authors.

      Reviewer #1 (Recommendations For The Authors): 

      The presented raw data of the pharmacological experiments show that SR95531 and TPMPA robustly increased both the amplitude and duration of the transient component of the light step-evoked excitatory currents, with slight, if any, enhancement of the sustained component in ON-T RGCs (Figure 5C). Statistical analysis of the population data (n=5) with Wilcoxon signed rank test yielded no significant difference (line 363). However, reanalyzing the data extracted from the graph (Figure 5D) revealed that the difference between the paired observations is normally distributed (Shapiro-Wilk normality test, P=0.48) allowing parametric statistics to be used, which provides higher statistical power. Accordingly, reanalyzing the presented data with paired Student's t-test revealed significant differences (P=0.01) in the steady-state amplitude normalized to that of the peak, recorded in the presence of SR95531 and TPMPA. In other words, based on the (rough) analysis of the presented pharmacology data GABAergic feedback inhibition significantly contributes to shaping the transient portion of the light-evoked excitatory currents in ON-T RGCs, by making it more transient. I believe a similar analysis based on the actual data is necessary, and the results should be communicated either way. However, if warranted, two-photon glutamate sensor imaging experiments showing that blocking GABA- and glycinergic inhibition does not change the kinetics of light-evoked glutamate signals at ON-T RGCs should also be performed, as these would be critical in drawing a conclusion regarding the effect of feedback inhibition on glutamate release from bipolar cells.

      Thanks for this feedback. We have added another cell to the data set in Fig. 5D. With this addition, SR95531/TPMPA application significantly increases the response transience of excitatory currents measured in ON-T RGCs compared to control. This enhanced transience in GABA<sub>A/C</sub> receptor blockers is due to an increase in the amplitude of the initial peak component of the response (control peak amplitude: -833.7±103.3 pA; SR95531+TPMPA peak amplitude: -2023±372.7 pA; p=0.03, Wilcoxon signed rank test), with no change to the later sustained component (control plateau amplitude: -200.7±14.71 pA; SR95531+TPMPA plateau amplitude: -290.9±43.69 pA; p=0.15, Wilcoxon signed rank test).
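
      The sample sizes in this exchange matter for the Wilcoxon signed rank test in a specific way: with n paired observations, an exact two-sided test cannot return a p-value below 2/2^n, even when every cell changes in the same direction. The sketch below enumerates all sign patterns to show this. It assumes an exact test with no zero differences and no tied magnitudes, and the paired differences are invented placeholders, not the recorded data.

```python
from itertools import product


def exact_wilcoxon_two_sided(diffs):
    """Exact two-sided p-value for the Wilcoxon signed rank test.
    Assumes no zero differences and no ties among |differences|."""
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    total = n * (n + 1) // 2
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_obs = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    count = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= w_obs:
            count += 1
    return count / 2 ** n


# Hypothetical paired differences (drug minus control), all in one direction.
diffs_n5 = [-0.05, -0.08, -0.11, -0.02, -0.14]
diffs_n6 = diffs_n5 + [-0.07]

p_n5 = exact_wilcoxon_two_sided(diffs_n5)   # 2/32 = 0.0625
p_n6 = exact_wilcoxon_two_sided(diffs_n6)   # 2/64 = 0.03125
```

      Under these assumptions, five cells can never reach p < 0.05 with this test, while six cells that all change in the same direction give p ≈ 0.03. This may be why adding a single cell moved the comparison across the significance threshold.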

      We should clarify that this result indicates that GABAergic inhibition makes the excitatory inputs to ON-T RGCs less transient. Block of GABA receptors increased transience, thus intact GABAergic transmission appears to limit the initial peak of the response and therefore make excitatory currents more sustained. We unfortunately were not able to examine whether sustained excitatory currents in ON-S RGCs would become more transient using the same approach. In our hands, bath application of SR95531+TPMPA led to the generation of large-amplitude (>1nA) oscillatory bursts of excitatory input that developed within 5 minutes and persisted for the duration of the incubation (up to ~30 min) in drugs. Further, presentation of light steps tended to induce variable amplitude responses, likely dependent on the presence of spontaneous bursts; when large amplitude responses were evoked, these typically oscillated for several seconds after the step.

      To examine a potential role for presynaptic inhibition in transient vs. sustained bipolar cell output, we therefore chose to eliminate amacrine cell-mediated inhibition by bath application of the AMPA/kainate receptor antagonist NBQX in additional iGluSnFR measurements. This manipulation should leave ON bipolar cell responses intact while eliminating most amacrine cell-mediated responses (and OFF bipolar cell driven responses). In separate experiments, we also eliminated inhibition from spiking amacrine cells by bath application of TTX. As shown in new Fig. 7, sustained and transient responses persisted in distal versus proximal RGC dendrites, respectively. Unlike SR95531/TPMPA, bath application of NBQX was not associated with spontaneous bursts of glutamate release around ON-S dendrites. These results show that amacrine cell-mediated inhibition is not required for either sustained or transient glutamate release from bipolar cells that provide input to ON-S and ON-T RGCs.

      Small points: 

      (1) The legend of Figure 1 (D) refers to shaded areas to show ± SEM, but no shade is visible (at least in my printout).

      The SEM shading is there in Fig. 1D but is mostly obscured by the mean lines for the respective RGC types. We have added this to the figure caption.

      (2) I found the reported Vrest for the ON bipolar cells somewhat depolarized. Perhaps due to the uncompensated junction potentials? 

      These measurements are indeed not corrected for the liquid junction potential (which is approximately -10.8 mV between K-gluconate internal and Ames’ solution). We did not apply this correction since the appropriate value is not clear in perforated patch recordings as the intracellular chloride concentration is unknown (and can differ from that in the pipette solution). We have clarified this in the results text where we describe the Vrest values (lines 335-338).

      (3) It is Wilcoxon signed rank test, not Wilcoxan. 

      Thanks for catching this. This has been corrected in the revised manuscript.

      Reviewer #2 (Recommendations For The Authors): 

      Some amacrines express vesicular Glut-3 transporter and are reported to release glutamate (Marshak, Vis Neurosci 2016). Are Amacrine vGlut3 signals postsynaptic (within ~0.5 um) to cone bpc ribbons?

      We did not characterize VgluT3-expressing amacrine cells in our SEM datasets. A recent study by Friedrichson et al. (Nat. Comm. 2024; PMID 38580652) using 3D SEM reconstructions found that Vglut3-amacrines are postsynaptic to both type 5i and type 6 bipolar cells, as well as other type 5/xbc bipolar cells (and receive >50% of their input from type 3a OFF bipolar cells).

      How far apart are the postsynaptic targets from the ribbon release sites? The ribbons at type 5i bpc/On-T input appear separated from the dendrites of On-T rgcs (Figure 8C). At least further away than the type 6 bpc ribbons are from On-S rgc dendrites (Figure 8C). Distance may create a thresholding phenomenon, whereby only multivesicular bouts at the onset of depolarization are able to elevate synaptic Glu to levels needed to activate On-T GluRs. See Grabner et al Nat Comm 2023 for such scenarios in the outer retina.

      This is an intriguing possibility, but we should point out that the presynaptic ribbons in Fig. 9C (former Fig. 8C) are similar distances (within the resolution of our reconstructions) from the ON-T and ON-S dendrites. We have increased the brightness of the dendrite segments for both RGC types in the resubmission figure; note that ON-T RGCs have spine-like protrusions that may not have been as apparent in the previously submitted version of our manuscript.

      In Figures 1 and 2, Sustained responses look like the derivative of Transient responses, minus the negative going inflection. In addition, the sustained responses appear to have a lower threshold of activation than the transient On rgcs, because there are more bouts of action potentials (and membrane depol in V-clamp) with earlier onset in sustained than transients traces. It would be great if the GLuSniff data captured these differences. Take cumulative dF/F and see what the onset time is, or an initial tau if possible.

      This is a good suggestion. However, we are reluctant to make detailed quantitative comparisons such as this without further validation of how the kinetics of the iGluSnFR signals relate to kinetics of glutamate release.  A specific concern is that differences in the location and amount of iGluSnFR expression could impact any such comparisons.

      A recent study by Kim et al von Gersdorff (Cell Reports, 2023) presents interesting phases of release in response to light flashes, measured from AIIs, and complementary results from pairs of rbcs-AIIs. The findings highlight the complexity of SV pools under well-controlled experiments. Could their results be explained as variations in rbc ribbon size through development, and possibly between rbcs or within an rbc? 

      This certainly seems possible and would be consistent with the dependence of release on ribbon size that our results support.  It would be interesting to see if there are clear anatomical correlates of that change in release properties.  

      Figure 5 is a pivotal point in the study, but my review has identified numerous weaknesses. The feedback inhibition onto bipolar cell terminals is likely to sculpt glutamate release, and the results do not convincingly rule out this possibility. The suggestions for improvements range from the data needing to be reanalyzed with regard to statistical tests, and/or adding a few more data points (n = 5) before concluding a p: 0.06 is insignificant. 

      We have added an additional recording to this data set. With n = 6 cells, there is now a statistically significant difference between ON-T RGC excitatory currents measured in control conditions versus during GABA<sub>A/C</sub> receptor blockade. Please note that all the recordings shown in Figure 5C-F are from ON-T RGCs (the two panels separately show block of GABAergic and glycinergic receptors). We did not make it sufficiently clear that the original trend (now statistically significant) is opposite to that expected if presynaptic GABAergic inhibition contributes to response transience in ON-T RGCs. What we see is that excitatory synaptic inputs to ON-T RGCs become more transient (rather than more sustained) during GABA<sub>A/C</sub> receptor blockade. We have revised the text in that section to make this point more clearly.

      We have also included new data from iGluSnFR measurements showing that bath application of NBQX does not affect light step-evoked glutamate release kinetics at proximal (sustained) or distal (transient) RGC dendrites (control: steady-state amp. as % of peak amp. 13 ± 10; mean ± S.D.; n = 189 ROIs/4 FOVs for ON-T dendrites vs 40 ± 12; mean ± S.D.; n = 287 ROIs/8 FOVs for ON-S dendrites; NBQX: 6 ± 3; mean ± S.D.; n = 112 ROIs/1 FOV for ON-T dendrites vs 23 ± 9; mean ± S.D.; n = 97 ROIs/2 FOVs for ON-S dendrites; *p<0.001). By blocking glutamate receptors on amacrine cells, NBQX (AMPA/KAR antagonist) eliminates all/most amacrine cell-mediated signaling in the retina and should therefore abolish presynaptic inhibitory input to bipolar cell terminals across the IPL. Taken together, our results indicate that presynaptic inhibition does not play a critical role in establishing transient versus sustained kinetics for the stimulus conditions we employed in our study.
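
      The metric quoted above (steady-state amplitude as a percentage of peak amplitude) can be illustrated with a minimal sketch. The function name, the 0.5 s averaging window, and the synthetic trace are assumptions made for illustration; the analysis windows actually used for the iGluSnFR data are not stated in this response.

```python
import math

def steady_state_percent_of_peak(trace, dt, onset, step_duration, window=0.5):
    """Steady-state amplitude as a percentage of peak amplitude.

    trace: baseline-subtracted dF/F samples
    dt: sample interval (s); onset: step onset (s); step_duration: step length (s)
    window: length (s) of the averaging window at the end of the step (assumed)
    """
    start = int(round(onset / dt))
    stop = int(round((onset + step_duration) / dt))
    response = trace[start:stop]
    peak = max(response)
    if peak <= 0:
        raise ValueError("no positive response during the step")
    n_tail = max(1, int(round(window / dt)))
    tail = response[-n_tail:]
    steady = sum(tail) / len(tail)
    return 100.0 * steady / peak

# Synthetic example: a response that decays from 1.0 to a plateau of 0.2.
dt = 0.01
pre = [0.0] * 100
step = [0.2 + 0.8 * math.exp(-(i * dt) / 0.1) for i in range(200)]
post = [0.0] * 100
trace = pre + step + post
percent = steady_state_percent_of_peak(trace, dt=dt, onset=1.0, step_duration=2.0)
```

      For this synthetic trace the value is close to 20, which on this scale sits between the ON-T and ON-S control values quoted above.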

      There is a need to cite more recent literature on bipolar cell ribbons (e.g. see Wakeham et al., Front. Cell. Neurosci., 2023), in order to support experimental design and interpretation of the results. The authors should discuss their Ribeye-KO data from Okawa et al 2019 Nat Comm, Figure 7, in the context of their new iGluSnFR results. 

      Thank you for prompting us on this issue. We have expanded the discussion regarding ribbons and included more citations to the ribbon literature. That is largely in the three paragraphs starting on line 727.

      One point deserves emphasis because it is central to the authors' ribbon model but not consistent with their data. The ribbon model as they put it, and as commonly stated, holds that a transient phase of release at the onset of depolarization indicates the depletion of the primed SVs, and the subsequent slower rate of release (steady state release in the authors' terms) reflects recruiting, priming, and release of new SVs. The On-transient dendrite GluSnf responses agree with this multiphasic process, but the sustained responses show only an elevation in glutamate without a pronounced initial peak, creating a square-wave-shaped response (Figure 6B). This does not agree with the simple ribbon-based release model. I would expect the signals from the T- and S-on dendrites to have a comparable initial phase, while the sustained phase should be greater in amplitude for the S-on dendrites. More discussion may clarify possible mechanisms.

      Thanks for pointing this out. The example iGluSnFR traces we originally included in the manuscript were not entirely representative in that they did not show much initial transient phase. Note there is a distribution of steady-state amplitudes for proximal dendrites in Fig. 6C; the examples are from ROIs from the upper end of the distribution. In the new Figure 7, we have included some additional examples that show both a clear transient and sustained component. The summary data in Figure 6C shows the distribution of sustained/transient ratios across ROIs.  

      Reviewer #3 (Recommendations For The Authors): 

      (1) It would be interesting to understand the differences in IPSCs in the two RGC types. Perhaps they are small in both types, which would explain their apparent lack of impact on temporal tuning. The authors may already have these data.

      We did make measurements of noise-evoked IPSCs (as well as EPSCs) in a subset of ON-T and ON-S recordings. We have now included this data as Figure S3. There are slight differences in the kinetics of inhibition between RGC types (Fig. S3C) and there is a trend towards stronger inhibition (relative to excitation) in ON-T RGCs compared to ON-S RGCs (Fig. S3E), although there is not a statistically significant difference. In both cases excitatory synaptic currents are as large or larger than inhibitory currents, and this does not include the difference in driving force near spike threshold which will favor excitatory input by a factor of 2-3.  Hence our data suggests that postsynaptic inhibition does not play a major role in generating the differential temporal spiking responses of ON-T and ON-S RGCs. However, additional experiments examining the relative contribution of excitation and inhibition to spiking output in these RGCs would be needed to reach a firm conclusion.

      The pharmacological experiments in which we blocked inhibition (Fig. 5C-F, new Fig. 7) were designed to test the effect of presynaptic inhibition on bipolar cell output (voltage-clamp isolation of excitatory currents in Fig. 5; iGluSnFR measurements of glutamate release in Fig. 7). We do not mean to suggest that postsynaptic inhibition does not have any role in shaping the spiking behavior of these RGC types, but that transient vs. sustained kinetics are already present in the bipolar cell output and that presynaptic inhibition of bipolar cell terminals does not appear to account for this difference.  We have revised the text throughout to be clearer on this point.

      (2) It could be convincing to show transient/sustained differences between RGC types in dim light, where the response would depend on the rod bipolar/AII circuit. In this case, any difference in temporal properties would presumably be explained by differences that localize to the cone bipolar cell axon terminals. Indeed, is that the result in Figure 1B? This seems to be a dim stimulus presented on darkness, which may be driven through the rod bipolar pathway. The authors could then discuss the interpretation of this data in terms of the rod bipolar circuit. 

      Yes, Figure 1B is a dim light step (~30 R*/rod/s) presented from darkness, and the distinction between cells remains clear at still lower light levels that more effectively isolate signaling through the rod bipolar pathway. Thanks for making the point that distinct temporal responses under scotopic conditions suggest that these differences must arise at and/or downstream of cone bipolar cell output. We have included additional text (lines 361-365) in the results describing bipolar cell responses that raises this point.

      (3) Glutamate release was already measured across the full IPL depth by Borghuis et al. (2013) and Franke et al. (2017). It would be appropriate to better motivate the current study based on these existing measurements.

      We have clarified that these studies provided important motivation for measuring excitatory synaptic input to ON-T vs. ON-S RGCs (lines 165-169).

      (4) Line 212/213. It would be appropriate to add to the list of papers showing the different stratification of transient vs. sustained responses: Borghuis et al. (2013) and Beaudoin et al. (2019).

      Thank you - these references have been added.  

      (5) Line 635-638. It would be useful to discuss papers by Pottackal et al. (2020, 2021), which suggested that a single presynaptic cell (starburst) can signal with different temporal properties depending on the postsynaptic target (other starburst vs. DSGCs). The mechanism was not completely resolved (i.e., it was not explained by differences in presynaptic Ca channels at the two synapse types), but it at least shows that neurotransmitter release can show different filtering depending on the postsynaptic target from the same presynaptic neuron. (This could also be at play for the type 6 bipolar cell inputs to ON-S vs. ON-T RGCs in the present study.)

      We have added a reference to Pottackal et al 2021 in this section.

      (6) Line 714. Should describe the procedure for embedding the tissue in agarose. 

      We have added more detail regarding agarose embedding for preparation of retinal slices in the methods.

      (7) Line 775. Need a better description of the virus (not the construct), what serotype? Provide the Addgene number if available. 

      This has been added to the methods.

      (8) Line 808. Was the SD for the gaussian really 50%? That would cut off a lot of the distribution, i.e., it would get clipped at 0. 

      Yes, the SD for Gaussian noise was 50%. This high contrast stimulus was used in part to achieve measurable signals from bipolar cells. You are correct that some of the distribution was clipped at 0 (it was also clipped at twice the mean to make sure that the distribution remained symmetrical). The clipping was accounted for during our LN analyses.
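
      The clipping described above can be sketched as follows. This is only an illustration under stated assumptions: intensities are in arbitrary units with a mean of 1.0, samples are drawn independently, and the function name and sample count are hypothetical rather than taken from the stimulus code used in the study.

```python
import random

def clipped_gaussian_noise(mean, sd_fraction=0.5, n=10000, seed=0):
    """Draw n intensities from a Gaussian with SD = sd_fraction * mean,
    clipped symmetrically to the range [0, 2 * mean]."""
    rng = random.Random(seed)
    sd = sd_fraction * mean
    samples = []
    for _ in range(n):
        value = rng.gauss(mean, sd)
        # Clip at 0 (intensity cannot be negative) and at twice the mean,
        # so the clipped distribution stays symmetric about the mean.
        samples.append(min(max(value, 0.0), 2.0 * mean))
    return samples

mean_level = 1.0  # arbitrary units (assumption)
stimulus = clipped_gaussian_noise(mean=mean_level)
clipped_fraction = sum(
    v == 0.0 or v == 2.0 * mean_level for v in stimulus
) / len(stimulus)
sample_mean = sum(stimulus) / len(stimulus)
```

      With an SD of 50% of the mean, both clip points lie two standard deviations from the mean, so roughly 5% of samples are expected to be clipped and the sample mean stays near the nominal mean.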

      (9) The paper should discuss Swygart et al. (2024) results showing different spatial surround properties of neighboring synapses from a type 6 bipolar cell. Based on this result, it would seem very likely that amacrine cells could play a role in shaping the temporal processing of bipolar cell glutamate release as well. Indeed, spatial and temporal processing will not be completely independent in a typical experiment. For example, with the spot stimulus used in the present study, bipolar cells within the center versus the edge of the spot will have different balances of center/surround activation, which could potentially influence their temporal processing.

      We have included discussion of results from Swygart et al 2024 in the section of the Discussion in which we point out differences in surround inhibition between ON-S and ON-T RGCs (lines 710-714). We agree that spatial and temporal processing are not completely independent. Our results with SR95531/TPMPA indicate ON-T RGCs receive stronger GABAergic surround inhibition than ON-S RGCs (Fig. S8). However, our results in Fig. 5C-D show GABAergic surround inhibition makes ON-T excitation more sustained rather than more transient. So even though bipolar cells presynaptic to ON-T RGCs receive stronger surround inhibition (Fig. S8), this inhibition does not establish the transient kinetics of glutamate release from these bipolar cells (in fact, it works to make release more sustained). Additional iGluSnFR experiments where we used NBQX to block all/most amacrine cell-mediated responses also suggest presynaptic inhibition does not have an important role in establishing differential glutamate release kinetics onto ON-S vs. ON-T RGC dendrites (Fig. 7).

      (10) Cui et al. 2016 described ON-S Alpha as having a divisive suppression mechanism that explained the temporal properties of white-noise response better than a standard LN model. Do the authors think the divisive suppression reflects a property of the excitatory synapses independent of inhibition?

      This is an interesting question, but one for which we don’t have a good answer for now. As mentioned in some of the above responses and as we have tried to clarify in the manuscript, we do not mean to imply that there is no role for presynaptic inhibition in modulating bipolar cell output, including for the divisive suppression described by Cui et al. Rather, our point is that the distinction between transient and sustained excitatory input to ON-T and ON-S RGCs does not require presynaptic inhibition and is more likely an intrinsic property of the bipolar cell synapses. 

      (11) Do the authors mean to imply that the pool size at bipolar cell ribbon synapses could depend on the use of Ames vs. ACSF? 

      For now, we do not have a good answer as to why there are quantitative differences in response kinetics between Ames and ACSF. We have not done any experiments to investigate whether ribbon sizes or ribbon pools are different in the different solutions.

      (12) More generally, different mean luminance levels could drive different levels of baseline glutamate release, which could alter the available pool of vesicles at bipolar cell ribbon synapses. Can we explain varying degrees of transient/sustained in the same cell at different levels of mean luminance based on this mechanism (e.g., Grimes et al., 2014)?

      Yes, the emergence of a transient component of excitatory input to ON-S RGCs at ~100 R*/rod/s versus at scotopic levels (0.5 R*/rod/s) in Grimes et al. (2014) could be due to differences in the number of releasable vesicles (due to different type 6 bipolar cell axon terminal membrane potentials and hence differences in spontaneous release rates) at the different light levels.

      We should note that although ON-T and ON-S RGCs exhibit some changes in transient/sustained kinetics across different light levels, the relative differences between these RGC types are preserved across light levels. We have included a statement about this in the text (lines 361-367).

      (13) Figure 1. Have the authors considered performing the LN analysis of the firing responses, to compare the degree of rectification between the two RGC types?

      This is a good suggestion. From an LN analysis of spiking responses, we do not observe a clear difference between the static nonlinearity component of the model for ON-T and ON-S RGCs. Both RGC types are strongly rectified under our experimental conditions.

      (14) Figure 5. Do the authors have the pharmacology data for the ON-S cells? There are examples of sustained EPSCs in amacrine cells that become more transient after blocking inhibition, which at least suggests that inhibition can play some role in the transient/sustained nature of glutamate release (Park et al., 2015, Figure 3). Perhaps ON-S cells likewise become more transient with inhibition blocked. 

      (The colored symbols in A were not visible in a printout. It would be useful to indicate the cell type (ON-T) in C and E). 

      As described above in the response to reviewer 1’s recommendation for authors, we were not able to use SR95531/TPMPA for recordings from ON-S RGCs. Bath application of these drugs led to oscillatory bursts of excitatory input to ON-S RGCs. However, the lack of effect of bath-applied NBQX on the kinetics of glutamate release around either ON-T or ON-S RGC dendrites (new Fig. 7) suggests that presynaptic inhibition does not contribute to generating sustained excitation to ON-S RGCs (or transient excitation to ON-T RGCs).  

      We have corrected Fig. 5A to include the referenced colored symbols and have also edited Fig 5C and E to clarify that measurements in Fig. 5C-F are from ON-T RGCs.

      (15) Figure 6 legend. Should be Kcng4-Cre, not KCNG-Cre. Also, it should make clear that this is cre-dependent expression of iGluSnFR. For C, were the statistics based on the number of FOVs? 

      Thanks for catching this; we have corrected the Figure 6 legend. The methods section includes a description of how we achieved iGluSnFR expression on alpha RGC dendrites via a cre-dependent viral strategy in Kcng4-Cre mice. We have also clarified that the statistics are based on ROIs in Figure 6C.

      (16) Figure 7, Flashes were apparently 400% contrast on a dim background. What was the background? Is there a rod component to the response in this case? 

      In Figure 7 (now Figure 8), the same background (~3300 R*/rod/s; 2000 P*/Scone/s) was used as in the Gaussian noise and step response experiments. At this light level, the response should primarily be mediated by cones.

      (17) Figure S1. The colors here differ from those in previous figures (Here, ON-T, magenta; ON-S, cyan). Is something mislabeled? 

      Thanks for catching this. We mistakenly swapped the labels in the legend for Fig. S1. The figure colors were correct, but we have corrected the legend in the revised manuscript.

      (18) Figure S2. For the LN model for RGC synaptic currents, the ON-S are more rectified than some previous recordings (Cui et al., 2016). Is this perhaps explained by different light levels?

      We aren’t sure why ON-S excitatory currents are more strongly rectified in our recordings compared to Cui et al., 2016. Cui et al. used an ~20-fold higher background light intensity (~40,000 P*/cone/s vs. ~2000 P*/cone/s in our study), so different light levels may be a factor (although we should point out that rectification increases in these RGCs between scotopic and low photopic light levels; see Grimes et al., 2014 and Kuo et al., 2016).

      (19) The study is apparently comparing PV1 and PV2 described in Farrow et al. (2013; see Supplementary information for stratification analysis), which should be cited.

      Thanks, we have corrected this oversight in the revised manuscript. We now cite Farrow et al and mention the connection to PV1 and PV2 in the first paragraph of Results (lines 104-108).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This study evaluates whether species can shift geographically, temporally, or both ways in response to climate change. It also teases out the relative importance of geographic context, temperature variability, and functional traits in predicting the shifts. The study system is large occurrence datasets for dragonflies and damselflies split between two time periods and two continents. Results indicate that more species exhibited both shifts than one or the other or neither, and that geographic context and temp variability were more influential than traits. The results have implications for future analyses (e.g. incorporating habitat availability) and for choosing winner and loser species under climate change. The methodology would be useful for other taxa and study regions with strong community/citizen science and extensive occurrence data.

      We thank Reviewer 1 for their time and expertise in reviewing our study. The suggestions are very helpful and will improve the quality of our manuscript.

      Strengths:

      This is an organized and well-written paper that builds on a popular topic and moves it forward. It has the right idea and approach, and the results are useful answers to the predictions and for conservation planning (i.e. identifying climate winners and losers). There is technical proficiency and analytical rigor driven by an understanding of the data and its limitations.

      We thank Reviewer 1 for this assessment.

      Weaknesses:

      (1) The habitat classifications (Table S3) are often wrong. "Both" is overused. In North America, for example, Anax junius, Cordulia shurtleffii, Epitheca cynosura, Erythemis simplicicollis, Libellula pulchella, Pachydiplax longipennis, Pantala flavescens, Perithemis tenera, Ischnura posita, the Lestes species, and several Enallagma species are not lotic breeding. These species rarely occur let alone successfully reproduce at lotic sites. Other species are arguably "both", like Rhionaeschna multicolor which is mostly lentic. Not saying this would have altered the conclusions, but it may have exacerbated the weak trait effects.

      We thank the reviewer for their expertise on this topic. We obtained these habitat classifications from field guides and trait databases, and reviewed our primary sources to clarify the trait classifications. We reclassified the species according to the expertise of this reviewer and performed our analysis again; please see details below.

      (2) The conservative spatial resolution (100 x 100 km) limits the analysis to wide-ranging and generalist species. There's no rationale given, so not sure if this was by design or necessity, but it limits the number of analyzable species and potentially changes the inference.

      It is really helpful to have the opportunity to contextualize study design decisions like this one, and we thank the reviewer for the query. Sampling intensity is always a meaningful issue in research conducted at this scale, and we addressed it head-on in this work.

      Very small quadrats covering massive geographical areas will be critically and increasingly afflicted by sampling weaknesses, as well as creating a potentially large problem with pseudoreplication. There is no simple solution to this problem. It would be possible to create interpolated predictions of species’ distributions using Species Distribution Models, Joint Species Distribution Models, or various kinds of Occupancy Models. None of these approaches then leads to analyses that rely on directly observed patterns. Instead, they are extrapolations, and those extrapolations typically fail when tested, although they have still been tested (for example, papers by Lee-Yaw demonstrate that it is rare for SDMs to predict things well; occupancy models often perform less well than SDMs and do not capture how things change over time - Briscoe et al. 2021, Global Change Biology). The result of employing such techniques would certainly be to make all conclusions speculative, rather than directly observable. 

      Rather than employing extrapolative models, we relied on transparent techniques that are used successfully in the core macroecology literature that address spatial variation in sampling explicitly and simply. Moreover, we constructed extensive null models that show that range and phenology changes, respectively, are contrary to expectations that arise from sampling difference. 100km quadrats make for a reasonable “middle-ground” in terms of the effects of sampling, and we added a reference to the methods section to clarify this (see details below).

      (3) The objective includes a prediction about generalists vs specialists (L99-103) yet there is no further mention of this dichotomy in the abstract, methods, results, or discussion.

      Thank you for pointing this out - it is an editing error that should have been resolved prior to submission. We replaced the terms specialist and generalist with specific predictions based on traits (see details below).

      (4) Key references were overlooked or dismissed, like in the new edition of Dragonflies & Damselflies model organisms book, especially chapters 24 and 27.

      We thank Reviewer 1 for making us aware of this excellent reference. We have reviewed the text and include it as a reference, in addition to other references recommended by Reviewer 1 and other reviewers (see details below).

      Reviewer #2 (Public review):

      Summary:

      This paper explores a highly interesting question regarding how species migration success relates to phenology shifts, and it finds a positive relationship. The findings are significant, and the strength of the evidence is solid. However, there are substantial issues with the writing, presentation, and analyses that need to be addressed. First, I disagree with the conclusion that species that don't migrate are "losers" - some species might not migrate simply because they have broad climatic niches and are less sensitive to climate change. Second, the results concerning species' southern range limits could provide valuable insights. These could be used to assess whether sampling bias has influenced the results. If species are truly migrating, we should observe northward shifts in their southern range limits. However, if this is an artifact of increased sampling over time, we would expect broader distributions both north and south. Finally, Figure 1 is missing panel B, which needs to be addressed.

      We thank Reviewer 2 for their time and expertise in reviewing our study.

      It is possible that some species with broad niches may not need to migrate, although in general failing to move with climate change is considered an indicator of “climate debt”, signaling that a species may be of concern for conservation (ex. Duchenne et al. 2021, Ecology Letters). We revised the discussion to acknowledge potential differences in outcomes (please see details below).

      We used null models to test whether our results regarding range shifts were robust, and if they varied due to increased sampling over time. We found that observed northern range limit shifts are not consistent with expectations derived from changes in sampling intensity (Figure S1, S2). 
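
      To illustrate the general logic of testing a range-limit shift against sampling intensity, a generic permutation sketch is given here. It is not the null model reported in Figures S1 and S2; the pooling-and-reshuffling scheme, the definition of the northern limit as the mean of the ten most northerly records, and the synthetic latitudes are all assumptions made for illustration.

```python
import random

def northern_limit(latitudes, n_top=10):
    """Mean latitude of the n_top most northerly records (assumed definition)."""
    top = sorted(latitudes, reverse=True)[:n_top]
    return sum(top) / len(top)

def null_shifts(early, late, n_iter=500, n_top=10, seed=0):
    """Shifts expected if both periods sampled the same distribution.

    Records are pooled and randomly reassigned to the two periods while
    keeping each period's sample size, so any shift in the null reflects
    unequal sampling effort alone.
    """
    rng = random.Random(seed)
    pooled = list(early) + list(late)
    shifts = []
    for _ in range(n_iter):
        rng.shuffle(pooled)
        fake_early = pooled[:len(early)]
        fake_late = pooled[len(early):]
        shifts.append(
            northern_limit(fake_late, n_top) - northern_limit(fake_early, n_top)
        )
    return shifts

# Synthetic data: four times more records in the later period,
# with a true 3-degree northward displacement.
rng = random.Random(1)
early = [rng.uniform(40.0, 50.0) for _ in range(50)]
late = [rng.uniform(43.0, 53.0) for _ in range(200)]
observed = northern_limit(late) - northern_limit(early)
null = null_shifts(early, late)
p_value = sum(s >= observed for s in null) / len(null)
```

      In this toy case the null distribution is centred on a small positive shift, which is the part attributable to heavier sampling in the later period, and the observed shift falls well outside it.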

      We thank Reviewer 2 for pointing out this error in Figure 1. This conceptual figure was a challenge to construct, as it must illustrate how phenology and range shifts can occur simultaneously or uniquely to enable a hypothetical odonate to track its thermal niche over time. In a previous version of the figure, we had a second panel and we failed to remove the reference to that panel when we simplified the figure. We have updated the figure and figure caption (please see details below).

      Reviewer #3 (Public review):

      Summary:

      In their article "Range geographies, not functional traits, explain convergent range and phenology shifts under climate change," the authors rigorously investigate the temporal shifts in odonate species and their potential predictors. Specifically, they examine whether species shift their geographic ranges poleward or alter their phenology to avoid extreme conditions. Leveraging opportunistic observations of European and North American odonates, they find that species showing significant range shifts also exhibited earlier phenological shifts. Considering a broad range of potential predictors, their results reveal that geographical factors, but not functional traits, are associated with these shifts.

      We thank Reviewer 3 for their expertise and the time they spent reviewing our study. Their suggestions are very helpful and will improve the quality of our manuscript.

      Strengths:

      The article addresses an important topic in ecology and conservation that is particularly timely in the face of reports of substantial insect declines in North America and Europe over the past decades. Through data integration the authors leverage the rich natural history record for odonates, broadening the taxonomic scope of analyses of temporal trends in phenology and distribution to this taxon. The combination of phenological and range shifts in one framework presents an elegant way to reconcile previous findings improving our understanding of the drivers of biodiversity loss.

      We thank Reviewer 3 for this assessment.

      Weaknesses:

      The introduction and discussion of the article would benefit from a stronger contextualization of recent studies on biological responses to climate change and the underpinning mechanism.

      The presentation of the results (particularly in figures) should be improved to address the integrative character of the work and help readers extract the main results. While the writing of the article is generally good, particularly the captions and results contain many inconsistencies and lack important detail. With the multitude of the relationships that were tested (the influence of traits) the article needs more coherence.

      We thank Reviewer 3 for these suggestions. We revised the introduction and discussion to better contextualize species’ responses to climate change and the mechanisms behind them (see details below). We carefully reviewed all figures and captions, and made changes to improve the clarity of the text and the presentation of results (see details below).

      Reviewer #1 (Recommendations for the authors):

      Comment:

      (1) Following weakness #1 in the public review, the authors should review the habitat classifications, consult with an odonatologist, and reclassify many species from Both to Lentic and redo the analysis.

      Thank you for pointing out this disagreement among expert habitat classifications that we cited and other literature. We reclassified species’ habitat preferences based on classifications by Hof et al., a source that was consistent with your suggestions, and identified additional species as Lentic that our other references had identified as Both. We performed our analysis with this new dataset and, as you suspected, our results did not change qualitatively: species habitat preferences did not predict their range shifts.

      Hof, Christian, Martin Brändle, and Roland Brandl. "Lentic odonates have larger and more northern ranges than lotic species." Journal of Biogeography 33.1 (2006): 63-70.

      Comment:

      (2) Following weakness #2, would it be worthwhile or interesting to analyze a smaller ranging group (e.g. cut the quad size in half, 50 x 50 km) to bring in more species and potentially change the inference? Or is the paper too tightly constructed to allow this, even as a secondary piece?

      Thank you for this comment, as it highlights an important consideration for macroecological analyses, and the importance of balancing multiple factors when determining quadrat size. Issues exist with identifying drivers of range boundaries among species with narrow ranges when they are analyzed separately from wide-ranging species, and examining larger quadrats can actually help clarify drivers (Szabo, Algar, and Kerr 2009). The smaller the quadrats are, the higher the likelihood that a species is actually present but was never observed, or that the quadrat only covers unsuitable habitat and the species is absent from the entire (or almost entire) quadrat. Too many absences can violate model assumptions and create noise that makes it difficult to identify drivers of species’ range and phenology shifts.

      Moreover, we constructed extensive null models that show that range and phenology changes, respectively, are contrary to expectations that arise from sampling differences. 100 km quadrats make for a reasonable “middle-ground”, and we have included a brief explanation of this in the text: “We assigned species presences to 100×100 km quadrats, a scale that is large enough to maintain adequate sampling intensity but still relevant to conservation and policy (Soroye et al., 2020), to identify the best sampled species.” (Lines 170-172).

      Szabo, Nora D., Adam C. Algar, and Jeremy T. Kerr. "Reconciling topographic and climatic effects on widespread and range‐restricted species richness." Global Ecology and Biogeography 18.6 (2009): 735-744.

      Comment:

      (3) Following weakness #3, are specialists the ones that "failed to shift" (L18)? If so please specify. The prediction about generalists vs specialists needs to be removed or incorporated in other parts of the paper.

      Thank you for pointing this out, we intended to suggest that species with more generalist habitat requirements might be better able to shift, but ultimately found that traits did not predict species’ shifts. We corrected our prediction regarding habitat generalists as follows: “We predicted that species able to use both lentic and lotic habitats would shift their phenologies and geographies more than those able to use just one habitat type, as generalists outperform specialists as climate and land uses change (Ball-Damerow et al., 2015, 2014; Hassall and Thompson, 2008; Powney et al., 2015; Rapacciuolo et al., 2017).” (Lines 128-132).

      Comment:

      (4) Following weakness #4, cite Pinkert et al at lines 70-73 and Rocha-Ortega et al at lines 73-77 along with https://doi.org/10.1098/rspb.2019.2645. Add Sandall et al https://doi.org/10.1111/jbi.14457 to L69 references.

      Thank you for the excellent reference suggestions, we have added them as suggested (Lines 80, 86, 77).

      Comment:

      Other comments/suggestions:

      (1) Title: consider adding temp variability 'Range geography and temperature variability, not functional traits,...'.

      Thank you for this suggestion, we have added temperature variability to the title: “Range geography and temperature variability explain cross-continental convergence in range and phenology shifts in a model insect taxon”.

      Comment:

      (2) L125: is (northern) Mexico included in North America?

      Yes, we did include observations from Northern Mexico, and have specified this in the text: “We retained ~1,100,000 records from Canada, the United States, and Northern Mexico, comprising 76 species (Figure 2).” (Lines 174-176).

      Comment:

      (3) L128: I'd label this section 'Temperature variability' rather than 'Climate data'.

      Thank you, we agree that this is a more appropriate title for this section, and have replaced ‘Climate data’ with ‘Temperature variability’ (Line 185).

      Comment:

      (4) Table 2: why are there no estimates for the traits?

      We apologise, this information should have been included in the main body of the manuscript, but was only explained in the Table 2 caption. We have added the following explanation: “Non-significant variables, specifically all functional traits, were excluded from the final models.” (Lines 312-323).

      Comment:

      (5) Figure 2: need to identify the A-D panels.

      We apologise for this error and have clarified the differences between panels in the figure caption:

      “Figure 2: Richness of 76 odonate species sampled in North America and Europe in the historic period (1980-2002; panes A and C) and the recent period (2008-2018; panes B and D). Species richness per 100 × 100 km quadrat is shown in panes A and B, while panes C and D show species richness per 200 × 200 km quadrat. Dark red indicates high species richness, while light pink indicates low species richness.” (Lines 1002-1006).

      Comment:

      (6) L163-173: I am not familiar with this analysis but it sounds interesting and promising, I am not sure if this can be clarified further. Why the -25 to 25, and -30 to 30, doesn't the -35 to 35 cover these? And what is meant by "include only phenology shifts that could be biologically meaningful", that larger shifts would not be meaningful or tied to climate change?

      We used different cutoffs for phenology shifts to inspect for outliers that were likely to be errors, potentially due to insufficient sampling to calculate phenology. We clarified this in the text as follows:

      “We retained emergence estimates between March 1st and September 1st, as well as species and quadrats that showed a difference in emergence phenology of -25 to 25 days, -30 to 30 days, or -35 to 35 days between both time periods, to include only phenology shifts that could be biologically meaningful to environmental climate change (i.e. exclude errors).” (Lines 169-173).
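
      The role of the three nested windows can be illustrated with a short sketch. This is a hypothetical Python example, not the analysis code used for the manuscript: the shift values and the function name `filter_shifts` are invented for illustration, and it assumes that each window is applied as a separate sensitivity check on the same set of emergence shifts.

```python
def filter_shifts(shifts_days, cutoff_days):
    """Keep phenology shifts (in days) whose magnitude is within the cutoff."""
    return [s for s in shifts_days if abs(s) <= cutoff_days]


# Hypothetical emergence shifts (recent minus historic period), in days.
shifts = [-40.0, -28.0, -10.0, 0.0, 12.0, 27.0, 33.0, 50.0]

# Apply each window separately so results can be compared across cutoffs.
retained = {cutoff: filter_shifts(shifts, cutoff) for cutoff in (25, 30, 35)}

for cutoff, kept in retained.items():
    print(f"+/-{cutoff} days: {len(kept)} of {len(shifts)} shifts retained")
```

      Because the windows are nested, every shift retained at ±25 days is also retained at ±35 days. Comparing model results across the three windows indicates whether conclusions depend on where the outlier threshold is drawn.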

      Comment:

      (7) L193-200: I agree but would make a distinction between ecological vs functional traits, as other studies view geographic traits as ecological manifestations of functional biology, e.g. https://doi.org/10.1016/j.biocon.2019.07.001 and https://doi.org/10.1016/j.biocon.2023.110098.

      Thank you for this suggestion, and for making us aware of the thinking around range geographies as ecological traits. We have specified throughout the manuscript that the ‘traits’ we are considering are ‘functional traits’, changed the methods subsection title to “Range geographies and functional traits” (Line 252), and added a brief discussion of ecological traits: “Geographic range and associated climatic characteristics are often considered ecological traits, as they are consequences of functional traits and their interactions with geographic features (Bried and Rocha-Ortega, 2023; Chichorro et al., 2019).” (Lines 256-259).

      Comment:

      (8) L203: What's the rationale for egg-laying habitat as "biologically relevant to spatial and temporal responses to climate change"? That one's not as obvious as the others and needs a sentence more. Also, I am wondering why other traits were not considered here, like color lightness and voltinism. And why not wing size instead of body size, or better yet the two combined (wing loading) as a proxy for dispersal ability?

      We agree that our rationale for using this trait should be better explained, and we have included the following explanation: “Egg laying habitat was assigned according to whether species use exophytic egg-laying habitat (i.e. eggs laid in water or on land, relatively larger in number), or endophytic egg-laying habitat (i.e. eggs laid inside plants, usually fewer in number); species using exophytic habitats are associated with greater northward range limit shifts (Angert et al., 2011).” (Lines 271-275).

      We considered traits that have been found to be important for range and phenology shifts among odonates, as well as being key traits for expectations for species responses to climate change. Flight duration and body size are correlated with dispersal ability (Powney et al. 2015). Body size is also correlated with competitive ability (Powney et al. 2015), potentially making it an important predictor of a species’ ability to establish and maintain populations in expanding range areas. Traits correlated with range shifts also include breeding habitat type (Powney et al. 2015; Bowler et al. 2021) and egg laying habitat (Angert et al. 2011). Ideally, we would have used dispersal data from mark/release/recapture studies, but it was not available for many of the species included in this study. After finding that none of the functional traits we included were related to range shifts, there was no reason to believe that a further investigation of traits would be meaningful.

      Angert AL, Crozier LG, Rissler LJ, Gilman SE, Tewksbury JJ, Chunco AJ. 2011. Do species’ traits predict recent shifts at expanding range edges? Ecology Letters 14:677–689. doi:10.1111/j.1461-0248.2011.01620.x

      Bowler DE, Eichenberg D, Conze K-J, Suhling F, Baumann K, Benken T, Bönsel A, Bittner T, Drews A, Günther A, Isaac NJB, Petzold F, Seyring M, Spengler T, Trockur B, Willigalla C, Bruelheide H, Jansen F, Bonn A. 2021. Winners and losers over 35 years of dragonfly and damselfly distributional change in Germany. Diversity and Distributions 27:1353–1366. doi:10.1111/ddi.13274

      Powney GD, Cham SSA, Smallshire D, Isaac NJB. 2015. Trait correlates of distribution trends in the Odonata of Britain and Ireland. PeerJ 3:e1410. doi:10.7717/peerj.1410

      Comment:

      (9) L210: I count at least 5 migratory species in table S3, so although maybe not enough to analyze it's misleading to say "nearly all" were non-migratory, revise to "most" or "vast majority".

      Thank you for pointing this out, we have made the suggested correction (Line 277).

      Comment:

      (10) L252-254: save this for the Discussion and write a more generalized statement for results to avoid citations in the results.

      Thank you for this suggestion, we have moved this to the discussion (Lines 517-527).

      Comment:

      (11) Figures S5 & S6: these are pretty important, I'd consider elevating them to the main document as one figure with two panels.

      Thank you for this suggestion, we agree these figures should be elevated to the main text, and have made them into a panel figure (Figure 4).

      Comment:

      (12) L305-307: great point and recommendation!

      Thank you very much for this positive feedback!

      Comment:

      (13) L335-336: another place to cite https://doi.org/10.1098/rspb.2019.2645 which includes a thermal sensitivity index and would add an odonate citation behind the statement.

      Thank you for this excellent suggestion, we have added this citation (Rocha-Ortega et al. 2020; Line 480).

      Comment:

      (14) L352-353: again see also https://doi.org/10.1098/rspb.2019.2645.

      Thank you for highlighting this reference, we have added it to Line 505 as suggested.

      Comment:

      (15) L355: revise "populations that coexist" to "species that co-occur" (big difference between population and species levels and between coexistence and co-occurrence).

      Thank you very much for pointing this out, we have made the suggested change (Line 507).

      Comment:

      (16) L359-365: are the winners and losers depicted in Figures S5 & S6? If so reference the figure (which I suggest combining and promoting to the main text), if not create a table listing the analyzed species and their winner/loser status.

      We agree that this is an excellent place to bring up Figures S5 and S6 from the supplemental. We have moved them to the main document as one figure and referenced it at line 510.

      Reviewer #2 (Recommendations for the authors):

      Comment:

      (1) Line 53-55: The claim that "These relationships generalize poorly taxonomically and geographically" is valid, but the study only tests Odonata on two continents.

      Thank you for this comment – the word ‘generalize’ may imply that our study tries to find a general pattern across many groups. We have changed the language to: “However, these relationships are inconsistent across taxa and regions, and cross-continental tests have not been attempted (Angert et al., 2011; Buckley and Kingsolver, 2012; Estrada et al., 2016; MacLean and Beissinger, 2017).” (Lines 57-59).

      Comment:

      (2) Line 58-59: Is this statement only true for Odonata? It does not seem to hold for plants, for example.

      Thank you for this comment – this statement references a meta-analysis of multiple animal and plant taxa, but the evidence for the importance of range location comes from animal taxa. We have specified that we are referring to animal species to clarify (Line 60).

      Comment:

      (3) Line 87-91: This section is difficult to understand and needs clarification.

      We have clarified this section as follows: “While warm-adapted species with more equatorial distributions could expand their ranges poleward following warming (Devictor et al., 2008), they could also increase in abundance in this new range area relative to species that historically occupied those areas and are less heat-tolerant (Powney et al., 2015).” (Lines 95-121).

      Comment:

      (4) Line 99-100: Please define "generalist" and "specialist" more clearly here (e.g., based on climate niche?).

      Thank you for pointing this out, we intended to suggest that species with more generalist habitat requirements might be better able to shift, but ultimately found that traits did not predict species’ shifts. We corrected our prediction regarding habitat generalists as follows: “We predicted that species able to use both lentic and lotic habitats would shift their phenologies and geographies more than those able to use just one habitat type, as generalists outperform specialists as climate and land uses change (Ball-Damerow et al., 2015, 2014; Hassall and Thompson, 2008; Powney et al., 2015; Rapacciuolo et al., 2017).” (Lines 128-132).

      Comment:

      (5) Line 122: Replace the English letter "X" in "100x100 km" with the correct mathematical symbol.

      We have made the suggested replacement throughout the manuscript.

      Comment:

      (6) Line 148: To address sampling effects, you could check the paper: https://onlinelibrary.wiley.com/doi/full/10.1111/gcb.15524. Additionally, maximum and minimum values are sensitive to extreme data points, so using 95% percentiles might be more robust.

      Thank you for sharing this paper, as it offers a valuable perspective on the study of species’ ranges. While our dataset is substantially composed of observations from adult sampling protocols, unlike the suggested paper which compares adults and juveniles, this is an interesting alternative approach.

      For our purposes it is meaningful to include outliers, as otherwise we may have missed individuals at the leading edge of range expansions. Our intent here was to detect range limits, as opposed to finding the central tendency of species distributions. This approach is widely accepted in the macroecology literature (e.g. Devictor et al., 2012, 2008; Kerr et al. 2015).
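
      The difference between the two ways of estimating a range limit can be sketched as follows. This is a hypothetical Python illustration, not the method implemented for the manuscript: the latitudes are invented, and the 95th percentile is computed with the standard library's inclusive quantile method, which is only one of several possible percentile definitions.

```python
import statistics

# Hypothetical latitudes (degrees N) of records for one species in one period.
# The last record represents a single individual at the leading edge.
latitudes = [41, 42, 43, 44, 45, 46, 47, 48, 49, 58]

# Range limit taken as the most extreme observation.
northern_limit_max = max(latitudes)

# Range limit taken as the 95th percentile: the last of 19 cut points
# when the data are divided into 20 equal-probability groups.
northern_limit_p95 = statistics.quantiles(latitudes, n=20, method="inclusive")[-1]

print(f"maximum: {northern_limit_max}, 95th percentile: {northern_limit_p95:.2f}")
```

      In this toy case the percentile estimate places the limit several degrees south of the leading-edge record, which is the behaviour we wished to avoid when the question concerns range limits rather than central tendency.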

      We have included the following discussion of our approach in the methods section:

      “We followed widely accepted methods to determine species range boundaries (Devictor et al., 2012, 2008; Kerr et al., 2015), although other methods exist that are appropriate for different data types and research questions i.e. (Ni and Vellend, 2021). We assigned species presences to 100×100 km quadrats, a scale that is large enough to maintain adequate sampling intensity but still relevant to conservation and policy (Soroye et al., 2020), to identify the best sampled species.” (Lines 168-173).

      Kerr JT, Pindar A, Galpern P, Packer L, Potts SG, Roberts SM, Rasmont P, Schweiger O, Colla SR, Richardson LL, Wagner DL, Gall LF, Sikes DS, Pantoja A. 2015. Climate change impacts on bumblebees converge across continents. Science 349:177–180. doi:10.1126/science.aaa7031

      Soroye P, Newbold T, Kerr J. 2020. Climate change contributes to widespread declines among bumble bees across continents. Science 367:685–688. doi:10.1126/science.aax8591

      Devictor V, Julliard R, Couvet D, Jiguet F. 2008. Birds are tracking climate warming, but not fast enough. Proceedings of the Royal Society B: Biological Sciences 275:2743–2748. doi:10.1098/rspb.2008.0878

      Devictor V, van Swaay C, Brereton T, Brotons L, Chamberlain D, Heliölä J, Herrando S, Julliard R, Kuussaari M, Lindström Å, Reif J, Roy DB, Schweiger O, Settele J, Stefanescu C, Van Strien A, Van Turnhout C, Vermouzek Z, WallisDeVries M, Wynhoff I, Jiguet F. 2012. Differences in the climatic debts of birds and butterflies at a continental scale. Nature Clim Change 2:121–124. doi:10.1038/nclimate1347

      Comment:

      (7) Line 195: The species' climate niche should also be considered a product of evolution.

      Thank you for this suggestion. To address this comment and a comment from another reviewer, we changed the text to the following: “Geographic range and associated climatic characteristics are often considered ecological traits, as they are consequences of functional traits and their interactions with geographic features (Bried and Rocha-Ortega, 2023; Chichorro et al., 2019).” (Lines 256-259).

      Comment:

      (8) Line 244: This speculative statement belongs in the Discussion section.

      Thank you for this suggestion, we have moved this statement to the discussion (Lines 451-453).

      Comment:

      (9) Line 252-254: The projection of Coenagrion mercuriale's range contraction is not part of your results and should be clarified or removed.

      Following this suggestion and a similar suggestion from another reviewer, we moved this text to the discussion (Line 517-527).

      Comment:

      (10) Line 314-316: If the species can tolerate warmer temperatures better, why would they migrate?

      We apologize for the confusion, and we have reworded the section as follows: “Emerging mean conditions in areas adjacent to the ranges of southern species may offer opportunities for range expansions of these relative climate specialists, which can then tolerate climate warming in areas of range expansion better than more cool-adapted historical occupants (Day et al., 2018).” (Lines 445-448).

      Comment:

      (11) Line 334-335: Species' tolerance to temperature likely depends on their traits, which were not tested in this study. This should be noted.

      We agree, and we have removed the wording “rather than traits” from this sentence (Line 479).

      Reviewer #3 (Recommendations for the authors):

      Comment:

      (1) Title: The title is too general not specifying that your results are on odonates only, but also stressing the implicit role of climate change to a degree the tests do not support.

      Following this comment and a suggestion from another reviewer we changed the title to the following: “Range geography and temperature variability explain cross-continental convergence in range and phenology shifts in a model insect taxon”. We wanted to emphasize our use of Odonates as a model species that we used to ask broad questions, while being more specific about the climatic variable that we examined (temperature variability).

      Comment:

      (2) L32: consider including Novella-Fernandez et al. 2023 (NatCommun) which addresses this topic in Odonates.

      Thank you for suggesting this very interesting paper, we have added it as a citation (Line 31-32).

      Comment:

      (3) L35: consider including Grewe et al. 2013 (GEB) and Engelhardt et al. 2022(GCB).

      Thank you for these excellent suggestions, we have added the citations (Line 35).

      Comment:

      (4) L47: rather write 'result from' instead of 'driven by'.

      We agree this is a better characterization and have corrected the wording (Line 48-49).

      Comment:

      (5) L49-52: There has been a recent study on this topic for birds (Neate-Clegg et al., 2024 NEE). However, making this specific to insects would not make it less relevant. This review for odonates might be helpful in this regard (Pinkert et al. 2022, chapter "Odonata as focal taxa for biological responses to climate change" in Córdoba-Aguilar et al. (2022), Dragonflies & Damselflies: Model Organisms for Ecological and Evolutionary Research).

      Thank you for again suggesting excellent references, we have added them to line 52-53, as well as adding the Pinkert citation to lines 61 and 82.

      Comment:

      (6) L53-66: Combine into one paragraph about drivers. With traits first and the environment second. The natural land cover perspective may be too complicated in this context. Consider focusing on generalities of the impact of changes within species' ranges.

      As suggested we have combined these into one paragraph about drivers (Line 59).

      Comment:

      (7) L67-69: The book from before would be a much stronger reference for this claim. Kalkmann et al (2018) do not address the emphasis of global change research in insects on bees and butterflies. Also, I would highlight that most of the current work is at a national scale, rather than cross-continental.

      Thank you for this suggestion, we have added the suggested reference and included that “…recently assembled databases of odonate observations provide a rare opportunity to investigate species’ spatiotemporal responses at larger taxonomic and spatial scales, particularly as most work has been done at national scales.” (Lines 75-77).

      Comment:

      (8) L68: consider rephrasing this part to '..provide a rare opportunity to investigate spatiotemporal biotic responses at larger taxonomic and spatial scales'

      We appreciate this suggestion and really like the wording. We have changed the phrase to read as follows: “While global change research on insects often emphasizes butterfly and bee taxa, recently assembled databases of odonate observations provide a rare opportunity to investigate species’ spatiotemporal responses at larger taxonomic and spatial scales, particularly as most work has been done at national scales.” (Lines 74-77).

      Comment:

      (9) L69: This characteristic is not unique to odonates and would hamper drawing general conclusions. Honestly, I think the detailed and comprehensive data on them is the selling point.

      Thank you for this suggestion, we have edited the sentence to emphasize their use as an indicator species: “Due to their use of aquatic and terrestrial habitat across life different stages, dragonflies and damselflies are also considered indicator species for both terrestrial and aquatic insect responses to changing climates (Hassall, 2015; Pinkert et al., 2022; Šigutová et al., 2025), giving the study of these species broad relevance for conservation.” (Lines 78-81)

      Comment:

      (10) L73: Indicator for what? The first part of the sentence would suggest lesser surrogacy for responses of other taxa. Reconsider this statement. They are well-established indicators for habitat intactness and freshwater biodiversity. Darwell et al. suggested their diversity can serve as a surrogate for the diversity of both terrestrial and aquatic taxa.

      Thank you for this suggestion, we have edited the sentence to emphasize their use as an indicator species: “Due to their use of aquatic and terrestrial habitat across life different stages, dragonflies and damselflies are also considered indicator species for both terrestrial and aquatic insect responses to changing climates (Hassall, 2015; Pinkert et al., 2022; Šigutová et al., 2025), giving the study of these species broad relevance for conservation.” (Lines 78-81)

      Comment:

      (11) L76: Fritz et al., is a study on mammals, not odonates.

      Thank you for pointing out this error, the reference has been removed (Line 84-85).

      Comment:

      (12) L84: Lotic habitats are generally better connected than lentic ones. Lentic species are considered to have a greater propensity for dispersal DUE to the lower inherent spatiotemporal stability (implying lower connectivity) compared to lotic habitats.

      Thank you for your comment, we have rewritten this section as follows: “For example, differences in habitat connectivity and dispersal ability may constrain range shifts for lentic species (those species that breed in slow moving water like lakes or ponds) and lotic species (those living in fast moving-water) in different ways (Kalkman et al., 2018). More southerly lentic species may expand their range boundaries more than lotic species, as species accustomed to ephemeral lentic habitats better dispersers (Grewe et al., 2013), yet lotic species have also been found to expand their ranges more often than lentic species, potentially due to the loss of lentic habitat in some areas (Bowler et al., 2021).” (Lines 88-95).

      Comment:

      (13) L90: I would be cautious with this interpretation. If only part of the range is considered (here a country in the northern Hemisphere) southern species are moving more of their range into and northern species more of their range out of the study area in response to warming (implying northward shifts).

      We have clarified this section as follows: “While warm-adapted species with more equatorial distributions could expand their ranges poleward following warming (Devictor et al., 2008), they could also increase in abundance in this new range area relative to species that historically occupied those areas and are less heat-tolerant (Powney et al., 2015).” (Lines 95-121)

      Comment:

      (14) L117: Odonata Central contains many county centroids as occurrence records. These could be an issue for your use case. I may have overlooked the steps you took to address this, but I think this requires at least more detail and possibly further removal/checks using for instance CoordinateCleaner. The functions implemented in this package allow you to filter records based on political units to avoid exactly this source of error.

      Thank you for this suggestion, we weren’t aware of this issue with Odonata Central. We used the CoordinateCleaner package in R to filter all odonate records that we used in our analyses. Less than 1% of observations in our dataset were identified as having potential problems by the tool, so we would not expect this to affect our inferences. However, in future we will employ this tool when using similar datasets.

      Comment:

      (15) L119: Please add a brief explanation of why this was necessary. I am ok with something along the lines in the supplement.

      We moved this information from the supplemental to the main text as follows: “If a species was found on both continents, we only retained observations from the continent that was the most densely sampled. If we merged data for one species found on both continents, we could not perform a cross-continental comparison. However, if the same species on different continents was treated as different species, this would lead to uninterpretable outcomes (and the creation of pseudo-replication) in the context of phylogenetic analyses. In addition, species found on both continents did not have sufficient data to meet criteria for the phenology analysis.” (Lines 161-167).

      Comment:

      (16) L132: The letters 'X' or 'x' are not multiplication symbols! Please change to the math symbol (×), everywhere.

      Thank you for pointing out this error, we have made the correction throughout the manuscript.

      Comment:

      (17) L133: add 'main' before 'flight period'

      Thank you for this suggestion, we have made the change. (Line 190)

      Comment:

      (18) L135: I suggest using the coefficient of variation, as it is controlled for the mean. Otherwise, what you see is partly the signature of temperature and not of its variation. For me, it's very difficult to understand what this variation of the variation means and at least needs more explanation.

      Thank you very much for this suggestion, we agree that using the coefficient of variation is a better fit for the question that we’re asking. We re-ran our analyses with the coefficient of variation as the measure of climate variability: all the results reported in the manuscript are now updated for that analysis (Line 377, Table 2), and we have also updated the methods section (Line 191). The results are qualitatively the same as in our previous analysis, but we agree that they are now easier to interpret.
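
      The reviewer's point about controlling for the mean can be shown with a small worked example. This is a hypothetical Python sketch rather than the code used for the manuscript: the two temperature series are invented, the function name is ours, and it assumes temperatures are expressed on a scale where the mean is strictly positive, since the coefficient of variation is undefined for a mean of zero and is sensitive to where the zero of the scale lies.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    mean = statistics.fmean(values)
    if mean <= 0:
        raise ValueError("coefficient of variation requires a positive mean")
    return statistics.pstdev(values) / mean


# Hypothetical flight-period temperatures for two quadrats with the same
# spread around different means.
cool_quadrat = [10.0, 12.0, 14.0]
warm_quadrat = [20.0, 22.0, 24.0]

sd_cool = statistics.pstdev(cool_quadrat)
sd_warm = statistics.pstdev(warm_quadrat)
cv_cool = coefficient_of_variation(cool_quadrat)
cv_warm = coefficient_of_variation(warm_quadrat)

print(f"SD: cool={sd_cool:.3f}, warm={sd_warm:.3f}")
print(f"CV: cool={cv_cool:.3f}, warm={cv_warm:.3f}")
```

      The two quadrats have identical standard deviations, but the cooler quadrat has the larger coefficient of variation, so the measure expresses variability relative to the mean temperature rather than in absolute terms.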

      Comment:

      (19) L155: Please adequately reference all R packages (state the name, and a reference for them including the authors' names, title, and version).

      Thank you for pointing out this omission, we have added reference information for the glm function in base R (Line 298) and ensured all other packages are properly referenced.

      Comment:

      (20) L207: Mention the literature sources here (again).

      We agree that they should be referenced here again, and we have done so (Lines 267-268).

      Comment:

      (21) L209: You could use the number of grid cells as a proxy for range size.

      Following this excellent suggestion, we re-analysed our data using range size, calculated as the number of quadrats occupied by a species in the historical time period, as a predictor. Range size was not significant in our models, but we believe this is the best way to analyze our data, and so have updated our methods (Lines 261-263) and results (375-378).
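
      The range size predictor described above can be sketched as a count of occupied quadrats. This is a hypothetical Python illustration, not the code used for the manuscript: the records are invented, the names `quadrat_id` and `range_sizes` are ours, and it assumes that record coordinates have already been projected to an equal-area grid measured in kilometres.

```python
import math
from collections import defaultdict

QUADRAT_KM = 100.0  # side length of a quadrat, matching the 100 x 100 km scale


def quadrat_id(x_km, y_km, size_km=QUADRAT_KM):
    """Return the (column, row) index of the quadrat containing a point."""
    return (math.floor(x_km / size_km), math.floor(y_km / size_km))


def range_sizes(records):
    """Count the distinct quadrats occupied by each species.

    `records` is an iterable of (species, x_km, y_km) tuples from the
    historical period.
    """
    occupied = defaultdict(set)
    for species, x_km, y_km in records:
        occupied[species].add(quadrat_id(x_km, y_km))
    return {species: len(cells) for species, cells in occupied.items()}


# Hypothetical historical-period records.
records = [
    ("species_a", 10.0, 10.0),
    ("species_a", 90.0, 95.0),   # same quadrat as the record above
    ("species_a", 150.0, 20.0),  # neighbouring quadrat to the east
    ("species_b", -5.0, 5.0),
    ("species_b", -50.0, 60.0),  # same quadrat as the record above
]

sizes = range_sizes(records)
print(sizes)
```

      Using sets means that repeated observations within one quadrat count once, so the measure reflects occupied area rather than sampling intensity within a quadrat.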

      Comment:

      (22) L218: It would be preferable to say 'species-level' instead of 'by-species'.

      Thank you for this suggestion, we agree that this is clearer and made the change (Line 298).

      Comment:

      (23) L219-220: this is unclear. Please rephrase.

      We have clarified as follows: “We used both species-level frequentist (GLM; glm function in R) and Bayesian (Markov Chain Monte Carlo generalized linear mixed model, MCMCglmm; Hadfield, 2010) models to improve the robustness of the results.” (Lines 298-300).

      Comment:

      (24) L224: At least for Europe there is a molecular phylogeny available, which you should preferably use (Pinkert et al. 2018, Ecography). Otherwise, I am ok with using what is available

      We apologize that the nature of the phylogeny that we used was not clear; the phylogeny that we used was built similarly to that in Pinkert et al. 2018, Ecography. It created a molecular phylogeny with a morphological/taxonomic tree as the backbone tree, so that species could only move within their named genera or families. We clarified this in the manuscript as follows:

      “We used the molecular phylogenetic tree published by the Odonate Phenotypic Database (Waller et al., 2019), which used a morphological and taxonomic phylogeny as the backbone tree, allowing species to move within their named genera or families according to molecular evidence (Waller and Svensson, 2017).” (Lines 302-305).

      Comment:

      (25) L233: You said so earlier (1st sentence of this paragraph).

      Thank you for pointing this out, we removed the repetitive sentence (Line 323).

      Comment:

      (26) L236-238: To me, it makes more sense to test this prior to fitting the phylogenetic models.

      MCMC-GLMM is considerably less familiar to most researchers than general linear models or their derivatives/descendants, such as PGLS. We report models both with and without phylogenetic relationships included for the sake of transparency, and we are happy to acknowledge that no interpretation here changes substantially relative to these decisions. However, failing to report models that included possible (if small) effects of phylogenetic relatedness might cause some readers to question what those models might have implied. For the moment, we are opting for the most transparent reporting approach here.

      Comment:

      (27) L241: Rather say directly XX of XX species in our data....

      (28) L245: Same here. Provide the actual numbers, please.

      Thank you for this suggestion, we made this change on Line 332 and Line 334.

      Comment:

      (29) L247-249: Then not necessary.

      This issue highlights a challenge in the global biology literature and around the issue of biodiversity monitoring for understanding global change impacts on species. Almost no studies have been able to report simultaneous range and phenology shifts, and the literature addresses these biotic responses to global change predominantly as distinct phenomena. Differences in numbers of species for which these observations exist, even among the extremely widely-observed odonates, seems to us to be a meaningful issue to report on. If the reviewer prefers that we abbreviate or remove this sentence, we are happy to do so.

      Comment:

      (30) L251:261: That is discussion as you interpret your results.

      Following your suggestion and the suggestion of another reviewer, we moved the following lines to the discussion section: “Species that did not shift their ranges northwards or advance their phenology included Coenagrion mercuriale, a European species that is listed as near threatened by the IUCN Red List (IUCN, 2021), and is projected to lose 68% of its range by 2035 (Jaeschke et al., 2013).” (Lines 517-527).

      Comment:

      (31) 252: Good to mention, but why is the discussion limited to C. mercuriale?

      We feel that it is important to link the broad-scale results to the specific biological characteristics of individual species, and C. mercuriale is an IUCN threatened species. We are happy to expand links to natural history of this group and have added the following: “This group also includes Coenagrion resolutum, a common North American damselfly (Swaegers et al., 2014), for which we could not find evidence of decline. This may be due in part to the greater area of intact habitat available in North America compared to Europe, enabling C. resolutum to maintain larger populations that are less vulnerable to stochastic climate events. Still, this and other species failing to shift in range or phenology should be assessed for population health, as this species could be carrying an unobserved extinction debt.” (Lines 527-533).

      Comment:

      (32) L264: Insert 'being' before 'consistently'.

      Thank you for the suggestion, we made this change (Line 373).

      Comment:

      (33) L271: .'. However,'.

      Thank you for pointing out this grammatical error, we have corrected it (Line 382).

      Comment:

      (34) L273: 'affected' instead of 'predicted'

      Thank you for the suggestion, we made this change (Line 383).

      Comment:

      (35) L279: 'despite pronounced recent warming' sounds not relevant in this context.

      Thank you for this suggestion, we removed this portion of the sentence (Line 408).

      Comment:

      (36) L281: Rather 'the model performance did not improve....'

      Thank you for the suggestion, we made this change (Line 409).

      Comment:

      (37) L288: Add 'but' before 'not'.

      Thank you for the suggestion, we made this change (Line 416).

      Comment:

      (38) L311-316: Reconsider the causality here. maybe rather rephrase to are associated instead. Greater dispersal ability and developmental plasticity might well lead to higher growth rates, rather than the other way around.

      We agree that plasticity/evolution at range edges is important to consider and have included it as an alternative explanation: “Adaptive evolution and plasticity may enable higher population growth rates in newly-colonized areas (Angert et al., 2020; Usui et al., 2023), but this possibility can only be directly tested with long term population trend data.” (Lines 449-451).

      Comment:

      (39) L313-316: Maybe delete the second 'should be able to'.

      This phrase has been changed in response to other reviewer comments and now reads as follows:

      “Emerging mean conditions in areas adjacent to the ranges of southern species may offer opportunities for range expansions of these relative climate specialists, which can then tolerate climate warming in areas of range expansion better than more cool-adapted historical occupants (Day et al., 2018).” (Lines 445-448).

      Comment:

      (40) L331: Limit this statement ending with 'in North American and European Odonata'.

      Thank you for this suggestion, we made this addition (Lines 475-476).

      Comment:

      (41) L346-347: There are too many of these more-research-is-needed statements in the discussion (at least three in the last paragraphs). Please consider finishing the paragraphs rather with a significance statement.

      Thank you for this suggestion, we have changed the final sentence here to the following: “The extent to which species’ traits actually determine rates of range and phenological shifts, rather than occasionally correlated with them, is worth considering further, but functional traits do not systematically drive patterns in these shifts among Odonates in North America and Europe.” (Lines 480-483).

      We also made additional changes, removing a ‘more-research is needed’ statement from the following paragraph (Line 443), as well as from line 499.

      Comment:

      (42) L349: See also Franke et al. (2022, Ecology and Evolution).

      Thank you for highlighting this excellent reference! We have added it to Line 501.

      Comment:

      (43) L363: Maybe a bit late in the text, but it is important to note that there is the third dimension 'abundance trends' or rather a common factor related to range and phenology shifts. I feel this fits better with the discussion of population growth.

      Thank you for this suggestion, we have addressed the importance of abundance trends in the following sentences: “Further mechanistic understanding of these processes requires abundance data.” (Lines 442-443); “It remains unclear if range and phenology shifts relate to trends in abundance, but our results suggest that there are clear ‘winners’ and ‘losers’ under climate change.” (Lines 509-510).

      Comment:

      (44) L375-377: This last sentence is very similar to L371-373. Please reduce the redundancy. Focus more on specifically stating the process instead of vaguely saying 'new insights into patterns' and 'suggesting processes'. Rather, deliver a strong concluding message here.

      Thank you for this suggestion, we feel that we now have a much stronger concluding message: “By considering both the seasonal and range dynamics of species, emergent and convergent climate change responses across continents become clear for this well-studied group of predatory insects.” (Lines 545-547).

      Comment:

      (45) Table 1: To me, the few estimates presented here do not justify a table. rather include them in the text. OR combine them with Table 2. Also, why not include the traits as predictors (from the range shift models) in these models as well?

      We have clarified in the text that the results displayed in Table 1 are from the analysis of the relationship between range and phenology shifts: “The effect of species’ range shifts on phenology range shifts was significant in our model investigating the relationship between these responses, indicating that species shifting their northern range limits to higher latitudes also showed stronger advances in their emergence phenology (Figure 3).” (Lines 341-344).

      As there were no significant effects in the model of phenology change drivers, we have not shown results of this model: “Emergence phenology shifts were not affected by species’ traits, range geography, nor climate variability; due to this, model results are not displayed here.” (Lines 383-384).

      Comment:

      (46) Table 2: L712-713: What does this mean? Are phenology shifts not used as a predictor of range shifts? (why then this comment?). Or do you want to say phenological shifts are not related to Southern range etc? Why do you present a phylosig here but not in Table 1? Why not include the traits as predictors (from the range shift models) in these models as well? Consider using the range size as a continuous predictor instead of 'Widespread'.

      We are glad the reviewer pointed this out to us. We did not emphasize this issue sufficiently. We DID evaluate traits as predictors both of geographical range and phenological shifts, and species-specific biological traits did not significantly affect models predicting either of those sets of responses. We state this on Lines 312-323, but we have also noted in the discussion (Lines 473-476) that the most commonly assessed traits, like body size, do not alter observed trends here. Instead, where species are found, rather than the characteristics of species, is the key determinant of their overall responses.

      Following this excellent suggestion, we re-analysed our data using range size, calculated as the number of quadrats occupied by a species in the historical time period, as a predictor. Range size was not significant in our models, but we believe this is the best way to analyze our data, and so have updated our methods (Lines 261-263) and results (375-378).
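
      As an illustration of the range-size predictor described above (the number of quadrats a species occupied in the historical period), the count can be sketched as below. This is a minimal sketch, not the authors' code: the record layout `(species, quadrat id, year)`, the function name `range_size`, and the example records are assumptions made for illustration; only the 1980-2002 historical window comes from the text.

```python
from collections import defaultdict


def range_size(records, period=(1980, 2002)):
    """Count distinct quadrats occupied per species within `period` (inclusive).

    `records` is an iterable of (species, quadrat_id, year) tuples; this
    layout is a hypothetical choice for the sketch.
    """
    start, end = period
    quadrats = defaultdict(set)
    for species, quadrat_id, year in records:
        if start <= year <= end:
            quadrats[species].add(quadrat_id)
    return {species: len(ids) for species, ids in quadrats.items()}


# Hypothetical occurrence records, not data from the study.
records = [
    ("Coenagrion mercuriale", "q1", 1985),
    ("Coenagrion mercuriale", "q1", 1990),  # repeat visit to the same quadrat
    ("Coenagrion mercuriale", "q2", 1999),
    ("Coenagrion mercuriale", "q3", 2010),  # recent period, so not counted
    ("Coenagrion resolutum", "q7", 2001),
]
sizes = range_size(records)
```

      Counting distinct quadrats rather than raw records keeps heavily sampled quadrats from inflating the predictor.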

      Comment:

      (47) Figure 1: I don't see any grey points in the figure. Also, there is no A or B. If you are referring to the symbols then write cross and triangle instead and not use capital letters which usually refer to component plots of composite figures. Also, I highly recommend providing a similar figure based on your data (maybe each species as a dot for T1 and another symbol for T2). Given the small number of species, you could try to connect these points with arrows. For the set with only range shifts maybe play the T2-dots at the center of the 'Emergence' axis.

      Thank you for pointing out this error: a previous version of Figure 1 included grey points and multiple panels. We have removed this text from the figure caption to be consistent with the final version of the figure (Line 989).

      The graphical depictions of the conceptual and empirical discoveries in this paper were challenging to create. The reviewer might be suggesting effectively decomposing Figure 3 (change in range on the y axis vs change in phenology among all species) into two sets of points on the same graph, where each pair of points is a before and after value for each species. This would make for a very busy figure indeed. We have modified the conceptual Figure 1 to illustrate more clearly, we believe, that species can (in principle) remain within tolerable niche spaces by shifting their activity periods in time (phenology) or in space (geographical range) or both.

      Comment:

      (48) Figure 2: Please add a legend. Also black is a poor background color. The maps appear to be stretched. Please check aspect ratios. Now here are capital letters without an explanation in the caption. From the context I assume the upper panel maps are for the data used to calculate range shifts at the bottom panel maps are for data used to calculate the phenological shifts.

      We apologise for the error in the figure caption and have clarified the differences between panels in the text, as well as changing the map background colour and fixing the aspect ratio:

      “Figure 2: Richness of 76 odonate species sampled in North America and Europe in the historic period (1980-2002; panes A and C) and the recent period (2008-2018; panes B and D). Species richness per 100 × 100 km quadrat is shown in panes A and B, while panes C and D show species richness per 200 × 200 km quadrat. Dark red indicates high species richness, while light pink indicates low species richness.” (Lines 1002-1006).

      Comment:

      (49) Figure 3: Why this citation? Of terrestrial taxa? Please explain. Consider adding some stats here, such as the r-squared value for each of the relationships.

      We have better explained the citation in the figure caption, as well as adding r-squared values:

      “Figure 3: Relationship between range shifts and emergence phenology shifts among North American and European odonate species (N = 66; model R2 = 17.08 for glm, 14.9% for MCMCglmm). For reference, the shaded area shows mean latitudinal range shifts of terrestrial taxa as reported by Lenoir et al. (2020; calculated as the yearly mean dispersal rate of 1.11 +/- 0.96 km per year over 38 years).” (Lines 679-682)
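
      The reference band in this caption follows from simple arithmetic on the quoted rate. The sketch below reproduces it; the assumption that the +/- value scales linearly with the number of years is ours, and the variable names are illustrative.

```python
# Values quoted in the Figure 3 caption (Lenoir et al., 2020).
rate_km_per_year = 1.11  # yearly mean dispersal rate
rate_spread = 0.96       # reported +/- value on that rate
years = 38

mean_shift_km = rate_km_per_year * years  # about 42.18 km
spread_km = rate_spread * years           # about 36.48 km, assuming linear scaling
band = (mean_shift_km - spread_km, mean_shift_km + spread_km)

print(f"{mean_shift_km:.2f} +/- {spread_km:.2f} km")
print(f"band: {band[0]:.2f} to {band[1]:.2f} km")
```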

      Comment:

      (50) L801: What are these underscored references?

      This was an issue with the reference software and has been resolved.

      Comment:

      (51) Table S1: L848: Consider starting with 'Samples of 76 North American and European odonate species from between ...'. Please use a horizontal line to separate the content from the table header. Add a horizontal line below the last row. Same for all tables.

      Thank you for this suggestion, we have edited the caption for Figure S1 as suggested (Line 1124). We have also made the suggested line additions to Table S1, S2, and S3.

      Comment:

      (52) Table S3: This is confusing. In Table 1 (main text) both 'southern range' and 'widespread' are used as predictors. Please explain.

      We originally included information on species range geography, including southern versus northern range, and widespread versus not, into one categorical variable. Following additional comments we re-analysed our data using range size, calculated as the number of quadrats occupied by a species in the historical time period, as a predictor. Now the methods section text (Lines 261-263) and Table 1 report results of that variable with distribution options northern, southern, or both. 

      Comment:

      (53) Figure S5 and S6: It would be more coherent if the colors refer to the continents and the suborders are indicated by shading. I would love to see a combination of the two figures with species ordered by the phylogenetic relationship and a dot matrix indicating the traits in the main text! This could really be a good starting point for a synthesis figure.

      The reviewer presents an interesting challenge for us. We have a choice, as we understand things, to present a figure showing phylogeny and traits (as requested here), or an ordered list of species relative to effect sizes in the two main responses to global change. The latter choice centers on the discoveries of the paper, while the former would be valuable for dragonfly biology but would depict information that proved to be biologically uninformative relative to our discovery. That is to say, there is no phylogenetic trend and biological traits among species did not affect results. We have gone some way toward illustrating that issue by retaining phylogeny in the MCMC-GLMM models, but we feel that a figure illustrating phylogeny and traits would (for most readers, at least) illustrate noise, rather than signal. For this reason, we have opted to take on the previous reviewer’s suggestion for a modified, main-text Figure 4, which we include below.

      Figure 4: Distribution of Northern range limit shifts (Panel A, kilometers) and emergence phenology shift (Panel B, Julian day) of 76 European and North American odonate species between a recent time period (2008 - 2018) and a historical time period (1980 - 2002). Anisoptera (dragonflies) are shown in pink, Zygoptera (damselflies) are shown in blue.


    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      (1) The bad equilibria of the model still remain a concern, as well as other features like the transient overshoots that do not match with the data. I think they could achieve more accuracy here by assigning more weight to such specific features, through adding these as separate objectives for the generator explicitly. The traces contain a five-second current step, and one second before and one second after the training step. This means that in the RMSE, the current step amplitude will dominate as a feature, as this is simply the state for which the data trace contains most time-points. Note that this is further exacerbated by using the IV curve as an auxiliary objective. I believe a better exploration of specific response features, incorporated as independently weighted loss terms for the generator, could improve the fit. E.g. an auxiliary term could be the equilibrium before and after the current step, another term could penalise response traces that do not converge back to their initial equilibrium, etc.

      We thank the reviewer for the suggestion. We supplemented the membrane potential regression loss with errors computed for 3 intervals (the pre-, mid-, and post-stimulation time intervals), improving the accuracy of EP-GAN for baseline membrane potential responses (Figure 2, 3, Table S2, S3). We also changed the simulation protocols for generated parameters by allowing a longer simulation time of 15 seconds, where the stimulation is applied during [5, 10] seconds and no stimulation is applied at t = [0, 5) (pre-stimulation) and t = (10, 15] (post-stimulation). These time intervals are chosen to ensure sufficient stabilization periods before and after stimulation.
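
      The idea of scoring the three intervals separately can be sketched as follows. This is a minimal sketch under stated assumptions, not EP-GAN's actual loss: the function names, the equal default weights, and the toy traces are hypothetical; only the interval boundaries ([0, 5), [5, 10], (10, 15] seconds) come from the response above.

```python
import math


def rmse(pred, target):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def interval_errors(times, pred, target, stim_start=5.0, stim_end=10.0):
    """RMSE per interval: pre [0, 5), mid [5, 10], post (10, 15].

    Each interval must contain at least one sample.
    """
    groups = {"pre": ([], []), "mid": ([], []), "post": ([], [])}
    for t, p, y in zip(times, pred, target):
        key = "pre" if t < stim_start else "mid" if t <= stim_end else "post"
        groups[key][0].append(p)
        groups[key][1].append(y)
    return {key: rmse(p, y) for key, (p, y) in groups.items()}


def interval_loss(times, pred, target, weights=None):
    """Weighted sum of per-interval errors (equal weights are an assumption)."""
    weights = weights or {"pre": 1.0, "mid": 1.0, "post": 1.0}
    errs = interval_errors(times, pred, target)
    return sum(weights[key] * errs[key] for key in errs)


# Toy traces (mV): the prediction is right during the step but fails to
# return to the -40 mV baseline afterwards.
times = [0.5 * i for i in range(31)]  # 0 to 15 s
target = [-20.0 if 5.0 <= t <= 10.0 else -40.0 for t in times]
pred = [-20.0 if 5.0 <= t <= 10.0 else (-30.0 if t > 10.0 else -40.0) for t in times]

errs = interval_errors(times, pred, target)
full = rmse(pred, target)
loss = interval_loss(times, pred, target)
```

      In this toy case the whole-trace RMSE is smaller than the post-stimulation error, which is the dilution effect the reviewer describes: a failure to return to baseline is averaged down when all time points are pooled.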

      (2) The explanation of what the authors mean with 'inverse gradient operation' is clear now. However, this term is mathematically imprecise, as the inverse gradient does not exist because the gradient operator is not injective. The method is simply forward integration under the assumption that the derivative of the voltage is known at the grid time-points, and should be described as such.

      We thank the reviewer for the clarification on inverse gradient operation terminology. In the Methods section, we changed the term describing the inverse gradient operation to ‘forward integration’, which describes the process more accurately.
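
      For readers unfamiliar with the term, forward integration from known derivatives can be sketched as below. This is a generic forward Euler scheme, assumed here for illustration; the manuscript's exact integration scheme is not given in this response, and the function name and values are hypothetical.

```python
def forward_integrate(v0, dvdt, dt):
    """Rebuild a voltage trace from its time derivative on a uniform grid.

    Uses forward Euler: v[k + 1] = v[k] + dt * dvdt[k]. The initial value
    v0 must be supplied, because the derivative alone fixes the trace only
    up to an additive constant (the reason no 'inverse gradient' exists).
    """
    trace = [v0]
    for slope in dvdt:
        trace.append(trace[-1] + dt * slope)
    return trace


# Hypothetical derivatives (mV/s) on a 0.5 s grid, starting from -40 mV.
trace = forward_integrate(-40.0, [2.0, 2.0, 0.0, -1.0], dt=0.5)
```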

      (3) I appreciate that the authors' method provides parameters of models at a minimal computational cost compared to running an evolutionary optimization for every new recording. I also believe that with some tweaking of the objective, the method could improve in accuracy. However, I share reviewer 2's concerns that the evolutionary baseline methods are not sufficiently explored, as these methods have been used to successfully fit considerably more complex response patterns. One way out of the dilemma is to show that the EP-GAN estimated parameters provide an initial guess that considerably narrows the search space for the evolutionary algorithm. In this context, the authors should also discuss the recent gradient based methods such as Deistler et al. (https://doi.org/10.1101/2024.08.21.608979) or Jones et al (https://doi.org/10.48550/arXiv.2407.04025).

      We supplemented the optimization setup for existing methods (GDE3, NSDE, DEMO, and NSGA2) by incorporating steady-state response constraints as the initial selection process. The process is similar to that of EP-GAN training data generation and the DEMO parameter selection process [16] (see Results section, page 6 for detail). We also expanded the testing scenarios by evaluating all methods with respect to both small and large HH-model estimation. The small HH-model scenario estimates 47 parameters consisting of channel conductance, reversal potentials and initial conditions, with the channel parameters (n = 129) frozen to default values in [41]. The large HH-model scenario estimates the 129 channel parameters in addition to the 47 parameters by considering ±50% variations from their default values. For both small and large HH-model scenarios, we test total sample sizes of both 32k and 64k for all methods to evaluate their scalability with the number of simulated samples given during optimization. The results show that existing methods perform well in small HH-model scenarios and scale with sample size, consistent with the literature. EP-GAN, on the other hand, shows overall better performance in predicting membrane potential responses in both small and large HH-model scenarios.
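
      One way to read the ±50% variation is as a bound around each default value within which candidate parameters are drawn. The sketch below assumes uniform sampling inside that bound; the sampling distribution, the function name, and the parameter names and defaults are all hypothetical and are not taken from the manuscript.

```python
import random


def sample_within_fraction(defaults, fraction=0.5, rng=None):
    """Draw each parameter uniformly within +/- `fraction` of its default."""
    rng = rng or random.Random()
    return {
        name: value * (1.0 + rng.uniform(-fraction, fraction))
        for name, value in defaults.items()
    }


# Hypothetical channel parameters and default values.
defaults = {"v_half_mV": -25.0, "slope_mV": 8.0, "tau_ms": 12.0}
sample = sample_within_fraction(defaults, fraction=0.5, rng=random.Random(0))
```

      Scaling by `1 + u` keeps the sign of each default, so negative defaults such as half-activation voltages stay within ±50% of their magnitude.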

      Reviewer #2 (Public review):

      Major 1: Models do not faithfully capture empirical responses. While the models generated with EP-GAN reproduce the average voltage during current injections reasonably well, the dynamics of the response are generally not well captured. For example, for the neuron labeled RIM (Figure 2), the most depolarized voltage traces show an initial 'overshoot' of depolarization, i.e. they depolarize strongly within the first few hundred milliseconds but then fall back to a less depolarized membrane potential. In contrast, the empirical recording shows no such overshoot. Similarly, for the neuron labeled AFD, all empirically recorded traces slowly ramp up over time. In contrast, the simulated traces are mostly flat. Furthermore, all empirical traces return to the pre-stimulus membrane potential, but many of the simulated voltage traces remain significantly depolarized, far outside of the ranges of empirically observed membrane potentials. The authors trained an additional GAN (EP-GAN Extended) to improve the fit to the resting membrane potential. Interestingly, for one neuron (AWB), this improved the response during stimulation, which now reproduced the slowly rising membrane potentials observed empirically; however, the neuron still does not reliably return to its resting membrane potential. For the other two neurons, the authors report a decrease in accuracy in comparison to EP-GAN. While such deviations may appear small in the root mean square error (RMSE), they likely indicate a large mismatch between the model and the electrophysiological properties of the biological neuron. The authors added a second metric during the revision: percentages of predicted membrane potential trajectories within empirical range. I appreciate this additional analysis. As the empirical ranges across neurons are far larger than the magnitude of dynamical properties of the response ('slow ramps', etc.), this metric doesn't seem to be well suited to quantify to which degree these dynamical properties are captured by the models.

      We made improvements to the training data generation and architecture of EP-GAN to improve its overall accuracy with predicted membrane potential responses. In particular, we divided training data generation into three neuron types found in C. elegans non-spiking neurons: 1) Transient outward rectifier, 2) Outward rectifier and 3) Bistable [8, 16]. Each randomly generated training sample is categorized into one of 3 types by evaluating its steady-state currents with respect to experimental dI/dV bound constraints (See generating training data section under Methods for more detail). The process is then followed by imposing minimum-maximum constraints on simulated membrane potential responses. The setup allows generations of training samples that are of closer distribution to experimentally recorded neurons. This is further described in Section Methods page 15 in the revised manuscript.

      We also improved the EP-GAN training process by incorporating random masking of input membrane potential responses. The masking forces EP-GAN to make predictions even with missing voltage traces, improving overall accuracy and allowing EP-GAN to use membrane potential inputs with arbitrary clamping protocol (see Methods page 13 for more detail). For the training loss functions, we further supplemented the membrane potential regression loss with errors computed for 2 intervals: pre- and post-stimulation time intervals to improve EP-GAN prediction capabilities for baseline membrane potentials.
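
      Random input masking of this kind can be sketched as below. This is an illustration under assumptions, not EP-GAN's implementation: zero-filling the hidden traces, returning a binary mask alongside them, and the function name are our choices.

```python
import random


def mask_traces(traces, keep_fraction, rng, fill=0.0):
    """Hide a random subset of voltage traces, keeping about `keep_fraction`.

    Returns the masked traces and a 0/1 mask marking which traces were kept.
    At least one trace is always kept.
    """
    n_keep = max(1, round(keep_fraction * len(traces)))
    kept = set(rng.sample(range(len(traces)), n_keep))
    masked = [
        list(trace) if i in kept else [fill] * len(trace)
        for i, trace in enumerate(traces)
    ]
    mask = [1 if i in kept else 0 for i in range(len(traces))]
    return masked, mask


# Four hypothetical traces (mV), one per injected current level.
traces = [[-40.0, -35.0], [-40.0, -30.0], [-40.0, -25.0], [-40.0, -20.0]]
masked, mask = mask_traces(traces, keep_fraction=0.5, rng=random.Random(1))
```

      Drawing a fresh mask for every training sample exposes the network to many combinations of missing traces, which is what would let it accept recordings made with a different clamping protocol at inference time.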

      Taken together, these modifications improved EP-GAN’s overall ability to capture empirical membrane potential responses, and we show the results in Figures 2–5 and Tables S2 and S3.

      Major 2: Comparison with other approaches is potentially misleading. Throughout the manuscript, the authors claim that their approach outperforms the other approaches tested. But compare the responses of the models in the present manuscript (neurons RIM, AFD, AIY) to the ones provided for the same neurons in Naudin et al. 2022 (https://doi.org/10.1371/journal.pone.0268380). Naudin et al. present models that seem to match empirical data far more accurately than any model presented in the current study. Naudin et al. achieved this using DEMO, an algorithm that in the present manuscript is consistently shown to be among the worst of all algorithms tested. I therefore strongly disagree with the authors' claim that a "Comparison of EP-GAN with existing estimation methods shows EP-GAN advantage in the accuracy of estimated parameters". This may be true in the context of the benchmark performed in the study (i.e., a condition of very limited compute resources - 18 generations with a population size of 600, compare that to 2000 generations recommended in Naudin et al.), but while EP-GAN wins under these specific conditions (and yes, here the authors convincingly show that their EP-GAN produces by far the best results!), other approaches seem to win with respect to the quality of the models they can ultimately generate.

      We thank the reviewer for the feedback regarding the comparison with existing methods. We have revised the optimization setup for existing methods (GDE3, NSDE, DEMO, and NSGA2) by incorporating steady-state response constraints as the initial selection process. The process is similar to that of EP-GAN training data generation and DEMO parameter selection process [16] (see Results section, page 6 for detail). Incorporating this process has improved the accuracy of existing methods especially for small HH-model scenarios where DEMO stood out with the best performance alongside NSGA2 (Figure 5, Table 1, 2).

      We also expanded the testing scenarios by evaluating all methods with respect to both small and large HH-model estimation. The small HH-model scenario estimates 47 parameters consisting of channel conductance, reversal potentials and initial conditions, with the channel parameters (n = 129) frozen to default values in [41]. The large HH-model scenario estimates the 129 channel parameters in addition to the 47 parameters by considering ±50% variations from their default values. For both small and large HH-model scenarios, we test total sample sizes of both 32k and 64k for all methods to evaluate their scalability with the number of simulated samples given during optimization. The results show that existing methods perform well in small HH-model scenarios and scale with sample size. EP-GAN, on the other hand, shows overall better performance in predicting membrane potential responses in both small and large HH-model scenarios.

      In particular, with the extended membrane potential error covering the pre-, mid-, and post-activation periods, the mean membrane potential response error of EP-GAN (trained with 32k samples, large HH-model, 9 neurons) was 2.82 mV, lower than that of DEMO (12.2 mV, 64k samples) trained on an identical setup (Table 2) and that of DEMO (7.78 mV, using 36,000k samples, 3 neurons) applied to the simpler HH-model in [16]. With respect to DEMO performance in [16], under an identical simulation protocol (i.e., no stimulation during (0, 5 s) and (10, 15 s), and stimulation during (5, 10 s)), the EP-GAN-predicted RIM (large HH-model) showed membrane potential accuracy on par with that of DEMO (simpler HH-model), and the EP-GAN-predicted AFD showed better accuracy for the post-activation membrane potential response, where DEMO-predicted membrane potentials overshoot above the baseline (not shown in the paper).

      Major 3: As long as the quality of the models generated by the EP-GAN cannot be significantly improved, I am doubtful that it indeed can contribute to the 'ElectroPhysiome', as it seems likely that dynamics that are currently poorly captured, like slow ramps, or the ability of the neuron to return to its resting membrane potential, will critically affect network computations. If the authors want to motivate their study based on this very ambitious goal, they should illustrate that single neuron model generation with their approach is robust enough to warrant well-constrained network dynamics. Based on the currently presented results, I find the framing of the manuscript far too bold.

      We thank the reviewer for the feedback regarding the paper's scope. With the revised methods, the overall quality of EP-GAN models is improved, with the most significant improvements in baseline membrane potential accuracy. While high-quality neuron models could be attained with existing methods given a sufficient sample size, our results suggest EP-GAN can predict models of enhanced quality with a significantly smaller sample size and without a need for retraining, thus addressing the main drawback of evolutionary-based methods. While EP-GAN still has limitations (e.g., difficulty in predicting slow ramps) that need to be addressed in the future, we believe its overall performance, combined with fast inference speed and flexibility in its input data format (e.g., missing membrane potential traces), is a step forward in large-scale neuron modeling tasks that can contribute to network models.

      Major 4: The conclusion of the ablation study 'In addition the architecture of EP-GAN permits inference of parameters even when partial membrane potential and steady-state currents profile are given as inputs' does not seem to be justified given the voltage traces shown in Figure 3. For example, for RIM, the resting membrane potential stays around 0 mV, but all empirical traces are around -40mV. For AFD, all simulated traces have a negative slope during the depolarizing stimuli, but a positive slope in all empirically observed traces. For AIY, the shape of hyperpolarized traces is off. While it may be that by their metric neurons in the 25% category are classified as 'preserving baseline accuracy', this doesn't seem justified given the voltage traces presented in the manuscript. It appears the metric is not strict enough.

      We improved EP-GAN’s training process by incorporating random masking of input membrane potential responses. The masking forces EP-GAN to make predictions even with missing voltage traces, improving overall accuracy and allowing EP-GAN to use membrane potential inputs with arbitrary clamping protocol.

      Such input masking during training has improved the ablation results: EP-GAN now retains its baseline membrane potential error (3.3mV, averaged across pre-, mid-, and post-activation periods) with as little as 50% of membrane potential inputs remaining (3.5mV) and as little as 25% of steady-state currents remaining (3.5mV).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The manuscript "Drosophila Visuomotor Integration: An Integrative Model and Behavioral Evidence of Visual Efference Copy" provides an integrative model of the visuomotor control in Drosophila melanogaster. This model presents an experimentally derived model based on visually evoked wingbeat pattern recordings of three strategically selected visual stimulus types with well-established behavioral response characteristics. By testing variations of these models, the authors demonstrate that the virtual model behavior can recapitulate the recorded wing beat behavioral results and those recorded by others for these specific stimuli when presented individually. Yet, the novelty of this study and their model is that it allows predictions for natural visual scenes in which multiple visual stimuli occur simultaneously and may have opposite or enhancing effects on behavior. Testing three models that would allow interactions of these visual modalities, the authors show that using a visual efference copy signal allows visual streams to interact, replicating behavior recorded when multiple stimuli are presented simultaneously. Importantly, they validated the prediction of this model in real flies using magnetically tethered flies, e.g., presenting moving bars with varying backgrounds. In conclusion, the presented manuscript presents a commendable effort in developing and demonstrating the validity of a mixture model that allows predictions of the behavior of Drosophila in natural visual environments.

      Strengths:

      Overall, the manuscript is well-structured and clear in its presentation, and the modeling and experimental research are methodically conducted and illustrated in visually appealing and easy-to-understand figures and their captions.

      The manuscript employs a thorough, logical approach, combining computational modeling with experimental behavioral validation using magnetically tethered flies. This iterative integration of simulation and empirical behavioral evidence enhances the credibility of the findings.

      The associated code base is well documented and readily produces all figures in the document.

      Suggestions:

      However, while the experiments provide evidence for the use of a visual efference copy, the manuscript would be even more impressive if it presented specific predictions for the neural implementation or even neurophysiological data to support this model. Or, at the very least, a thorough discussion. Nonetheless, these models and validating behavioral experiments make this a valuable contribution to the field; it is well executed and addresses a significant gap in the modeling of fly behavior and holistic understanding of visuomotor behaviors.

      We appreciate the reviewer’s thoughtful comments on the strengths and weaknesses of our manuscript. We agree that a biophysically realistic model reflecting the structure of the neural circuits, together with physiological data from them, would be invaluable. However, we are currently unable to provide physiological evidence for EC-based suppression, or a circuit architecture for efference copy-based suppression of the stability circuit, because the neural pathway underlying this behavior remains unidentified. Extensive recordings from the HS/VS system have revealed cell-type-specific motor-related inputs during both spontaneous and loom-evoked flight turns (Fenk et al., 2021; Kim et al., 2017, 2015). These studies predicted suppression of the optomotor stability response during such turns, and our new experiments confirmed this suppression specifically during loom-evoked turns (Figures 5, 6). However, these neurons are primarily involved in the head optomotor response, not the body optomotor response. We hope to extend our current model in future studies to incorporate more cellular-level detail, as the feedforward circuits underlying stability behavior become more clearly defined.

      Here are a few points that should be addressed:

      (1) The biomechanics block (Figure 2) should be elaborated on, to explain its relevance to behavior and relation to the underlying neural mechanisms.

      We appreciate this suggestion. The mathematical representation of the biomechanics block was developed by other groups in previous studies (Fry et al., 2003; Ristroph et al., 2010). We used exactly the same model, and its parameters were identical to those used in one of those studies (Fry et al., 2003; Ristroph et al., 2010), in which the parameters were estimated from the stabilizing response to magnetic “stumbling” pulses. The previous version of the manuscript described the biomechanics block in the Methods section (see Equation 4). In response to the reviewer’s comment, we have made a few changes to Figure 2A and expanded the associated description in the main text, as follows.

      (Line 160) “To test the orientation behavior of the model, we developed an expanded model, termed “virtual fly model” hereafter. In this model, we added a biomechanics block that transforms the torque response of the fly to the actual heading change according to kinematic parameters estimated previously (Michael H Dickinson, 2005; Ristroph et al., 2010) (Figure 2A, see Equation 4 in Methods and Movie S1). The virtual fly model, featuring position and velocity blocks that are conditioned on the type of the visual pattern, can now change its body orientation, simulating the visual orientation behavior of flies in the free flight condition.”
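
To make the role of a biomechanics block concrete, here is a minimal sketch of how a torque command could be turned into a heading change. Equation 4 of the manuscript is not reproduced in this response, so the sketch assumes a generic yaw model with inertia and linear damping, I·dω/dt = T − C·ω, integrated with a simple Euler step. The function name and all parameter values are placeholders, not the kinematic parameters used by the authors.

```python
def simulate_heading(torque, inertia, damping, dt, heading0=0.0, omega0=0.0):
    """Integrate an assumed yaw model: inertia * d(omega)/dt = torque - damping * omega.

    torque: sequence of torque commands, one per time step.
    Returns the heading trajectory and the final angular velocity.
    """
    heading, omega = heading0, omega0
    headings = []
    for tau in torque:
        # Angular acceleration from the net torque (command minus viscous drag).
        domega = (tau - damping * omega) / inertia
        omega += domega * dt
        heading += omega * dt
        headings.append(heading)
    return headings, omega


if __name__ == "__main__":
    # Placeholder values chosen only to show the qualitative behavior.
    headings, omega = simulate_heading([1.0] * 10000, inertia=1.0,
                                       damping=2.0, dt=0.001)
    print("final angular velocity:", omega)
```

Under this assumed model a constant torque produces an angular velocity that settles at torque divided by damping, with time constant inertia divided by damping, so the block acts as a low-pass filter between the motor command and the body rotation.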

      (2) It is unclear how the three integrative models with different strategies were chosen or what relevance they have to neural implementation. This should be explained and/or addressed.

      Thank you for this valuable comment. We selected the three models based on previous studies investigating visuomotor integration across multiple species, under conditions where multiple sensory cues are presented simultaneously.

      The addition-only model represents the simplest hypothesis, analogous to the “additive model” proposed by Tom Collett in his 1980 study (Collett, 1980). We used this model as a baseline to illustrate behavior in the absence of any efference copy mechanism. Notably, some modeling studies have proposed linear (additive) integration for multimodal sensory cues at the behavioral level (Liu et al., 2023; Van der Stoep et al., 2021). However, experimental evidence demonstrating strictly linear integration—either behaviorally or physiologically—remains limited. In our study, new data (Figure 5) show that bar-evoked and background movement-evoked locomotor responses are combined linearly, supporting the addition-only model.

      The graded efference copy model has been most clearly demonstrated in the cerebellum-like circuit of Mormyrid fish during electrosensation (Bell, 1981; Kennedy et al., 2014). In this system, the efference copy signal forms a negative image of the predicted reafferent input and undergoes plastic changes as the environment changes—an idea that inspired our modifiable efference copy model (Figure 4–figure supplement 1). The all-or-none efference copy model is exemplified in the sensory systems of smaller organisms, such as the auditory neurons of crickets during stridulation (Poulet and Hedwig, 2006). Notably, in crickets, the motor-related input is referred to as corollary discharge rather than efference copy. Typically, “efference copy” refers to a graded, subtractive motor-related signal, while “corollary discharge” denotes an all-or-none signal, both counteracting the sensory consequences of self-generated actions. In this manuscript, we use the term efference copy more broadly, encompassing both types of motor-related feedback signals (Sommer and Wurtz, 2008).

      In response to this comment, we have made the following changes in the main text to enhance its accessibility to general readers.

      (Line#268) “This integration problem has been studied across animal sensory systems, typically by analyzing motor-related signals observed in sensory neurons (Bell, 1981; Collett, 1980; Kim et al., 2017; Poulet and Hedwig, 2006). Building on the results of these studies, we developed three integrative models. The first model, termed the “addition-only model”, assumes that the outputs of the object (bar) and the background (grating) response circuits are summed to control the flight orientation (Figure 4B, see Equation 14 in Methods).”

      (Line#272) “In the second and third models, an EC is used to set priorities between different visuomotor circuits (Figure 4C,D). In particular, the EC is derived from the object-induced motor command and sent to the object response system to nullify visual input associated with the object-evoked turn (Bell, 1981; Collett, 1980; Poulet and Hedwig, 2006). These motor-related inputs fully suppress sensory processing in some systems (Poulet and Hedwig, 2006), whereas in others they selectively counteract only the undesirable components of the sensory feedback (Bell, 1981; Kennedy et al., 2014).”
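
The difference between the three integration strategies can be sketched in a few lines. This is a deliberately simplified illustration, not Equation 14 or the authors' model code: the function names, the scalar torque representation, the `turn_active` flag, and the `bg_gain` parameter are assumptions introduced here.

```python
def addition_only(bar_cmd, bg_cmd):
    """Addition-only: object and background commands are simply summed."""
    return bar_cmd + bg_cmd


def all_or_none_ec(bar_cmd, bg_cmd, turn_active):
    """All-or-none EC: the background response is gated off during an object-evoked turn."""
    return bar_cmd + (0.0 if turn_active else bg_cmd)


def graded_ec(bar_cmd, bg_slip, predicted_slip, bg_gain):
    """Graded EC: only the mismatch between actual and predicted background slip drives the response."""
    return bar_cmd + bg_gain * (bg_slip - predicted_slip)


if __name__ == "__main__":
    print(addition_only(1.0, -0.4))            # background opposes the turn
    print(all_or_none_ec(1.0, -0.4, True))     # background ignored during the turn
    print(graded_ec(1.0, -2.0, -2.0, 0.2))     # prediction matches, no correction
    print(graded_ec(1.0, -3.0, -2.0, 0.2))     # mismatch leaks into the command
```

In this toy form, the graded model predicts that an unexpected change in background-driven slip alters the turn, whereas the all-or-none model predicts no change while the turn is in progress, which is the behavioral contrast the responses below rely on.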

      (3) There should be a discussion of how the visual efference could be represented in the biological model and an evaluation of the plausibility and alternatives.

      Thank you for this helpful comment. We have now added the following discussion to share our perspective on the circuit-level implementation of the visual efference copy in Drosophila.

      (Line#481) “Efference copy in Drosophila vision

      Under natural conditions, various visual features in the environment may concurrently activate multiple motor programs. Because these may interfere with one another, it is crucial for the central brain to coordinate between the motor signals originating from different sensory circuits. Among such coordination mechanisms, the EC mechanisms were hypothesized to counteract so-called reafferent visual input, that is, input caused specifically by self-movement (Collett, 1980; von Holst and Mittelstaedt, 1950). Recent studies reported such EC-like signals in Drosophila visual neurons during spontaneous as well as loom-evoked flight turns (Fenk et al., 2021; Kim et al., 2017, 2015). One type of EC-like signal was identified in a group of wide-field visual motion-sensing neurons that were shown to control neck movement for gaze stability (Kim et al., 2017). The EC-like signals in these cells were bidirectional depending on the direction of flight turns, and their amplitudes were quantitatively tuned to those of the expected visual input across cell types. Although amplitude varies among cell types, it remains inconclusive whether it also varies within a given cell type to match the amplitude of expected visual feedback, thereby implementing the graded EC signal. A more recent study examined EC-like signal amplitude in the same visual neurons for loom-evoked turns, across events (Fenk et al., 2021). Although the result showed a strong correlation between wing response and the EC-like inputs, the authors pointed out that this apparent correlation could stem from noisy measurement of all-or-none motor-related inputs.

      Thus, these studies did not completely disambiguate between graded vs. all-or-none EC signaling. Another type of EC-like signal, observed in the visual circuit tuned to a moving spot, exhibited characteristics consistent with all-or-none EC. That is, it entirely suppressed visual signaling, irrespective of the direction of the self-generated turn (Kim et al., 2015; Turner et al., 2022).

      Efference-copy (EC)–like signals have been reported in several Drosophila visual circuits, yet their behavioral role remains unclear. Indirect evidence comes from a behavioral study showing that the dynamics of spontaneously generated flight turns were unaffected by unexpected background motion (Bender and Dickinson, 2006a). Likewise, our behavioral experiments showed that, during loom-evoked turns, responses to background motion are suppressed in an all-or-none manner (Figures 6 and 7). Consistent with this, motor-related inputs recorded in visual neurons exhibit nearly identical dynamics during spontaneous and loom-evoked turns (Fenk et al., 2021). Together, these behavioral and physiological parallels support the idea that a common efference-copy mechanism operates during both spontaneous and loom-evoked flight turns.

      Unlike loom-evoked turns, bar-evoked turn dynamics changed in the presence of moving backgrounds (Figure 5), a result compatible with both the addition-only and graded EC models. However, when the static background was updated just before a bar-evoked turn—thereby altering the amplitude of optic flow—the turn dynamics remained unaffected (Figures 5 and 7), clearly contradicting the addition-only model. Thus, the graded EC model is the only one consistent with both findings. If a graded EC mechanism were truly at work, however, an unexpected background change should have modified turn dynamics because of the mismatch between expected and actual visual feedback (Figure 4–figure supplement 1)—yet we detected no such effect at any time scale examined (Figure 7–figure supplement 1). This mismatch would be ignored only if the amplitude of the graded EC adapted to environmental changes almost instantaneously—a mechanism that seems improbable given the limited computational capacity of the Drosophila brain. In electric fish, for example, comparable adjustments take more than 10 minutes (Bell, 1981; Muller et al., 2019). Further investigation is needed to clarify how reorienting flies ignore optic flow generated by static backgrounds, potentially by engaging EC mechanisms not captured by the models tested in this study.

      Why would Drosophila rely on the all-or-none EC mechanism instead of the graded one for loom-evoked turns? A graded EC must be adjusted adaptively depending on the environment, as the amplitude of visual feedback varies with both the dynamics of self-generated movement and environmental conditions (e.g., empty vs. cluttered visual backgrounds) (Figure 4—figure supplement 1). Recent studies on electric fish have suggested that a large array of neurons in a multi-layer network is crucial for generating a modifiable efference copy signal matched to the current environment (Muller et al., 2019). Given their small-sized brain, flies might opt for a more economical design for suppressing unwanted visual inputs regardless of the visual environment. Circuits mediating such a type of EC were identified in the cricket auditory system during stridulation (Poulet and Hedwig, 2006), for example. Our study strongly suggests the existence of a similar circuit in the Drosophila visual system. 

      We tested the hypothesis that efference-copy (EC) signals guide action selection by suppressing specific visuomotor reflexes when multiple visual features compete. An alternative motif with a similar function is mutual inhibition between motor pathways (Edwards, 1991; Mysore and Kothari, 2020). In Drosophila, descending neurons form dense lateral connections (Braun et al., 2024), offering a substrate for such competitive interactions. Determining whether—and how—EC and mutual inhibition operate will require recordings from the neurons that ensure visual stability, which remain unidentified. Mapping these pathways and assessing how they are modulated by visual and behavioral context are important goals for future work.”

      Reviewer #2 (Public Review):

      It has been widely proposed that the neural circuit uses a copy of motor command, an efference copy, to cancel out self-generated sensory stimuli so that intended movement is not disturbed by the reafferent sensory inputs. However, how quantitatively such an efference copy suppresses sensory inputs is unknown. Here, Canelo et al. tried to demonstrate that an efference copy operates in an all-or-none manner and that its amplitude is independent of the amplitude of the sensory signal to be suppressed. Understanding the nature of such an efference copy is important because animals generally move during sensory processing, and the movement would devastatingly distort that without a proper correction. The manuscript is concise and written very clearly. However, experiments do not directly demonstrate if the animal indeed uses an efference copy in the presented visual paradigms and if such a signal is indeed non-scaled. As it is, it is not clear if the suppression of behavioral response to the visual background is due to the act of an efference copy (a copy of motor command) or due to an alternative, more global inhibitory mechanism, such as feedforward inhibition at the sensory level or attentional modulation. To directly uncover the nature of an efference copy, physiological experiments are necessary. If that is technically challenging, it requires finding a behavioral signature that unambiguously reports a (copy of) motor command and quantifying the nature of that behavior.

      We thank the reviewer for this insightful and constructive comment. We agree that our current behavioral evidence does not directly identify the underlying circuit mechanism, and that direct recordings from visual neurons modulated by an efference copy would be critical for distinguishing between potential mechanisms.

      A prerequisite for such physiological investigations would be the identification of both (1) the feedforward neurons directly involved in the optomotor response, and (2) the neurons conveying motor-related signals to the optomotor circuit. Despite efforts by several research groups, the location of the feedforward circuit mediating the optomotor response remains elusive. This limitation has prevented us from obtaining direct cellular evidence of flight turn-associated suppression of optomotor signaling.

      In light of the reviewer’s suggestion, we expanded our investigation to strengthen the behavioral evidence for efference copy (EC) mechanisms. In addition to our earlier experiments involving unexpected changes in the static background, we examined how object-evoked flight turns influence the optomotor stability reflex and vice versa (Figures 5 and 6). To quantify the interaction between different visuomotor behaviors, we systematically varied the temporal relationship between two types of visual motion—loom versus moving background, or moving bar versus moving background—and measured the resulting behavioral responses.

      Our findings support pattern- and time-specific suppressive mechanisms acting between flight turns associated with the different visual patterns. Specifically:

      The responses to a moving bar and a moving background add linearly, even when presented in close temporal proximity.

      Loom-evoked turns and the optomotor stability reflex mutually suppress each other in a time-specific manner.

      For both loom- and moving bar-evoked flight turns, changes in the static background had no measurable effect on the dynamics of the object-evoked responses.

      These results provide a detailed behavioral characterization of a suppressive interaction between distinct visuomotor responses. This, in turn, offers correlative evidence supporting the involvement of an efference copy-like mechanism acting on the visual system. While similar efference copy mechanisms have been documented in other parts of the visual system, we acknowledge that our findings do not exclude alternative explanations. In particular, it is still possible that lateral inhibition within the central brain or ventral nerve cord contributes to the suppression we observed.

      Ultimately, definitive proof will require identifying the specific neurons that convey efference copy signals and demonstrating that silencing these neurons abolishes the behavioral suppression. Until such experiments are feasible, our behavioral approach provides an important contribution toward understanding the nature of sensorimotor integration in this system.

      Reviewer #3 (Public Review):

      Summary:

      Canelo et al. used a combination of mathematical modeling and behavioral experiments to ask whether flies use an all-or-none EC model or a graded EC model (in which the turn amplitude is modulated by wide-field optic flow). Particularly, the authors focus on the bar-ground discrimination problem, which has received significant attention in flies over the last 50-60 years. First, they use a model by Poggio and Reichardt to model flight response to moving small-field bars and spots and wide-field gratings. They then simulate this model and compare simulation results to flight responses in a yaw-free tether and find generally good agreement. They then ask how flies may do bar-background discrimination (i.e. complex visual environment) and invoke different EC models and an additive model (balancing torque production due to background and bar movement). Using behavioral experiments and simulation supports the notion that flies use an all-or-none EC since flight turns are not influenced by the background optic flow. While the study is interesting, there are major issues with the conceptual framework.

      Strengths:

      They ask a significant question related to efference copies during volitional movement.

      The methods are well detailed and the data (and statistics) are presented clearly.

      The integration of behavioral experiments and mathematical modeling of flight behavior.

      The figures are overall very clear and salient.

      Weaknesses:

      Omission of saccades: While the authors ask a significant question related to the mechanism of bar-ground discrimination, they fail to integrate an essential component of the Drosophila visuomotor responses: saccades. Indeed, the Poggio and Reichardt model, which was developed almost 50 years ago, while appropriate to study body-fixed flight, has a severe limitation: it does not consider saccades. The authors identify this major issue in the Discussion by citing a recent switched, integrate-and-fire model (Mongeau & Frye, 2017). The authors admit that they "approximated" this model as a smooth pursuit movement. However, I disagree that it is an approximation; rather it is an omission of a motor program that is critical for volitional visuomotor behavior. Indeed, saccades are the main strategy by which Drosophila turn in free flight and prior to landing on an object (i.e. akin to a bar), as reported by the Dickinson group (Censi et al., van Breugel & Dickinson [not cited]). Flies appear to solve the bar-ground discrimination problem by switching between smooth movement and saccades (Mongeau & Frye, 2017; Mongeau et al., 2019 [not cited]). Thus, ignoring saccades is a major issue with the current study as it makes their model disconnected from flight behavior, which has been studied in a more natural context since the work of Poggio.

      Thank you for this helpful comment. We agree that including saccadic turns is essential and qualitatively improves the model. In the revised manuscript, we therefore expanded our bar-tracking model to incorporate an integrate-and-saccade strategy, now presented in Figure 2—figure supplement

      The manuscript now introduces this result as follows:

      (Line#190) “Finally, one important feature of the locomotion dynamics of a flying Drosophila tracking an object is a rapid orientation change, called a “saccade” (Breugel and Dickinson, 2012; Censi et al., 2013; Heisenberg and Wolf, 1979). For example, while tracking a slowly moving bar, flies perform relatively straight flights interspersed with saccadic flight turns (Collett and Land, 1975; Mongeau and Frye, 2017). During this behavior, it has been proposed that visual circuits compute an integrated error of the bar position with respect to the frontal midline and trigger a saccadic turn toward the bar when the integrated value reaches a threshold (Frighetto and Frye, 2023; Mongeau et al., 2019; Mongeau and Frye, 2017). We expanded our bar fixation model to incorporate this behavioral strategy (Figure 2--figure supplement 2). The overall structure of the modified model is akin to the one proposed in a previous study (Mongeau and Frye, 2017), and the amplitude of a saccadic turn was determined by the sum of the position and velocity functions (Figure 2--figure supplement 2A; see Equation 13 in Methods). When simulated, our model successfully reproduced experimental observations of saccade dynamics across different object velocities (Figure 2--figure supplement 2B-D) (Mongeau and Frye, 2017). Together, our models faithfully recapitulated the results of previous behavioral observations in response to singly presented visual patterns (Collett, 1980; Götz, 1987; H. Kim et al., 2023; Maimon et al., 2008; Mongeau and Frye, 2017).”
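
The integrate-and-saccade strategy quoted above can be sketched as a short simulation loop. This is an illustrative reading of the verbal description only: Equation 13 is not reproduced here, so the amplitude function is passed in as a placeholder (`amp_fn`), and resetting the integrator after each saccade is an assumption of this sketch.

```python
def integrate_and_saccade(bar_angles, dt, threshold, amp_fn):
    """Track a bar with threshold-triggered saccades (hypothetical sketch).

    bar_angles: bar position (deg) at each time step, in world coordinates.
    threshold: integrated error (deg * s) at which a saccade is triggered.
    amp_fn: placeholder mapping (position error, error velocity) to saccade amplitude.
    Returns the final heading and a list of (step index, amplitude) saccades.
    """
    heading, integral = 0.0, 0.0
    prev_error = None
    saccades = []
    for step, bar in enumerate(bar_angles):
        error = bar - heading  # bar position relative to the frontal midline
        velocity = 0.0 if prev_error is None else (error - prev_error) / dt
        integral += error * dt
        if abs(integral) >= threshold:
            amplitude = amp_fn(error, velocity)
            heading += amplitude
            saccades.append((step, amplitude))
            integral = 0.0      # assumption: integrator resets after a saccade
            prev_error = None   # avoid a spurious velocity from the heading jump
        else:
            prev_error = error
    return heading, saccades


if __name__ == "__main__":
    # Stationary bar at 10 deg; amplitude equals the position error (placeholder choice).
    heading, saccades = integrate_and_saccade([10.0] * 50, dt=0.01,
                                              threshold=0.95,
                                              amp_fn=lambda e, v: e)
    print(heading, saccades)
```

With a slowly moving bar the integrator takes longer to reach threshold, so saccades become less frequent, which is the qualitative dependence on object velocity that the quoted passage refers to.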

      Apart from Figures 1 and 2, most of our data—whether from simulations or behavioral experiments—use brief visual patterns lasting 200 ms or less. These stimuli trigger a single, rapid orientation change reminiscent of a saccadic flight turn. In this part of the paper, we essentially have examined how multiple visuomotor pathways interact to determine the direction of object-evoked turns when several visual patterns occur simultaneously.

      Critically, recent work showed that a group of columnar neurons (T3) appear specialized for saccadic bar tracking through integrate-and-fire computations, supporting the notion of parallel visual circuits for saccades and smooth movement (Frighetto & Frye, 2023 [not cited]).

      Thanks for bringing up this critical issue. We have now added this paper in the following part of the manuscript.

      (Line#193) “During this behavior, it has been proposed that visual circuits compute an integrated error of the horizontal bar position with respect to the frontal midline and trigger a saccadic turn toward the bar when the integrated value reaches a threshold (Frighetto and Frye, 2023; Mongeau and Frye, 2017).”

      (Line#462) “Visual systems extract features from the environment by calculating spatiotemporal relationships of neural activities within an array of photoreceptors. In Drosophila, these calculations occur initially on a local scale in the peripheral layers of the optic lobe (Frighetto and Frye, 2023; Gruntman et al., 2018; Ketkar et al., 2020).”

      A major theme of this work is bar fixation, yet recent work showed that in the presence of proprioceptive feedback, flies do not actually center a bar (Rimniceanu & Frye, 2023). Furthermore, the same study found that yaw-free flies do not smoothly track bars but instead generate saccades. Thus prior work is in direct conflict with the work here. This is a major issue that requires more engagement by the authors.

      Thank you for your thoughtful comments and for drawing our attention to this important paper. In our experiments, bar fixation on oscillating vertical objects emerges during the “alignment” phase of the magneto-tether protocol. The pattern movement dynamics were similar to those used by Rimniceanu & Frye (2023), yet the two studies differ in a key respect: Rimniceanu & Frye employed a motion-defined bar, whereas we presented a dark vertical bar against a uniform or random-dot background. The alignment success rate, defined as the proportion of trials in which the fly’s body angle is within ±25° of the target, was about 50% (data not shown). Our alignment pattern consisted of three vertical stripes spanning ~40° horizontally; when we replaced it with a single, narrower stripe, the success rate decreased (data not shown). These observations suggest that bar fixation in the magnetically tethered assay is less robust than in the rigid-tethered assay, although flies still orient toward highly salient vertical objects.
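
The alignment success rate defined above is straightforward to compute once angular wrap-around is handled. The sketch below is only an illustration of that definition; the function names and the example angles are hypothetical and are not the authors' data or analysis code.

```python
def angular_difference(a, b):
    """Signed difference a - b in degrees, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def alignment_success_rate(body_angles, target_angle, tolerance=25.0):
    """Proportion of trials whose body angle lies within +/- tolerance of the target."""
    hits = sum(1 for a in body_angles
               if abs(angular_difference(a, target_angle)) <= tolerance)
    return hits / len(body_angles)


if __name__ == "__main__":
    # Made-up end-of-alignment body angles (deg); 350 deg is only 10 deg from a target at 0.
    print(alignment_success_rate([0.0, 10.0, 350.0, 90.0], target_angle=0.0))
```

Wrapping the difference matters because a fly at 350° is only 10° away from a target at 0°, and a naive subtraction would count that trial as a failure.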

      We also observed that bar-evoked turns were elicited more reliably when the bar moved rapidly (45° in 200 ms) in the magneto-tether assay, although the turn magnitude was significantly smaller than the actual bar displacement (Figure 3).

      In response to the reviewer’s comment, we have now added the following description of the bar fixation behavior to the paper, citing Rimniceanu & Frye (2023).

      (Line#239) “Another potential explanation arises from recent studies demonstrating that proprioceptive feedback provided during flight turns in a magnetically tethered assay strongly dampens the amplitude of wing and head responses (Cellini and Mongeau, 2022; Rimniceanu et al., 2023).”

      Relevance of the EC model: EC-related studies by the authors linked cancellation signals to saccades (Kim et al., 2014 & 2017). Puzzlingly, the authors applied an EC model to smooth movement, when the authors' own work showed that smooth course-stabilizing flight turns do not receive cancellation signals (Fenk et al., 2021). Thus, in Fig. 4C, based on the state of the field, the efference copy signal should originate from the torque commands to initiate saccades, and not from torque to generate smooth movement. As this group previously showed, cancellation signals are quantitatively tuned to the expected visual input during saccades. Importantly, this tuning would be to the anticipated saccadic turn optic flow. Thus the authors' results supporting an all-or-none model appear in direct conflict with the authors' previous work. Further, the addition-only model is not particularly helpful as it has already been refuted by behavioral experiments (Rimniceanu & Frye, Mongeau & Frye).

      Thank you for this constructive comment. Efference copy is best established for brief, discrete actions like flight saccades. While motor-related modulation of visual processing has been reported across short- and long-duration behaviours (Chiappe et al., 2010; Fujiwara et al., 2017; Kim et al., 2015, 2017; Maimon et al., 2010; Turner et al., 2022), only flight saccade-associated signals exhibit the temporal profile appropriate to cancel reafferent input. That said, von Holst & Mittelstaedt (1950) originally formulated efference copy to explain the smooth optomotor response of hoverflies. In HS/VS recordings from previous studies, we could not detect membrane-potential changes tied to baseline wing-beat amplitude (data not shown), but further work is needed.

      Note that visually evoked flight turns analyzed in this paper have relatively fast dynamics. Fenk et al. (2021) showed that HS cells carry EC-like motor signals during both loom-evoked turns and spontaneous saccades. Building on this, we tested whether object-evoked rapid turns modulate other visuomotor pathways. Although Fenk et al. also found that optomotor turns lack motor input to HS cells, the authors did not test whether the optomotor pathway suppresses other reflexes, such as loom-evoked turns. Our new behavioral data (Figure 6) show that optomotor turns indeed suppress loom-evoked turns, suggesting a potential EC signal arising from the optomotor pathway that inhibits loom-responsive visual neurons.

      In Kim et al. (2017), the authors argued that HS/VS neurons receive a “quantitatively tuned” efference copy that varies across cell types: yaw-sensitive LPTCs are strongly suppressed, roll-sensitive cells receive intermediate input, and pitch-sensitive cells receive little or none. We also showed that when the amplitude of ongoing visual drive changes, the amplitude of saccade-related potentials (SRPs) scales linearly. This proportionality does not imply a genuinely graded EC, however, because SRP amplitude could vary solely through changes in driving force (Vm – Vrest) with a fixed EC conductance. Crucially, SRPs do not fully suppress feed-forward visual signalling, arguing against an all-or-none EC mechanism.

      How, then, can the cellular and behavioural data be reconciled? Silencing HS/VS neurons—or their primary inputs, the T4/T5 neurons—does not markedly diminish the optomotor response in flight (Fenk et al., 2014; Kim et al., 2017), indicating the presence of additional, as-yet-unidentified pathways.

      Physiological recordings from other visual neurons that drive the optomotor response in flying Drosophila are therefore needed to determine how strongly they are suppressed during loom-evoked turns.

      Behavioral evidence for all-or-none EC model: The authors state "unless the stability reflex is suppressed during the flies' object evoked turns, the turns should slow down more strongly with the dense background than the sparse one". This hypothesis is based on the fact that the optomotor response magnitude is larger with a denser background, as would be predicted by an EMD model (because there are more pixels projected onto the eye). However, based on the authors' previous work, the EC should be tuned to optic flow and thus the turning velocity (or amplitude). Thus the EC need not be directly tied to the background statistics, as they claim. For instance, I think it would be important to distinguish whether a mismatch in reafferent velocity (optic flow) links to distinct turn velocities (and thus position). This would require moving the background at different velocities (co- and anti-directionally) at the onset of bar motion. Overall, there are alternative hypotheses here that need to be discussed and more fully explored (as presented by Bender & Dickinson and in work by the Maimon group).

      We appreciate the reviewer’s important suggestion. In response, we performed the recommended experiment. In Figures 5 and 6 of the revised manuscript, we now present how bar- or loom-evoked flight turns affect the response to a moving background pattern. These experiments revealed that bar-evoked turns do not suppress the optic flow response, whereas loom-evoked turns strongly suppress it. Specifically, when background motion began 100 ms after the onset of loom expansion, the response to the background was significantly suppressed. Weak residual responses to the background motion were observed in this case, but these could reflect background motion occurring outside the suppression interval, whose duration may match that of the flight turns (Figure 6C,D).

      The lack of suppression of the optic flow response during and after bar-evoked turns appears to suggest that the responses are added linearly (Figure 5), seemingly contradicting the lack of dynamic change when the background dot density was altered (Figure 7, Figure 7–figure supplement 1). That is, the experimental result in Figure 5 supports either an addition-only or a graded efference copy (EC) model. However, the result in Figure 7 supports an all-or-none EC model. If a graded EC were used, the amplitude of the EC should be updated almost instantaneously when the static background changes.
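
      The distinction drawn above between addition-only, graded, and all-or-none EC models can be illustrated with a toy steady-state calculation. This is only a sketch under assumed simplifications, not the model from the manuscript: the function name, the linear feedback form, and the parameters `background_gain` (the strength of the stability reflex, assumed to grow with dot density) and `calibrated_gain` (the background strength that a graded EC was tuned to cancel) are all hypothetical.

```python
def steady_state_turn(drive, background_gain, model, calibrated_gain=0.0):
    """Steady-state turn velocity v under three hypothetical schemes.

    Assumed toy dynamics: v = drive - background_gain * v + ec, where the
    stability reflex opposes the turn in proportion to reafferent optic flow
    and ec is the efference-copy term that cancels part of it.
    """
    if model == "addition_only":
        # No efference copy: the stability reflex always opposes the turn.
        return drive / (1.0 + background_gain)
    if model == "graded_ec":
        # EC cancels the reafference expected for a background of strength
        # calibrated_gain; any mismatch with the actual background remains.
        return drive / (1.0 + background_gain - calibrated_gain)
    if model == "all_or_none_ec":
        # The stability reflex is gated off entirely during the turn.
        return drive
    raise ValueError(f"unknown model: {model}")


sparse, dense = 0.25, 0.75  # hypothetical reflex gains for two dot densities

for name in ("addition_only", "all_or_none_ec"):
    print(name, steady_state_turn(1.0, sparse, name), steady_state_turn(1.0, dense, name))

# A graded EC tuned to the sparse background but tested on the dense one.
print("graded_ec (stale)", steady_state_turn(1.0, dense, "graded_ec", calibrated_gain=sparse))
```

      In this sketch, turn velocity depends on background density under the addition-only scheme and under a graded EC that has not been updated, but not under the all-or-none scheme, which is the pattern of reasoning used to interpret Figure 7.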

      Another possibility is that the optic flow during self-generated turns in a static background is extremely weak compared to the optic flow input generated by physically moving the pattern, perhaps due to the rapid nature of head movements. Indeed, detailed kinematic analysis of head movement during spontaneous saccades in blow flies revealed that the head reaches the target angle before the body completes the orientation change, making the effective speed of reafferent optic flow higher than the speed of body rotation (Hateren and Schilstra, 1999). To test these hypotheses, further experiments will be needed for bar-evoked flight turns.

      Publishing the reviewed preprint:

      (1) The Reviewed Preprint (including the full text of the preprint we reviewed, the eLife assessment, and public reviews) will typically be published in two weeks' time.

      Please let us know if you would like to provide provisional author responses to be posted at the same time (if so, please send these by email). Please do not resubmit within the next two/three weeks, as we will need to publish the first version of the Reviewed Preprint first.

      If there are any factual errors in the eLife assessment or public reviews, or other issues we should be aware of, please let us know as soon as possible.

      (2) After publication of the Reviewed Preprint, you can use the link below to submit a revised version. There is no deadline to resubmit. Before resubmitting, please ensure that you update the preprint at the preprint server to correspond with the revised version. Upon submitting a revised version, we will ask the editors and reviewers if it's appropriate to update their assessment and public reviews, which will be included alongside the revised Reviewed Preprint. At that time we will also post the recommendations to the authors and the author responses you provide with the revised version. In the author response, please respond to the public reviews (where relevant) and the recommendations to the authors.

      (3) Alternatively, you can proceed with the current version of the Reviewed Preprint (once published), without revisions, and request an eLife Version of Record. See the Author Guide for further information: https://elife-rp.msubmit.net/html/elife-rp_author_instructions.html#vor. However, most authors decide to request a Version of Record after a round of revision.

      (4) After publication of eLife's Reviewed Preprint, you also have the option to submit/publish in another journal instead: if you choose to do this, please let us know so we can update our records.

      The reviewers identified two key revisions that could improve the assessment of the paper:

      (1) Consideration of saccades within the model framework (outlined by reviewer 3).

      (2) Addition of physiology data to support the conclusions of the paper (outlined by reviewer 2). If this is not feasible within the timescale of revisions, the paper would need to be revised to clarify that the model leads to a hypothesis that would need to be tested with future physiology experiments.

      Thank you for these comments.

      Regarding revision point #1, we have added Figure 2–figure supplement 2, where we incorporated our position-velocity model (estimated in Figure 1) into the framework of the integrate-and-saccade model. A detailed description of this model is now provided in the main text (Lines 190–203).

      For revision point #2, obtaining electrophysiological evidence for efference copy remains challenging, as neither the visual neurons nor the efference-copy neuron has been identified for the wing optomotor response. As suggested by the reviewers, we have revised the title of the paper to reduce emphasis on efference copy and have noted electrophysiological recordings as a direction for future work.

      old title: A visual efference copy-based navigation algorithm in Drosophila for complex visual environments

      new title: Integrative models of visually guided steering in Drosophila

      Specific recommendations are detailed below.

      Reviewer #2 (Recommendations For The Authors):

      To directly demonstrate if an efference copy is non-scaled, the following experiments can be helpful: record from HS/VS cells and examine the relation between the amplitude of the saccade-suppression signal vs. saccade amplitude.

      Thanks for raising this important point. We previously carried out the suggested analysis for loom-evoked saccades in Fenk et al. (2021). There, significant correlations emerged between wing-response amplitude and saccade-related potentials (Figures 2F and 3C). However, we did not interpret the strong correlation (r ≈ 0.8) as evidence for a graded efference copy, because the amplitude of saccade-related potentials appeared to be bimodal. Upon presentation of the looming stimulus, flies either executed large evasive turns or showed minimal changes in wing-stroke amplitude. Large wing responses were accompanied by strong, saturated suppression of HS-cell membrane potential, whereas trials without wing responses produced only weak modulations—reflected in the bimodal distribution of saccade-related potential amplitudes (Figure 3C). 

      Importantly, in rigidly tethered preparations—where these potentials are typically measured—the absence of proprioceptive feedback can itself drive wingbeat amplitudes to saturation during saccades. We therefore reasoned that the lack of intermediate-sized flight saccades would naturally yield correspondingly saturated saccade-related potentials, even if a graded EC system is in play. 

      In Kim et al. (2017), we also performed a comprehensive analysis of spontaneous saccade-related potentials across all HS/VS cell types. When we later examined the relationship between saccade amplitude and the corresponding saccade-related potentials in each cell type, we could not find any statistically significant correlation (unpublished data).

      measure how much a weak visual stimulus and a strong visual stimulus are suppressed by the suppression signal. If the signal is non-scaled, visual stimuli should always be suppressed independently of their intensities.

      Thank you for this important suggestion. As mentioned in our response to the previous comment, we believe it is not feasible to record from neurons responsible for the body optomotor response at this point, as their identity remains unknown. Regarding the HS/VS cells, our previous study showed that HS cells are not always fully suppressed. The changes in saccade-related potential amplitude can be described as a linear function of the pre-saccadic visually-evoked membrane potential (Figure 7 in Kim et al., 2017). 

      As suggested by Fenk et al. 2014 (doi: 10.1016/j.cub.2014.10.042), HS cells might also be responsive to a moving bar. If that is the case, and if you present a bar and background (either sparse or dense) in a closed-loop manner to a head-fixed fly, HS cells might be sensitive only to the bar but not to the background (independently of the density).

      Thanks for pointing out this important issue. HS cells indeed respond strongly to the horizontal movement of a vertical bar, as expected given that their receptive fields are formed by the integration of local optic flow vectors. In one of our previous studies (Supplemental Figure 1 in Kim et al., 2015), we showed that the response amplitude to a single vertical bar is roughly equivalent to that elicited by a vertical grating composed of 12 bars of the same size. Therefore, we believe that HS cells are likely to contribute to the head response to a moving vertical bar. In a body-fixed flight simulator, HS cells would respond only to the bar if the bar runs in a closed loop with a static background. In this scenario, HS cells are likely to play a role in the head optomotor response.

      Note also that the role of HS cells in the wing optomotor response remains unresolved. Unilateral activation of HS cells has been shown to elicit locomotor turns in walking Drosophila (Fujiwara et al., 2017), as well as in flying individuals (unpublished data from our lab). However, a previous study also showed that strong silencing of HS/VS cells significantly reduced the head optomotor response, but not the wing optomotor response (Kim et al., 2017).

      If neurophysiology is technically challenging, an alternative way might pay attention to a head movement that exclusively follows the background (Fox et al., 2014 (doi: 10.1242/jeb.080192)). Because HS cells are thought to promote head rotation to background motion, a non-scaled suppression signal on HS cells would always suppress the head rotation independently of the background density.

      Thanks for this helpful comment. We have analyzed head movements during bar-evoked flight turns (Figure 7–figure supplement 1B) and found no significant changes across different background dot densities. This suggests that HS cells are unlikely to receive suppressive inputs during bar-evoked turns, akin to the lack of modulation during optomotor turns.

      Another way to separate a potential efference copy from other mechanisms (more global inhibition) is the directionality. A global inhibition would suppress the response to the background even if the background moves in the same direction as self-motion, but the efference copy would not.

      Thanks for this important point. In Heisenberg and Wolf, 1979, it was proposed that modulation might be bidirectional, with behavioral effects observed only for perturbations in the “unexpected” direction. In our new data on loom-evoked turns (Figure 6), the suppression appears equally strong for background motion in either direction, supporting an all-or-none suppression mechanism.

      Besides, in general, it is unclear if you think an efference copy operates both in smooth pursuits and saccades or if such a signal is only present during saccades. Your previous neurophysiological work supports the latter. Are your behavioral results consistent with the previous saccade suppression idea, or do you propose a new type of efference copy that also operates in smooth pursuits?

      Thanks for raising this important point. von Holst and Mittelstaedt (1950) originally introduced the concept of efference copy to explain the smooth optomotor response. We previously analyzed electrophysiological recordings from HS cells for membrane-potential changes associated with slow deviations in wing-steering angle but found none. However, this negative result does not entirely rule out modulation of visual processing during smooth flight turns, given the slow drift in membrane potential observed in most whole-cell recordings.

      In this study, we examined only the interactions among visuomotor pathways during these rapid flight turns, as the dynamics of visually evoked turns are almost as rapid as spontaneous saccades. Our data reveal that interactions between distinct visuomotor reflexes are more diverse than previously appreciated.

      Minor comments:

      Line 108, 109: match the description between here and the labels in Fig. 1F.

      Thank you for indicating this issue. We have defined the general equation to obtain the position and velocity components in the main text (lines 108-109); however, because of a slight asymmetry in the data (Fig. 1E), we used the approach indicated in Fig. 1F and explained in lines 113-117.

      Fig.1 F: If the position-dependent component is due to fatigue, the tuning curve's shape is likely changed (shrunk or extended) depending on the stimulus speed. How can you generalize the tuning curve shown here? Does the result hold even if the stimulus speed/contrast/spatial frequency is changed?

      We appreciate this comment. We believe that fatigue may explain the significant decay in the wing response to the grating stimulus (Fig. 1E). As you mention, increasing the stimulus speed would increase the amplitude of the fly’s response up to a saturation point. We addressed this in our model by multiplying the derived value by the angular velocity of the grating.

      Regarding contrast and spatial frequency, we did not test them experimentally. Instead, we simulated our model with varying visual feedback (Fig. 4A, B), which can be interpreted as increasing or decreasing the contrast of a grating. An increase in contrast would increase the fly's response to the grating and thus contribute to dampening the response to the foreground object (Fig. 4C).

      Line 233-255: Here, the description sounds like you will consider several parallel objects (e.g., two stripes) in the visual field instead of the combination of the figure and background (which is referred to in the following paragraph).

      Thank you for pointing it out. Indeed it was slightly ambiguous. We have addressed this by explaining the specific situation of a combination of an object and the background in lines 231-233.

      Figure 6C: you kept the foreground visual field between sparse and dense random dot backgrounds to keep the bar's saliency. Are you sure that this does not influence the difference in the fly's response to these two backgrounds (in Figure 6B)?

      This is a good point that we have also discussed internally. We also carried out similar experiments with a fully covered background and found no significant differences (Figure 7–figure supplement 1).

      Reviewer #3 (Recommendations For The Authors):

      Identify and analyze flight saccade dynamics in the raw trajectories (e.g., Fig. 3B). There should be some since the bar is near the 'sweet spot' for triggering saccades (see Mongeau & Frye, 2017).

      Thank you for bringing up this interesting point. Previous work reported that flies fixate a vertical bar through saccadic turns rather than smooth tracking (Mongeau & Frye, 2017). When the bar was narrow (<15 deg), there was barely one saccade per second (Mongeau & Frye, 2017, Fig. 4). In our magnetic-tether assay (Fig. 3A, B), the object width was 11.25 degrees and the object moved only for a short time window, so the fly generated only the saccade associated with the onset of object motion. Small turns of a few degrees, which are likely related to small perturbations, cannot be considered saccades when compared with those previously reported (Mongeau & Frye, 2017). Additionally, in our protocol (Fig. 3A), only a single object moved against an empty background from the onset time (‘go’ mark), so in principle there is no trigger for a switch to a smooth movement. We addressed this in lines x-x.

      Consider updating the Poggio model with flight saccades (switched, integrate-and-fire).

      We appreciate this suggestion. Following previous work (Mongeau et al., 2017), we expanded our model to include a saccade mechanism: the torque produced by the summed position- and velocity-dependent components is now replaced by an integrate-and-fire saccade (Figure 2—figure supplement 2). We optimized the saccade interval and amplitude so that both vary linearly with stimulus amplitude and faithfully reproduce the kinematic properties reported previously (Mongeau et al., 2017).  
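
      As a rough illustration of the integrate-and-fire saccade idea described above, the following minimal sketch accumulates the bar's angular error relative to the frontal midline and triggers a turn toward the bar when the accumulated value crosses a threshold. It is not the fitted model of Figure 2–figure supplement 2: the function name, the parameters `threshold` and `gain`, the rule that saccade amplitude scales with the current error, and the reset of the integrator after each saccade are all assumptions made for illustration.

```python
def simulate_integrate_and_saccade(bar_angle_deg, dt=0.01, threshold=5.0, gain=0.5):
    """Simulate heading for a sequence of bar positions (deg, world frame).

    Returns (headings, saccade_times). All parameter values are hypothetical.
    """
    heading = 0.0      # fly heading in degrees
    integrator = 0.0   # time-integrated bar error in deg * s
    headings = []
    saccade_times = []
    for step, bar in enumerate(bar_angle_deg):
        error = bar - heading          # bar position relative to the frontal midline
        integrator += error * dt       # accumulate error over time
        if abs(integrator) >= threshold:
            heading += gain * error    # saccade toward the bar
            saccade_times.append(step * dt)
            integrator = 0.0           # reset after each saccade
        headings.append(heading)
    return headings, saccade_times


# A bar held 30 deg to one side for 5 s of simulated time.
headings, times = simulate_integrate_and_saccade([30.0] * 500)
print(len(times), headings[-1])
```

      With a larger initial error the threshold is reached sooner, so in this sketch a bar that is displaced further triggers an earlier first saccade, and successive saccades become less frequent as the remaining error shrinks.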

      Please engage more with the literature, especially work that directly conflicts with your conclusions (see above). Also, highly relevant work by Bender & Dickinson was not sufficiently discussed. Spot results presented in Fig. 3 should be contextualized in light of the work of Mongeau et al., 2019, who performed similar experiments and identified a switch in saccade valence.

      We appreciate your pointing out the relevant previous work. We have added references to the following papers and tried to describe the relationship between our data and previous ones.

      Bender & Dickinson 2006

      (Line#162) “This simulation experiment is reminiscent of the magnetically tethered flight assay, where a flying fly remains fixed at a position but is free to rotate around its yaw axis (Bender and Dickinson, 2006b; Cellini et al., 2022; G. Kim et al., 2023; Mongeau and Frye, 2017).”

      (Line#218) “We tested the predictions of our models with flies flying in an environment similar to that used in the simulation (Figure 3A). A fly was tethered to a short steel pin positioned vertically at the center of a vertically oriented magnetic field, allowing it to rotate around its yaw axis with minimal friction (Bender and Dickinson, 2006b; Cellini et al., 2022; G. Kim et al., 2023).”

      (Line#238) “To determine if our assay imposes additional friction compared to other assays used in previous studies, we analyzed the dynamics of spontaneous saccades during the “freeze” phase (Figure 3–figure supplement 1A). We found their duration and amplitude to be within the range reported previously (Bender and Dickinson, 2006b; Mongeau and Frye, 2017) (Figure 3–figure supplement 1B-D).”

      Mongeau et al., 2019

      (Line#196) “During this behavior, it has been proposed that visual circuits compute an integrated error of the bar position with respect to the frontal midline and trigger a saccadic turn toward the bar when the integrated value reaches a threshold (Frighetto and Frye, 2023; Mongeau et al., 2019; Mongeau and Frye, 2017). We expanded our bar fixation model to incorporate this behavioral strategy (Figure 2–figure supplement 2).”

      This paper shows that the dynamics of saccadic flight turns elicited by a rotating bar or spot determine whether flies display attraction or aversion. In that study, the visual stimulus—a bar or spot—rotated slowly at a constant 75 deg s⁻¹. By contrast, in our Figure 3 the object moves much faster, driving the neural “integrator” to saturation and triggering an almost immediate flight turn. In Mongeau et al. (2019), saccades occur at variable times and their amplitudes and directions are more stochastic, again reflecting the slower stimulus speed. Because these differences all arise from the disparity in object speed, we did not cite Mongeau et al. (2019) in Figure 3 or the associated text.

      In addition to the two papers cited above, we have incorporated several relevant studies on the Drosophila visuomotor control identified through the reviewers’ insightful comments. Examples include:

      Frighetto G, Frye MA. 2023 (Line#195, 464)

      Rimniceanu et al., 2023 (Line#241)

      Cellini & Mongeau 2020 (Line#91)

      Cellini & Mongeau 2022 (Line#241)

      Cellini et al., 2022 (Line#91, 162, 218)

      Many citations are not in the proper format (e.g. using numbers rather than authors' last name).

      Thank you for letting us know. We have changed the remaining citations to the proper format.

    1. Reviewer #2 (Public review):

      Summary:

      Shahbazi et al. trained recurrent neural networks (RNNs) to simulate human upper limb movement during adaptation to a force field perturbation. They demonstrated that throughout adaptation, the pattern of motor commands to the muscles of the simulated arm changed, allowing the perturbed movements to regain their typical, perturbation-free straight-line paths. After this initial learning block (FF1), the network encountered null-fields to wash out the adaptation, before re-experiencing the force in a second learning block (FF2). Upon re-exposure, the network learned faster than during initial learning, consistent with the savings observed in behavioral studies of adaptation. They also found that as the number of hidden units in the RNN increased, so did the probability of exhibiting savings. The authors concluded that these results propose a neural basis for savings that is independent of context and strategic processes.

      Strengths:

      The paper addresses an important and controversial topic in motor adaptation: the mechanism underlying motor memory. The RNN simulation reproduces behavioral hallmarks of adaptation, and it provides a useful illustration of the pattern of muscle activity underlying human-like movements under both normal and perturbing conditions. While the savings effect produced by the network, though significant, appears somewhat small, the simulation demonstrating an increase in savings with a greater number of hidden units is particularly intriguing.

      Weaknesses:

      (1) To be transparent, savings in motor adaptation have been a primary focus of my own research. Some core findings presented in this paper are at odds with the ideas I and others have previously put forward. While I don't want to impose my agenda on the authors of this paper, I do think the authors should address these issues.

      a) The authors acknowledge the ongoing debate in the literature regarding the mechanisms underlying savings, particularly whether it stems from explicit or implicit learning processes. However, it remains unclear how the current work addresses this debate. There is already a considerable body of research, particularly in visuomotor adaptation, demonstrating that savings is predominantly driven by explicit strategies. For example, when people are asked to report their strategy, they recall a strategy that was useful during the first learning block (Morehead et al. 2015). Furthermore, savings are abolished under experimental manipulations designed to eliminate strategic contributions (e.g., Haith et al., 2015; Huberdeau et al., 2019; Avraham et al., 2021). The authors briefly state that their findings support the hypothesis that a neural basis of memory retention underlying savings can be independent of cognitive or strategic learning components, and that savings can be characterized as implicit. While these statements may be true, it is not clear how this work substantiates these claims.

      b) Our research has also demonstrated that if implicit adaptation is completely washed out after the initial learning block, it not only fails to exhibit savings but is actually attenuated relative to the first learning block (Avraham et al., 2021). This phenomenon of attenuation upon relearning can also be seen in other studies of visuomotor adaptation (e.g., Leow et al., 2020; Yin and Wei, 2020; Hamel et al., 2021; Hamel et al., 2022; Wang and Ivry, 2023; Hadjiosif et al., 2023). More recently, we have shown that this attenuation is due to anterograde interference arising from experience with the washout block (Avraham and Ivry, 2025). We illustrated that the implicit system is highly susceptible to interference; interference doesn't require exposure to salient opposite errors and can occur even following prolonged exposure to veridical feedback.

      The central thesis of this paper, namely that implicit savings can emerge through RNNs, is at odds with these empirical results. The authors should address this discrepancy.

      (2) This brings me to the question about neural correlates: The results are linked to activity in the primary motor cortex. How does that align with the well-established role of the cerebellum in implicit motor adaptation? And with the studies showing that savings are due to explicit strategies, which are generally associated with prefrontal regions?

      (3) The analysis on the complexity of the neural network (i.e., the number of hidden units) and its relationship to savings is very interesting. It makes sense to me that more complex networks would show more savings. I'm not sure I follow the authors' explanation, but my understanding is that increased network complexity makes it more difficult to override the formed memory through interference (e.g., from the experience with NF2). Also, the results indicate that a network with 32 units led to a less-than-chance level of networks exhibiting savings (Figure 3b). What behavioral output does this configuration produce? Could this behavior manifest as attenuation upon relearning? Furthermore, if one were to examine an even smaller, simpler network (perhaps one more closely reflecting cerebellar circuits), would such a model predict attenuation rather than savings?

      (4) The authors emphasize that their network did not receive any explicit contextual signals related to the presence or absence of the force field (FF), thus operating in a 'context-free' manner. From my understanding, some existing models of context's role in motor memories (e.g., Oh and Schweighofer, 2019; Heald et al., 2021) propose that memory-related changes can be observed even without explicit contextual information, as contextual changes can be inferred from sudden or significant environmental shifts (e.g., the introduction or removal of perturbations). Given this, could the observed savings in the current simulation be explained by some form of contextual retrieval, inferred by the network from the re-presentation of the perturbation in FF2?

      (5) If there is residual hidden unit activity related to the FF at the end of the NF2 phase, how does the simulated movement revert back to baseline? Are there any differences in the movement trajectory, beyond just lateral deviation, between NF1 and NF2? The authors state that "changes in the preparatory hidden unit activity did not result in substantive changes in the motor commands (Figure 5b), which emphasizes that the uniform shift resides in the null space of motor output." However, Figure 5b appears to show visible changes in hidden unit activity. Don't these changes reflect a pattern of muscle activity that is the basis for behavior? These changes are indeed small, but it seems that so is the effect size for savings (Figure 3a). Could this suggest that there is not, in fact, a complete washout of initial learning during NF2 within the network?

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      *The authors have a longstanding focus and reputation on single cell sequencing technology development and application. In this current study, the authors developed a novel single-cell multi-omic assay termed "T-ChIC" to jointly profile histone modifications along with the full-length transcriptome from the same single cells, and analyzed the dynamic relationship between chromatin state and gene expression during zebrafish development and cell fate determination. In general, the assay works well, the data look convincing, and the conclusions are beneficial to the community. *

      Thank you for your positive feedback.

      *There are several single-cell methodologies that all claim to co-profile chromatin modifications and gene expression from the same individual cell, such as CoTECH, Paired-tag and others. Although T-ChIC differs in employing pA-MNase and IVT to obtain these modalities from single cells, could the authors provide some direct comparisons among all these technologies to see whether T-ChIC outperforms them? *

      In a separate technical manuscript describing the application of T-ChIC in mouse cells (Zeller, Blotenburg et al 2024, bioRxiv, 2024.05.09.593364), we have provided a direct comparison of data quality between T-ChIC and other single-cell methods for chromatin-RNA co-profiling (please refer to Fig. 1C,D and Fig. S1D,E of the preprint). We show that compared to other methods, T-ChIC is able to better preserve the expected biological relationship between the histone modifications and gene expression in single cells.

      *In the current study, T-ChIC profiled H3K27me3 and H3K4me1 modifications, and these data look great. How about other histone modifications (e.g., H3K9me3 and H3K36me3) and transcription factors? *

      While we haven't profiled these other modifications using T-ChIC in zebrafish, we have previously published high-quality data on these histone modifications using the sortChIC method, on which T-ChIC is based (Zeller, Yeung et al 2023). In our comparison, we find that histone modification profiles from T-ChIC and sortChIC are very similar (Fig. S1C in Zeller, Blotenburg et al 2024). Therefore, the method is expected to work equally well for the other histone marks.

      *T-ChIC can detect full length transcription from the same single cells, but in FigS3, the authors still used other published single cell transcriptomics to annotate the cell types, this seems unnecessary? *

      We used the published scRNA-seq dataset with a larger number of cells to homogenize our cell type labels with these datasets, but we also cross-referenced our cluster-specific marker genes with ZFIN and homogenized the cell type labels with the ZFIN ontology. This way our annotation is in line with previous datasets but not biased by them. Due to the relatively small size of our data, we didn't expect to identify unique, rare cell types, but our full-length total RNA assay helps us identify non-coding RNAs such as miRNAs previously undetected in scRNA assays, which we have now highlighted in new Figure S1c.

      *Throughout the manuscript, the authors found some interesting dynamics between chromatin state and gene expression during embryogenesis, independent approaches should be used to validate these findings, such as IHC staining or RNA ISH? *

      We appreciate that the ISH staining could be useful to validate the expression pattern of genes identified in this study. But to validate the relationships between the histone marks and gene expression, we need to combine these stainings with functional genomics experiments, such as PRC2-related knockouts. Due to their complexity, such experiments are beyond the scope of this manuscript (see also reply to reviewer #3, comment #4 for details).

      *In Fig2 and FigS4, the authors showed H3K27me3 cis spreading during development, this looks really interesting. Is this zebrafish specific? H3K27me3 ChIP-seq or CutTag data from mouse and/or human embryos should be reanalyzed and used to compare. The authors could speculate some possible mechanisms to explain this spreading pattern? *

      Thanks for the suggestion. In this revision, we have reanalysed a dataset of mouse ChIP-seq of H3K27me3 during mouse embryonic development by Xiang et al (Nature Genetics 2019) and find similar evidence of spreading of H3K27me3 signal from their pre-marked promoter regions at E5.5 epiblast upon differentiation (new Figure S4i). This observation, combined with the fact that the mechanism of pre-marking of promoters by PRC1-PRC2 interaction seems to be conserved between the two species (see (Hickey et al., 2022), (Mei et al., 2021) & (Chen et al., 2021)), suggests that the dynamics of H3K27me3 pattern establishment is conserved across vertebrates. But we think a high-resolution profiling via a method like T-ChIC would be more useful to demonstrate the dynamics of signal spreading during mouse embryonic development in the future. We have discussed this further in our revised manuscript.

      Reviewer #1 (Significance (Required)):

      *The authors have a longstanding focus and reputation on single cell sequencing technology development and application. In this current study, the authors developed a novel single-cell multi-omic assay termed "T-ChIC" to jointly profile histone modifications along with the full-length transcriptome from the same single cells, and analyzed the dynamic relationship between chromatin state and gene expression during zebrafish development and cell fate determination. In general, the assay works well, the data look convincing, and the conclusions are beneficial to the community. *

      Thank you very much for your supportive remarks.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      *Joint analysis of multiple modalities in single cells will provide a comprehensive view of cell fate states. In this manuscript, Bhardwaj et al developed a single-cell multi-omics assay, T-ChIC, to simultaneously capture histone modifications and full-length transcriptome and applied the method on early embryos of zebrafish. The authors observed a decoupled relationship between the chromatin modifications and gene expression at early developmental stages. The correlation becomes stronger as development proceeds, as genes are silenced by the cis-spreading of the repressive marker H3K27me3. Overall, the work is well performed, and the results are meaningful and interesting to readers in the epigenomic and embryonic development fields. There are some concerns before the manuscript is considered for publication. *

      We thank the reviewer for appreciating the quality of our study.

      *Major concerns: *

        • A major point of this study is to understand embryo development, especially gastrulation, with the power of scMulti-Omics assay. However, the current analysis didn't focus on deciphering the biology of gastrulation, i.e., lineage-specific pioneer factors that help to reform the chromatin landscape. The majority of the data analysis is based on the temporal dimension, but not the cell-type-specific dimension, which reduces the value of the single-cell assay. *

      We focused on the lineage-specific transcription factor activity during gastrulation in Figure 4 and S8 of the manuscript and discovered several interesting regulators active at this stage. During our analysis of the temporal dimension for the rest of the manuscript, we also classified the cells by their germ layer and "latent" developmental time by taking the full advantage of the single-cell nature of our data. Additionally, we have now added the cell-type-specific H3K27-demethylation results for 24hpf in response to your comment below. We hope that these results, together with our openly available dataset would demonstrate the advantage of the single-cell aspect of our dataset.

      1. *The cis-spreading of H3K27me3 with developmental time is interesting. Considering H3K27me3 could mark bivalent regions, especially in pluripotent cells, there must be some regions that have lost H3K27me3 signals during development. Therefore, it's confusing that the authors didn't find these regions (30% spreading, 70% stable). The authors should explain and discuss this issue. *

      Indeed we see that ~30% of the bins enriched in the pluripotent stage spread, while 70% do not seem to spread. In line with earlier observations (Hickey et al., 2022; Vastenhouw et al., 2010), we find that H3K27me3 is almost absent in the zygote and is still being accumulated until 24hpf and beyond. Therefore the majority of the sites in the genome still seem to be in the process of gaining H3K27me3 until 24hpf, explaining why we see mostly "spreading" and "stable" states. Considering most of these sites are at promoters and show signs of bivalency, we think that these sites are marked for activation or silencing at later stages. We have discussed this in the manuscript ("discussion"). However, in response to this and an earlier comment, we went back and searched for genes that show H3K27-demethylation in the most mature cell types (at 24 hpf) in our data, and found a subset of genes that show K27 demethylation after acquiring the mark earlier. Interestingly, most of the top genes in this list are well-known as developmentally important for their corresponding cell types. We have added this new result and discussed it further in the manuscript (Fig. 2d,e, Supplementary table 3).

      *Minors: *

        • The authors cited two scMulti-omics studies in the introduction, but there have been lots of single-cell multi-omics studies published recently. The authors should cite and consider them. *

      We have cited more single-cell chromatin and multiome studies focussed on early embryogenesis in the introduction now.

      *2. T-ChIC seems to have been presented in a previous paper (ref 15). Therefore, Fig. 1a is unnecessary to show. *

      Figure 1a shows a summary of our zebrafish T-ChIC workflow, which contains the unique sample multiplexing and sorting strategy to reduce batch effects, which was not applied in the original T-ChIC workflow. We have now clarified this in "Results".

      1. *It's better to show the percentage of cell numbers (30% vs 70%) for each heatmap in Figure 2C. *

      We have added the numbers to the corresponding legends.

      1. *Please double-check the citation of Fig. S4C, which may not relate to the conclusion of signal differences between lineages. *

      The citation seems to be correct (Fig. S4C supplements Fig. 2C, but shows mesodermal lineage cells) but the description of the legend was a bit misleading. We have clarified this now.

      *5. Figure 4C has not been cited or mentioned in the main text. Please check. *

      Thanks for pointing it out. We have cited it in Results now.

      Reviewer #2 (Significance (Required)):

      *Strengths: This work utilized a new single-cell multi-omics method and generated abundant epigenomics and transcriptomics datasets for cells covering multiple key developmental stages of zebrafish. *

      *Limitations: The data analysis was superficial and mainly focused on the correspondence between the two modalities. The discussion of developmental biology was limited. *

      *Advance: The zebrafish single-cell datasets are valuable. The T-ChIC method is new and interesting. *

      *The audience will be specialized and from basic research fields, such as developmental biology, epigenomics, bioinformatics, etc. *

      *I'm more specialized in the direction of single-cell epigenomics, gene regulation, 3D genomics, etc. *

      Thank you for your remarks.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      *This manuscript introduces T‑ChIC, a single‑cell multi‑omics workflow that jointly profiles full‑length transcripts and histone modifications (H3K27me3 and H3K4me1) and applies it to early zebrafish embryos (4-24 hpf). The study convincingly demonstrates that chromatin-transcription coupling strengthens during gastrulation and somitogenesis, that promoter‑anchored H3K27me3 spreads in cis to enforce developmental gene silencing, and that integrating TF chromatin status with expression can predict lineage‑specific activators and repressors. *

      *Major concerns *

      1. *Independent biological replicates are absent, so the authors should process at least one additional clutch of embryos for key stages (e.g., 6 hpf and 12 hpf) with T‑ChIC and demonstrate that the resulting data match the current dataset. *

      Thanks for pointing this out. We had, in fact, performed T-ChIC experiments in four rounds of biological replicates (independent clutches of embryos) and merged the data to create our resource. Although not all timepoints were profiled in each replicate, two timepoints (10 and 24hpf) are present in all four, and the cell-type composition of these replicates at these two timepoints is very similar. We have added new plots in figure S2f and added (new) supplementary table (#1) to highlight the presence of biological replicates.

      2. *The TF‑activity regression model uses an arbitrary R² ≥ 0.6 threshold; cross‑validated R² distributions, permutation‑based FDR control, and effect‑size confidence intervals are needed to justify this cut‑off. *

      Thank you for this suggestion. We did use 10-fold cross validation during training and obtained the R2 values of TF motifs from the independent test set as an unbiased estimate. However, the cutoff of R2 > 0.6 to select the TFs for classification was indeed arbitrary. In the revised version, we now report the FDR-adjusted p-values for these R2 estimates based on permutation tests, and select TFs with a cutoff of padj supplementary table #4 to include the p-values for all tested TFs. However, we see that our arbitrary cutoff of 0.6 was, in fact, too stringent, and we can classify many more TFs based on the FDR cutoffs. We also updated our reported numbers in Fig. 4c to reflect this. Moreover, supplementary table #4 contains the complete list of TFs used in the analysis to allow others to choose their own cutoff.
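
      As an illustration only, the general idea of permutation-based p-values for an R² estimate followed by FDR adjustment can be sketched as below. This is not the authors' pipeline: the function names, the choice to shuffle held-out targets rather than refit the model, and the use of the Benjamini-Hochberg procedure are all assumptions made for the sketch.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination for paired observed/predicted values."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_pvalue(y_true, y_pred, n_perm=1000, seed=0):
    """Empirical p-value: how often shuffled targets give R² >= the observed R².

    Simplification (assumption): only the held-out targets are shuffled;
    a stricter test would refit the model on each permutation.
    """
    rng = random.Random(seed)
    observed = r_squared(y_true, y_pred)
    shuffled = list(y_true)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if r_squared(shuffled, y_pred) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

      In such a scheme, each TF motif would contribute one permutation p-value, and the adjusted values would replace a fixed R² threshold when selecting TFs.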

      3. *Predicted TF functions lack empirical support, making it essential to test representative activators (e.g., Tbx16) and repressors (e.g., Zbtb16a) via CRISPRi or morpholino knock‑down and to measure target‑gene expression and H3K4me1 changes. *

      We agree that independent validation of the functions of our predicted TFs on target gene activity would be important. During this revision, we analysed recently published scRNA-seq data of Saunders et al. (2023), which includes CRISPR-mediated F0 knockouts of a couple of our predicted TFs, but the scRNAseq was performed at later stages (24hpf onward) compared to our H3K4me1 analysis (which was 4-12 hpf). Therefore, we saw off-target genes being affected in lineages where these TFs are clearly not expressed (attached Fig 1). We therefore didn't include these results in the manuscript. In future, we aim to systematically test the TFs predicted in our study with CRISPRi or similar experiments.

      4. *The study does not prove that H3K27me3 spreading causes silencing; embryos treated with an Ezh2 inhibitor or prc2 mutants should be re‑profiled by T‑ChIC to show loss of spreading along with gene re‑expression. *

      We appreciate the suggestion that PRC2-disruption followed by T-ChIC or other forms of validation would be needed to confirm whether the H3K27me3 spreading is indeed causally linked to the silencing of the identified target genes. But performing this validation is complicated for multiple reasons. Due to the EZH2 contribution from maternal RNA and the contradicting effects of various EZH2 zygotic mutations (depending on where the mutation occurs), the only properly validated PRC2-related mutant seems to be the maternal-zygotic mutant MZezh2, which requires germ cell transplantation (see Rougeot et al., 2019, and San et al., 2019, for details). The use of inhibitors has been described in other studies (den Broeder et al., 2020; Huang et al., 2021), but they do not show a validation of the H3K27me3 loss or a similar phenotype as the MZezh2 mutants, and can present unwanted side effects and toxicity at a high dose, affecting gene expression results. Moreover, in an attempt to validate, we performed our own trials with the EZH2 inhibitor (GSK123) and saw that this time window might be too short to see the effect within 24hpf (attached Fig. 2). Therefore, this validation is a more complex endeavor beyond the scope of this study. Nevertheless, our further analysis of H3K27me3 de-methylation on developmentally important genes (new Fig. 2e-f, Sup. table 3) adds more confidence that the polycomb repression plays an important role, and provides enough ground for future follow up studies.

      *Minor concerns *

      1. *Repressive chromatin coverage is limited, so profiling an additional silencing mark such as H3K9me3 or DNA methylation would clarify cooperation with H3K27me3 during development. *

      We agree that H3K27me3 alone would not be sufficient to fully understand the repressive chromatin state. Extension to other chromatin marks and DNA methylation would be the focus of our follow up works.

      *2. Computational transparency is incomplete; a supplementary table listing all trimming, mapping, and peak‑calling parameters (cutadapt, STAR/hisat2, MACS2, histoneHMM, etc.) should be provided. *

      As mentioned in the manuscript, we provide an open-source pre-processing pipeline "scChICflow" to perform all these steps (github.com/bhardwaj-lab/scChICflow). We have now also provided the configuration files on our zenodo repository (see below), which can simply be plugged into this pipeline together with the fastq files from GEO to obtain the processed dataset that we describe in the manuscript. Additionally, we have also clarified the peak calling and post-processing steps in the manuscript now.

      *3. Data‑ and code‑availability statements lack detail; the exact GEO accession release date, loom‑file contents, and a DOI‑tagged Zenodo archive of analysis scripts should be added. *

      We have now publicly released the .h5ad files with raw counts, normalized counts, and complete gene and cell-level metadata, along with signal tracks (bigwigs) and peaks on GEO. Additionally, we now also released the source datasets and notebooks (.Rmarkdown format) on Zenodo that can be used to replicate the figures in the manuscript, and updated our statements on "Data and code availability".

      *4. Minor editorial issues remain, such as replacing "critical" with "crucial" in the Abstract, adding software version numbers to figure legends, and correcting the SAMtools reference. *

      Thank you for spotting them. We have fixed these issues.

      Reviewer #3 (Significance (Required)):

      The method is technically innovative and the biological insights are valuable; however, several issues-mainly concerning experimental design, statistical rigor, and functional validation-must be addressed to solidify the conclusions.

      Thank you for your comments. We hope to have addressed your concerns in this revised version of our manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary

      In this study, Takagi and colleagues demonstrate that changes in axonal arborization of the segmental wave motor command neurons are sufficient to change behavioral motor output.

      The authors identify the Wnt receptors DFz2 and DFz4 and the ligand Wnt4 as modulators of stereotypic segmental arborization patterns of segmental wave neurons along the anterior-posterior body axis. Based on both embryonic expression pattern analysis and genetic manipulation of the signaling components in wave neurons (receptors) and the neuropil (Wnt4) the authors convincingly demonstrate that Wnt4 acts as a repulsive ligand for DFz2 that restricts posterior axon guidance of both anterior and posterior wave neurons. They also provide the first evidence that Wnt4 potentially acts as an attractive ligand for DFz4 to promote the posterior extension of p-wave neurons. Interestingly, artificial optogenetic activation of all wave neurons that normally induces backward locomotion due to the activity of anterior wave neurons, fails to induce backward locomotion in a DFz2 knockdown condition with altered axonal extensions of all wave neurons towards posterior segments. In addition, the authors now observe enhanced fast-forward locomotion, a feature normally induced by posterior wave neurons. Consistent with these findings, they observe that the natural response to an anterior tactile stimulus is similarly altered in DFz2 knockdown animals. The animals respond with less backward movement and increased fast forward motion. These results suggest that alterations in the innervation pattern of wave motor command neurons are sufficient to switch behavioral response programs.

      Strengths

      The authors convincingly demonstrate the importance of Wnt signaling for anterior-posterior axon guidance of a single class of motor command neurons in the larval CNS. The demonstration that alteration of the expression level of a single axon guidance receptor is sufficient to not only alter the innervation pattern but to significantly modify the behavioral response program of the animal provides a potential entry point to understanding behavioral adaptations during evolution.

      Weaknesses

      While the authors demonstrate an alteration of the behavioral response to a natural tactile stimulus, the observed effects, a reduction of backward motion and increased fast-forward locomotion, currently cannot be directly correlated to the morphological alterations observed in the single-neuron analyses. The authors do not report any loss of innervation in the "normal" target region but only a small additional innervation of more posterior regions. An analysis of synaptic connectivity and/or a more detailed morphological analysis that is supported by a larger number of analyzed neurons both in control and experimental animals would further strengthen the confidence of the study. As the authors suggest an alteration of the command circuitry, a direct observation of the downstream activation pattern in response to selective optogenetic stimulation of anterior wave neurons would further strengthen their claims (analogous to Takagi et al., 2017, Figure 4).

      We sincerely thank the reviewer for their insightful comments, which were instrumental in improving our manuscript. In response to the reviewers’ suggestion, we have now studied Brp expression and demonstrate that the ectopically extending Wave axons in the posterior region do contain synapses (new Figure 2). This finding supports the idea that these axons are functionally connected to ectopic downstream circuits. 

      Additionally, we have increased the number of analyzed Wave clones in Figure 1F-J (WT and DFz2 KD) and new Figure 3C-G (WT; formerly Figure 2C-G) to strengthen the morphological analyses. We fully agree with the reviewer that “direct observation of the downstream activation pattern in response to selective optogenetic stimulation” would further reinforce our conclusions. However, this was not feasible in the current study since we found that the Wave-Gal4 driver used in this study, which drives expression during embryonic stages, does not drive sufficiently strong expression in the larvae to enable selective optogenetic stimulation (please see below for details). 

      Reviewer #2 (Public Review):

      Summary:

      The authors previously demonstrated that anterior-located a-Wave neurons (neuromeres A1-A3) extend axons anteriorly to connect to circuits inducing backward locomotion, while p-Wave axons (neuromeres A4-A7) project posteriorly to promote forward locomotion in Drosophila larvae. In the manuscript, the authors aim to determine the molecular mechanisms involved in wiring the segmentally homologous Wave neurons distinctively, making them functionally different in modulating forward or backward locomotion. The genetic screen focused on Wnt/Fz-signaling due to its known anterior-to-posterior guidance roles in mammals and nematodes.

      Strengths:

      Knock-down (KD) of DFz2 with two independent RNAi-lines caused ectopic posterior axon and dendrite extension for all a- and p-Wave neurons, with a-Wave axons extending into regions where p-Wave axons normally project. Both behavioral assays (optogenetic stimulation of all Wave neurons or tactile stimuli on heads using a von Frey filament) show that backward movement is reduced or absent and that the speed of evoked fast-forward locomotion is increased. This demonstrates that altered projections of Wave do alter behavior and the DFz2 KD phenotype is consistent with the potential aberrant wiring of a-Wave neurons to forward locomotion-promoting circuits instead of to backward locomotion-promoting circuits.

      The main conclusion, that Wnt/Fz-signaling is essential for the guidance of Wave neurons and in diversifying their projection pattern in a segment-specific manner, is further supported by the results showing that DFz2 gain of function causes shortening of a-Wave but not p-Wave axon extensions towards the posterior end and that KD of DFz4 causes axonal shortening only in A6-p-Wave neurons but does not affect dendrites or processes of other Wave neurons. A role for ligand Wnt4 is demonstrated by results indicating that WNT4 mutants' posterior extension of a-Wave axons was elongated similar to DFz2 KD animals and p-Wave axon extension towards the posterior end was shortened similar to DFz2 KD animals. Finally, a DWnt4 gradient decreasing from the posterior (A8) to the anterior end (A2), similar to that described in other species, is supported by analyses of DWnt4 gene expression (using Wnt4 Trojan-Gal4) and protein expression (using antibodies). In contrast, DFz2 receptor levels seemed to decrease from the anterior (A2) to the posterior end (A5/6). Together the results support the conclusion that opposing Wnt/Fz ligand-receptor gradients contribute to the diversification of Wave neurons in a location-dependent manner and that DFz2 and DFz4 have opposing effects on axon extension.

      Weaknesses:

      Wave axon and dendrite projections are not exclusively determined by Wnt4, DFz2, and DFz4, and are likely to involve other Fz receptors, Wnt ligands, and other types of receptor-ligand signaling pathways. This is in part supported by the fact that Wnt4 loss of function also resulted in phenotypes that do not mimic DFz2 KD or DFz4 KD (Figures 3D, E, and F) and that other Fz/Wnt mutants caused wave neuron phenotypes (Figure 1-supplement 2, D+E). This is not a weakness per se, since it doesn't affect the main conclusion of the manuscript. However, the description and analyses of the data in particular for Figure 1-supplement 2 D should be clarified in the legend. The number within the bars and the asterisks are not defined. It's presumed they refer to numbers of animals assessed and the asterisk next to DFz2 and DFz4 indicate statistically significant differences. However, only one p-value is provided in the legend. It is also unclear if p-values for the other mutants have not been determined or are non-significant. At least for mutants like Corin, which also exhibit altered axon projections, the p-values should be provided.

      We appreciate this reviewer’s careful attention to detail and intellectual curiosity. We apologize for the confusion caused by the statistical reporting in Figure 1 – figure supplement 2D. The numbers shown in the bars represent the number of neurons (i.e. Wave neurons from left or right hemisphere). As mentioned in the Materials and Methods section, we applied a Chi-square test followed by Haberman's adjusted residual analysis to determine the statistical significance of each RNAi group. The p-value provided in the figure legend corresponds to the Chi-square test. P-values for Haberman's adjusted residual analysis were calculated for all RNAi groups and groups without the asterisk are not statistically significant. We have clarified these points in the corresponding figure legend.
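
      For readers unfamiliar with this two-step procedure, a minimal sketch of a Chi-square statistic followed by Haberman's adjusted residuals is given below. It is illustrative only: the function names and the example counts are hypothetical and are not taken from the manuscript, and the per-cell p-values use a plain two-sided normal approximation without any multiple-testing correction.

```python
import math


def adjusted_residuals(table):
    """Chi-square statistic, degrees of freedom, and Haberman's adjusted residuals.

    `table` is a list of rows of observed counts, e.g. one row per RNAi group
    and one column per phenotype class (hypothetical layout).
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n  # expected count under independence
            chi2 += (obs - exp) ** 2 / exp
            # Adjusted residual: raw residual scaled by its estimated standard error.
            se = math.sqrt(exp * (1 - row_tot[i] / n) * (1 - col_tot[j] / n))
            res_row.append((obs - exp) / se)
        residuals.append(res_row)
    dof = (len(table) - 1) * (len(col_tot) - 1)
    return chi2, dof, residuals


def two_sided_p(z):
    """Two-sided p-value for a standard normal deviate."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: rows = two groups, columns = (normal, altered) projections.
example = [[10, 20], [20, 10]]
chi2, dof, res = adjusted_residuals(example)
```

      The overall Chi-square test asks whether any group differs; the adjusted residuals then indicate which individual cells deviate from expectation, which is how a per-group asterisk could be assigned.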

      Figure 4 D, F. The gradient for Wnt4 was determined by comparison of expression levels of other segments to A8 but the gradient for DFz2 was by comparison to A2 and the data supports opposing gradients. However, for DFz2 (Figure 4, F) it seems that the gradient is bi-directional with the lowest being in A5 and increasing towards A2 as well as A8. Analysis should be performed in reference to A8 as well to determine if it is indeed bi-directional. While such a finding would not affect the interpretation of aWave neurons, it may impact conclusions about p-Wave neuron projections.

      We thank the reviewer for highlighting this interesting possibility. In response, we performed an additional analysis of the DFz2 gradient by comparing the signal from each neuromere to that from A8 (new Figure 5—figure supplement 3). This analysis confirmed that the gradient is indeed bidirectional. We revised the description of DFz2 expression accordingly in the revision. We believe this finding does not affect our main conclusions since only the anterior gradient is relevant for a-Wave axon guidance. 

      As discussed above, the DFz2 KD phenotypes are consistent with the potential aberrant wiring of a-Wave neurons to forward locomotion-promoting circuits instead of to backward locomotion-promoting circuits. However, since the axon and dendrites of a-Wave and p-Wave are affected the actual dendritic and axonal contributions for the altered behavior remain elusive. The authors certainly considered a potential contribution of altered dendrite projection of a-Wave neurons to the phenotype and their conclusion that altered axonal projections are involved is supported by the optogenetic experiment "bypassing" sensory input (albeit it seems unlikely that all Wave neurons are activated simultaneously when perceiving natural stimuli). However, the authors should also consider that altered perception and projection of p-Wave neurons may directly (e.g. extended p-Wave axon projections increase forward locomotion input thereby overriding backward locomotion) or indirectly (e.g. feedback loops between forward and backward circuits) contribute to the altered behavioral phenotypes in both assays. It is probably noteworthy that the more complex behavioral alterations observed with mechanical stimulation are likely to also be caused by altered dendritic projections.

      We fully agree with the reviewer’s thoughtful interpretation. We have now included these important possibilities in the revised Discussion section. Specifically, we acknowledge that while the DFz2 knockdown phenotypes are consistent with aberrant wiring of a-Wave neurons to forward locomotion-promoting circuits, the contributions of both axonal and dendritic alterations remain unclear. We also recognize that altered perception and projection of p-Wave neurons may directly or indirectly contribute to the observed behavioral phenotypes, particularly in response to mechanical stimulation.

      Presynaptic varicosities of a-Wave neurons in DFz2 KD animals are indicated by orange arrows in Figure 1. However, no presynaptic markers have been used to confirm actual ectopic synaptic connections. At least the authors should more clearly define what parameters they used to "visually" define potential presynaptic varicosities. Some arrows seem to point to more "globular structures" but for several others, it's unclear what they are pointing at.

      As mentioned in our response to Reviewer #1, we have now performed Brp immunostaining to confirm the presence of ectopic synaptic connections (new Figure 2). This analysis supports the interpretation that the presynaptic varicosities observed in DFz2 knockdown animals represent actual synaptic sites. We also clarified in the figure legend the visual criteria used to identify potential presynaptic varicosities.

      Reviewing Editor (Recommendations For The Authors):

      There are a few major concerns that we recommend the authors address:

      (1) Neuroanatomy: The point about aberrant synaptic connectivity of a-Wave neurons following Dfz2 knockdown could be substantiated. This could be done by using a presynaptic marker and showing ectopic posterior presynaptic sites (and/or reduced anterior presynaptic sites) in a-wave neurons.

      As mentioned in our response to the public review, we now have used Brp as a presynaptic marker to quantify the number and distribution of presynaptic sites along the normal and ectopic a-Wave axons (new Figure 2). We show that ectopic posterior Wave axons do contain presynaptic sites.  

      (2) Gradient calculations: As detailed in the reviews below, the Dfz2 gradient looks like it may be bidirectional. Changing the way the gradient is calculated might help address this point.

      As mentioned in our response above, we now have recalculated the gradient by comparing the DFz2 signal to A8 and show that it indeed is bidirectional (new Figure 5—figure supplement 2; formerly Figure 4—figure supplement 2).

      (3)  Statistics and sample sizes: As detailed in the reviews, some of the statistical reporting could be improved. Further, increasing sample sizes could help bolster confidence in the data as well.

      As mentioned above, we have added a description on the sample size, asterisks, and p-values in Figure 1 – figure supplement 2 legend. We also increased sample sizes of single Wave neurons in control and DFz2 knock-down animals (Figure 1F-J (WT and DFz2 KD) and new Figure 3C-G (WT; formerly Figure 2C-G)).

      (4) It would help to include some discussion of the potential contributions of altered p-wave neurons to the observed phenotypes.

      As described above, we have added in the Discussion potential contributions of altered p-wave neurons to the observed phenotypes. 

      Reviewer #1 (Recommendations For The Authors):

      (1) In the current model the authors assume that posterior elongation of a-wave neuron connectivity (axonal projections) induces a loss of connectivity to their natural targets, as backward motion is no longer induced, and a gain of connectivity to posterior wave neuron targets. Is this at the cost of innervation of p-wave neurons, e.g. did these neurons now lose connectivity to their natural targets as well? Therefore, it would be very interesting if the authors would test the behavioral responses to tactile stimuli in the posterior parts of the animal - does the response pattern change?

      This is indeed an interesting possibility that p-Wave function is altered upon DFz2 knock-down and hence behavioral response to posterior touch is changed. However, it is technically challenging to test this with tactile stimuli, due to the difficulty of (1) distinguishing between normal and fast-forward locomotion and (2) delivering a posterior touch stimulus while the larva is moving forward, which is the default behavior of the larvae on an agar plate.

      As highlighted above, the authors should provide additional evidence that the circuit response to a-wave neurons is changed after a DFz2 knockdown. The authors should monitor the activation wave in response to optogenetic activation of anterior wave neurons - analogous to the data provided in Figure 4 of their 2017 paper. If this response is now switched for a-wave activation but not p-wave activation it would greatly support their claims and this data would be less ambiguous compared to the behavioral locomotion data.

      As described in our response to the public review, we attempted this approach but found that the in vitro optogenetics experiment is unfortunately not feasible due to relatively weak expression of R60G09-GAL4 in the larvae. Local activation of control a-Wave induced fictive backward locomotion only at low frequencies, making comparison with the experimental a-Wave very difficult. The MB120B-spGAL4 used in our 2017 study could not be employed in this study as it does not drive expression during the embryonic stages and thus cannot be used to knock down DFz2 during development.

      (2) Related to this point. Why would the normal "backward" circuitry of a-wave neurons be functionally suppressed in Dfz2 knockdowns? Do the authors observe reduced synaptic connectivity in these segments? Vesicle clustering of synaptotagmin or other presynaptic markers could be used as a first step. As the innervation pattern is only extended by approximately one segment, it is surprising that the changes are so significant.

      We agree that these are important and interesting points, which remain to be explored in future studies. As described above, we have performed Brp immunostaining and showed that the posterior ectopic axons of a-Wave do contain synapses (new Figure 2). We also found a slight decrease in the number of synapses in the anterior region, which could partially contribute to the weaker activation of downstream neurons responsible for eliciting backward locomotion. Another possibility is that backward suppression occurs through lateral interaction among downstream circuits. Since forward and backward locomotion do not occur simultaneously, it is likely that the circuits driving these two behaviors are mutually inhibitory. Upon DFz2 knock down in a-Wave, downstream neurons inducing fast-forward locomotion may become more strongly activated than those inducing backward locomotion, resulting in inhibition of the latter via a “winner-take-all” mechanism. Since these discussions are highly speculative, we chose not to include them in the revised manuscript.

      (3) The low number of neurons analyzed per segment is of slight concern. This is particularly the case for the control data set used in Figure 1 and Figure 2. As stated, the same datasets are used for both figures. However, at most 6 neurons were analyzed (and for two segments only 3). The control morphology may be more variable than indicated by this data.

      As mentioned above, we now have dissected 50 larvae each for the control and experimental groups, obtained seven and six clones respectively, and included these data in the revised manuscript. We apologize that the sample sizes are still relatively small but hope the reviewer understands the inherently low “hit rate” of the stochastic labelling method.

      It is somewhat curious that in Figure 1- Supplement 3 the authors report the same number of control clones per segment as in Figure 1/2 - is this simply a coincidence? And if this is an independent dataset why did the author use new controls here but not for Figure 2? It is clear that it is very difficult to generate this data but increasing the n-number beyond 3-6 per segment would significantly increase the confidence in the presented data.

      We apologize for the confusion. The data in Figure 1 – figure supplement 3 represent the innervation pattern of dendrites, not axons. We have corrected the figure caption accordingly. These data were obtained from the same samples used to analyze axonal innervation, as shown in the original version of Figure 1F-J.

      (2) The name of the RNAi lines should be indicated in Figure 1 and Figure Supplement 3 to facilitate reading - at least the precise names should be given in both figure legends.

      We have added these labels in the revised figure legends as requested.

      (3) In Figure 4E again the control numbers of Figure 1 for the A2-wave axon are reused. This does not seem appropriate as now a different Gal4 driver is used and a different method to induce individual neuronal clones. Both components may induce significant variability in expression or arborization. As only 3 clones for the wnt4 mutant condition are analyzed (and compared to 5 control clones), this data does not allow for strong conclusions. The authors clearly state the reuse and different methods in the legend of Figure 4 F/G but should also highlight it for the E panel.

      Here, we assume that the reviewer is referring to the former Figure 3 (now Figure 4). We have added a note in the legend that the control data, obtained using a different method, were reused in this panel.

      (4) The expression levels of DWnt4 and DFz2 were analyzed at the end of embryogenesis. At what developmental stage does the axonal extension of wave neurons take place? Is the gradient maintained throughout the first larval stages?

      Based upon the lateral view of Wave neurons in Figure 1—figure supplement 1D, we think that the axonal extension is already established by approximately 20 hr after egg laying. Previously, we performed Wnt4<sup>MI03717-Trojan-GAL4</sup> > GFP.nls immunostaining in the third instar larva and observed a similar gradient of GFP signals towards the posterior end of the ventral nerve cord (VNC). We have included this data in the revised manuscript (new Figure 5—figure supplement 1).

      (5) The authors state that either 2nd or 3rd instar larvae were used for the optogenetic experiments. This may induce unnecessary variation in their assay and should be avoided. As natural variance exists in larvae regarding forward stride duration, the comparison of "on" state forward stride duration between control and experimental genotype is potentially not the best measurement of effect size. What is the difference between OFF and ON stage within the control and experimental genotype? In both cases stride duration decreases but there may not be a significant difference between the delta of the two genotypes. Thus, the observed effect may in part be due to "slower" animals in the control pool. The authors should discuss this more carefully.

      We thank the reviewer for bringing up this critical issue. Indeed, the stride durations of larvae between the control and DFz2 knock-down are slightly different in the OFF condition, although this is not statistically significant. In addition, the effect size of Wave activation on mean stride duration is -0.14 s in control versus -0.21 s in DFz2 knock-down, which we interpret as DFz2 knock-down resulting in stronger fast-forward locomotion upon Wave activation. We have incorporated this note in the corresponding figure legends (new Figure 6; formerly Figure 5).

      (6) While the study clearly provides convincing evidence for their model, the authors should tune down their conclusions in the discussion a little bit and highlight that parts of their discussion are speculative.

      We have revised the discussion as suggested.

      Reviewer #2 (Recommendations For The Authors):

      Albeit the optogenetic behavioral experiments strongly support that the altered axonal projection affect normal locomotion, simultaneous labeling of Wave neurons in DFz2 KD animals with presynaptic markers would strengthen the conclusion of ectopic connection of the extended axon with other circuits.

      Please see our response to your public review.

      Figure 1 K+L, Figure 2H, I, Figure 3 F+G: many of the individual data points are not visible in the Whisker plot- changing their color would be useful to visualize them better.

      We have changed the outline width of the box plots to make the individual data points visible.

      Figure 1-Supplement 2: In addition to the comments in the public review- a) the asterisk font size changes in the different panels, e.g. it is much smaller in G', b) font size in some graphs/legends should be increased - in particular in E the hyphenated letters in the genotypes are so small rendering them almost illegible.

      We have unified the font size to make them readable in the figure. We thank the reviewer for the suggestions.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      “This is an exploratory study that doesn't explore quite enough. Critically, the authors make a point of mentioning that neuronal firing properties vary across cell types, but only use baseline firing rate as a proxy metric for cell type. This leaves several important explorations on the table, not limited to the following:”

      1a: “Do waveform shape features, which can also be informative of cell type, predict the effect of stimulation?”

      To address this question, we modeled our approach to cell type classification after Peyrache et al. 2012. More specifically, we extracted two features from the mean unit waveforms—the valley-to-peak time (VP) and the peak half-width (PHW). These features were then used to classify units into two distinct clusters (k-means, clusters = 2, based on a strong prior from existing literature), representing putative excitatory and inhibitory neurons. Our approach recapitulated many of the same observations in Peyrache et al. 2012, namely (1) identification of two clusters (low PHW/VP: inhibitory, high PHW/VP: excitatory), (2) an ~80/20 ratio of excitatory/inhibitory neurons, and (3) greater baseline firing rates in the inhibitory vs. excitatory neurons. However, we did not observe a preferential modulation of one cell type compared to another (see newly created Figure 4). A description of this analysis and its takeaways has been incorporated into the manuscript.

      Change to Text:

      Created Figure 4 (Separation of presumed excitatory and inhibitory neurons by waveform morphology).

      Caption: (A) Two metrics were calculated using the averaged waveforms for each detected unit: the valley-to-peak width (VP) and peak half-width (PHW). (B) Scatterplot of the relationship between VP and PHW; note that units with identical metrics are overlaid. Using k-means clustering, we identified two distinct response clusters, representing presumed excitatory (E, blue) and inhibitory (I, red) neurons. The units from which the example waveforms were taken are outlined in black. Probability distributions for each metric are shown along the axes. (C) Total number of units within each cluster, separated by region. (D) Comparison of baseline firing rates, separated by cluster. (E) Percent of modulated units in each cluster. * p < 0.05, NS = not significant.

      Added a description of clustering methodology to lines 132-137: “We calculated two metrics from the averaged waveform from each detected unit: the valley-to-peak width (VP) and the peak half-width (PHW) (Figure 4A); previously, these two properties of waveform morphology have been used to discriminate pyramidal cells (excitatory) from interneurons (inhibitory) in human intracranial recordings (Peyrache et al., 2012). Next, we performed k-means clustering (n = 2 clusters) on the waveform metrics, in line with previous approaches to cell type classification.”

      Added a section in the Results titled “Theta Burst Stimulation Modulates Excitatory and Inhibitory Neurons Equally”. Lines 370-378: “Using k-means clustering, we grouped neurons into two distinct clusters based on waveform morphology, representing neurons that were presumed to be excitatory (E) and inhibitory (I) (Figure 4B). Inhibitory (fast-spiking) neurons exhibited shorter waveform VP and PHW, compared with excitatory (regular-spiking) neurons (I cluster centroid: VP = 0.50ms, PHW = 0.51ms; E cluster centroid: VP = 0.32ms, PHW = 0.31ms), and greater baseline firing rates (U(N<sub>I</sub> = 23, N<sub>E</sub> = 133) = 1074.50, p = 0.023) (Figure 4D). Although we observed a much greater proportion of excitatory vs. inhibitory neurons (E: 85.3%, I: 14.7%), stimulation appeared to affect excitatory and inhibitory neurons equally, suggesting that one cell type is not preferentially activated over another (Figure 4E).”

      Modified discussion of the effects of stimulation on different cell types. Lines 475-483: “…To test these hypotheses directly, we clustered neurons into presumed excitatory and inhibitory neurons based on waveform morphology. In doing so, we observed ~85% excitatory and ~15% inhibitory neurons, which is very similar to what has been reported previously in human intracranial recordings (Cowan et al. 2024, Peyrache et al., 2012). Interestingly, stimulation appeared to modulate approximately the same proportion of neurons for each cell type (~30%), despite the differently-sized groups. Recent reports, however, have suggested that the extent to which electrical fields entrain neuronal spiking, particularly with respect to phase-locking, may be specific to distinct classes of cells (Lee et al., 2024).”
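
      For readers who want to see the clustering step concretely, a minimal sketch is given below. It is not the analysis code used in the study: the function name `kmeans_2d`, the plain Lloyd's-algorithm implementation, and the ten (VP, PHW) pairs are all illustrative assumptions, and the cluster with the narrower waveforms is simply labelled as the putative inhibitory group.

```python
import random

def kmeans_2d(points, k=2, n_iter=100, seed=0):
    """Lloyd's k-means on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points

    def nearest(p, cents):
        # Index of the centroid closest to p (squared Euclidean distance).
        return min(range(len(cents)),
                   key=lambda c: (p[0] - cents[c][0]) ** 2 + (p[1] - cents[c][1]) ** 2)

    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append((sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:  # converged
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical (VP, PHW) pairs in ms; these are NOT values from the study.
units = [(0.28, 0.27), (0.30, 0.31), (0.33, 0.30), (0.31, 0.33),
         (0.48, 0.50), (0.52, 0.49), (0.50, 0.53), (0.51, 0.51),
         (0.47, 0.52), (0.53, 0.50)]

centroids, labels = kmeans_2d(units, k=2)
# Treat the cluster with the smaller centroid VP as the narrow-waveform group.
narrow = min(range(2), key=lambda c: centroids[c][0])
n_narrow = labels.count(narrow)
n_broad = len(labels) - n_narrow
print(f"narrow-waveform units: {n_narrow}, broad-waveform units: {n_broad}")
```

      With real recordings the two features would usually be standardized before clustering so that neither dominates the distance measure; that step is omitted here because both toy features share the same scale.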

      1b:  “Is the autocorrelation of spike timing, which can be informative about temporal dynamics, altered by stimulation? This is especially interesting if theta-burst stimulation either entrains theta-rhythmic spiking or is more modulatory of endogenously theta-modulated units.”

      The reviewer is correct in suggesting that rate-modulation represents only one of many possible ways by which exogenous theta burst stimulation may influence neuronal activity. Indeed, intracranial theta burst stimulation has previously been shown to evoke theta-frequency oscillatory responses in local field potentials (Solomon et al. 2021), and other forms of stimulation (i.e., transcranial alternating current stimulation) may modulate the rhythm, rather than the rate, of neuronal spiking (Krause et al. 2019).

      To investigate whether stimulation altered rhythmicity in neuronal firing, we contrasted the spike timing autocorrelograms, as suggested. More specifically, we computed the pairwise differences in spike timing for each trial, separating spikes into the same pre-, during-, and post-stimulation epochs described in the manuscript (bin size = 5 ms, max lag = 250 ms), grouped neurons by whether they were modulated, and then contrasted the differences in the latencies of the peak normalized autocorrelation value between epochs. Only neurons with a firing rate of ≥ 1 Hz (n = 70/203, 34.5%) were included in this analysis since sparse firing resulted in noisy autocorrelation estimates. Subsequent statistical testing of the peak latency differences between pre-/during- and pre-/post-stimulation did not reveal any group-level differences (Mann-Whitney U tests, p > 0.05). Thus, we were not able to identify neuronal responses suggestive of altered rhythmicity (see Figure S5). A description of this analysis and its takeaways has been incorporated into the manuscript.

      Of note, there are two elements of the data that constrain our ability to detect modulation in the rhythm of firing. First, the baseline activity recorded across neurons modulated by stimulation was relatively low (i.e., median firing rate = 1.77 Hz). Second, stimulation often resulted in a suppression, rather than an enhancement, of firing rate. Taken together, the sparse firing afforded limited opportunity to characterize changes to subtle patterns of spiking. 

      Change to Text:

      Created Figure S5 (Analysis of modulation in spiking rhythmicity)

      Caption: (A) Representative autocorrelograms (ACG) for a single neuron. The pairwise differences in spike timing were computed for each trial and epoch (bin size = 5 ms, max lag = 250 ms), then smoothed with a Gaussian kernel. The peak in the normalized ACG across trials was computed for each epoch. (B) Kernel density estimate of the peak ACG lag, separated by epoch. (C) The peak ACG lags were split by whether the neuron was modulated (Mod) or unaffected by stimulation (NS = not significant) for each of the two contrasts: pre- vs. during-stim (left) and pre- vs. post-stim (right).

      Details about the autocorrelation methodology have been incorporated. Lines 166-172: “To investigate whether stimulation altered rhythmicity in neuronal firing, we analyzed the spike timing autocorrelograms. More specifically, we computed the pairwise differences in spike timing for each trial (bin size = 5 ms, max lag = 250 ms) and then contrasted the differences in the latencies of the peak normalized autocorrelation value between epochs (pre-, during-, post-stimulation). Only neurons with a firing rate of ≥ 1 Hz (n = 70/203, 34.5%) were included in this analysis since sparse firing resulted in noisy autocorrelation estimates.”

      The results from contrasting the autocorrelograms are now mentioned briefly. Lines 297-298: “Stimulation, however, did not appear to alter the rhythmicity in neuronal firing, as measured by spiking autocorrelograms (Figure S5).”
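
      The autocorrelogram computation described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the study's code: the function names, the pooling of counts across trials before normalization, and the two perfectly rhythmic toy spike trains (one spike every 125 ms) are hypothetical, and the Gaussian smoothing mentioned in the Figure S5 caption is left out.

```python
def autocorrelogram(spike_times_ms, bin_ms=5.0, max_lag_ms=250.0):
    """Histogram of positive pairwise spike-time differences below max_lag_ms."""
    n_bins = int(max_lag_ms / bin_ms)
    counts = [0] * n_bins
    spikes = sorted(spike_times_ms)
    for i, t_ref in enumerate(spikes):
        for t in spikes[i + 1:]:
            lag = t - t_ref
            if lag >= max_lag_ms:
                break  # spikes are sorted, so later lags are larger still
            counts[int(lag // bin_ms)] += 1
    return counts

def peak_lag_ms(counts, bin_ms=5.0):
    """Centre of the bin holding the largest count (first bin if all are zero)."""
    peak_bin = max(range(len(counts)), key=lambda b: counts[b])
    return (peak_bin + 0.5) * bin_ms

# Two hypothetical 1 s trials with one spike every 125 ms (times in ms).
trials = [list(range(0, 1000, 125)), list(range(40, 1000, 125))]

# Pool the per-trial histograms, then normalize by the peak count.
pooled = [sum(col) for col in zip(*(autocorrelogram(t) for t in trials))]
normalized = [c / max(pooled) for c in pooled]

print(peak_lag_ms(pooled))
```

      Repeating this per epoch (pre-, during-, post-stimulation) and comparing the resulting peak lags mirrors the contrast described in the response; with sparse spiking most bins stay empty, which is why a firing-rate inclusion criterion matters.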

      1c: “The authors reference the relevance of spike-field synchrony (30-55 Hz) in animal work, but ignore it here. Does spike-field synchrony (comparing the image presentation to post-stimulation) change in this frequency range? This does not seem beyond the scope of investigation here.”

      We agree that a further characterization of spike-field and spike-phase relationships may provide rich insights into more complex regional and interregional dynamics that may be altered by stimulation. Given that many metrics are biased by sample size (e.g., number of spikes), which can vary considerably, the pairwise phase consistency (PPC) between spikes and the LFP is a preferred metric (Vinck et al. 2010). Although PPC is unbiased, its variance nonetheless increases considerably with low spike counts; pooling spike counts across trials, however, decouples the temporal relationship between spiking and the LFP phase for each trial, confounding results and yielding an unstable estimate.

      To determine whether such an analysis is indeed possible, we calculated the percentage of stimulation trials with ≥ 10 spikes in both the 1s pre- and post-stimulation epochs (a relatively low threshold for inclusion). Only a very small proportion of the total number of trials across all neurons met this criterion (2.5%). Thus, because of the sparse spiking in our data, we are unable to reliably characterize spike-field or spike-phase modulation in detected neurons.

      Change to Text:

      In the manuscript, we have added a description of why our data is not well-suited to investigate these relationships.

      Lines 532-538: “The present study did not investigate interactions between spiking activity and local field potentials because neuronal spiking was sparse at baseline and often further suppressed by stimulation; only a very small proportion of the total number of trials across all neurons exhibited ≥ 10 spikes in both the 1s pre- and post-stimulation epochs (~2.5%). Although certain metrics are not biased by sample size (e.g., pairwise phase consistency), low spike counts can dramatically affect variance and, therefore, result in unstable estimates (Vinck et al., 2011).”
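
      To make the two quantities in this argument concrete, the sketch below computes the pairwise phase consistency from a list of spike phases and the fraction of trials that meet a minimum spike count in both epochs. It is a hedged illustration rather than the analysis code: the function names and toy inputs are assumptions. PPC is written as the mean of cos(θj − θk) over all distinct spike pairs, evaluated through the equivalent resultant-vector identity.

```python
import cmath
import math

def pairwise_phase_consistency(phases):
    """Mean cosine of the phase difference over all distinct spike pairs.

    Uses sum_{j != k} cos(theta_j - theta_k) = |sum_j exp(i*theta_j)|**2 - N.
    """
    n = len(phases)
    if n < 2:
        raise ValueError("PPC needs at least two spikes")
    resultant = sum(cmath.exp(1j * p) for p in phases)
    return (abs(resultant) ** 2 - n) / (n * (n - 1))

def fraction_usable_trials(pre_counts, post_counts, min_spikes=10):
    """Share of trials with at least min_spikes in BOTH epochs."""
    usable = sum(1 for a, b in zip(pre_counts, post_counts)
                 if a >= min_spikes and b >= min_spikes)
    return usable / len(pre_counts)

# Hypothetical examples: perfectly locked spikes vs. evenly spread phases.
locked = [0.3] * 6
spread = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]
print(pairwise_phase_consistency(locked))
print(pairwise_phase_consistency(spread))

# Hypothetical per-trial spike counts in the 1 s pre- and post-stimulation epochs.
pre = [2, 12, 0, 15]
post = [1, 11, 3, 4]
print(fraction_usable_trials(pre, post))
```

      Because the estimate averages over N(N − 1) pairs, a trial with only a handful of spikes contributes very few pairs, which is the source of the high variance noted above.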

      1d: “How does multi-unit activity respond to stimulation? At this somewhat low count of neurons (total n=156 included) it would be valuable to provide input on multi-unit responses to stimulation as well.”

      We thank the reviewer for this suggestion. We have incorporated an analysis of multiunit activity (MUA), which similarly identifies robust modulation via permutation-based statistical testing and characterizes the different profiles of responses (i.e., increased vs. decreased MUA threshold crossings pre- vs. post-stimulation).

      Change to Text:

      Created Figure S8 (Analysis of multiunit activity response to stimulation)

      Caption: (A) Example trace of multiunit activity (MUA) in one channel during a single stimulation trial. Threshold crossings are highlighted with a pink dot overlaid on the MUA signal with a corresponding hash below. (B) The percentage of channels with significantly modulated MUA, separated by the direction of effect. (C) The percentage of channels with significantly modulated MUA, separated by direction of effect and region. Inc (red; post > pre) vs. Dec (blue; post < pre). HIP = hippocampus, OFC = orbitofrontal cortex, AMY = amygdala, ACC = anterior cingulate cortex. *** p < 0.001, NS = not significant.

      Details about the MUA methodology have been incorporated. Lines 174-180: “Finally, we measured modulation in multiunit activity (MUA) by filtering the microelectrode signals in a 300-3,000 Hz window and counting the number of threshold crossings. Thresholds were determined on a per-channel basis and defined as -3.5 times the root mean square of the signal during the baseline period; activity during stimulation was excluded since stimulation artifact is difficult to separate from MUA in the absence of spike sorting.”

      MUA results are now incorporated. Lines 365-367: “Additional characterization of MUA revealed a dominant signature of increased activity post- vs. pre-stimulation, in line with these trends observed at the single-neuron level (Figure S8).”
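
      The thresholding rule quoted above (-3.5 times the baseline root mean square, applied per channel) can be illustrated with a short sketch. The code assumes the signal has already been band-pass filtered to 300-3,000 Hz; the function names, the definition of a crossing as a sample-to-sample transition from above the threshold to at or below it, and the toy samples are assumptions made for illustration only.

```python
import math

def rms(samples):
    """Root mean square of a sequence of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))

def count_threshold_crossings(signal, threshold):
    """Count downward crossings of a negative threshold.

    A crossing is registered when one sample is above the threshold and the
    next is at or below it, so a sustained deflection is counted only once.
    """
    return sum(1 for prev, cur in zip(signal, signal[1:])
               if prev > threshold >= cur)

# Hypothetical, already-filtered samples in arbitrary units.
baseline = [1.0, -1.0] * 50
post_stim = [0.0, -4.0, 0.0, -1.0, -5.0, -6.0, 0.0, -3.6, 1.0]

threshold = -3.5 * rms(baseline)  # per-channel threshold from the baseline period
n_crossings = count_threshold_crossings(post_stim, threshold)
print(threshold, n_crossings)
```

      Comparing the crossing counts between the pre- and post-stimulation windows, channel by channel, gives the quantity that a permutation test could then evaluate.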

      1e: “Several intracranial studies have implicated proximity to white matter in determining the effects of stimulation on LFPs; do the authors see an effect of white matter proximity here?”

      We thank the reviewer for the interesting question. Subsequent characterization revealed only small differences in the proximity of stimulation contacts to white matter (range 1.5-8.0 mm), likely because the chosen target (i.e., basolateral amygdala) has several nearby white matter structures (e.g., stria terminalis). Nonetheless, we performed a linear regression between the proximity to white matter and the stimulation-induced effect on behavior (stimulation vs. no-stimulation d’ difference), the results of which indicate no clear association (p > 0.05; see Figure S9). Critically, this is not to suggest that white matter proximity has no interaction with the reported behavioral effects, but rather, that we could not identify such an association within our data.

      Change to Text:

      Created Figure S9 (The effect of stimulation proximity to white matter and distance to recorded neurons).

      Caption: (A) Kernel density estimate of the Euclidean distance from stimulation contacts to nearest WM structure (in mm); hash marks represent individual observations. (B) The change in memory performance (Δd’) was linearly regressed onto the distance from the stimulated contacts to white matter.

      The following has been added to lines 405-426: “Proximity to white matter has been shown to influence the effects of stimulation on behavior and the strength of evoked responses (Mankin et al., 2021; Mohan et al., 2020; Paulk et al., 2022). Across all stimulated contacts, we observed only small differences in the proximity of stimulation contacts to white matter (median = 4.5 mm, range = 1.5-8.0 mm), likely because the chosen target (i.e., basolateral amygdala) has several nearby white matter structures (e.g., stria terminalis). Nonetheless, we performed a linear regression between the proximity to white matter and the stimulation-induced effect on behavior (stimulation vs. no-stimulation d’ difference), the results of which indicate no clear association (p > 0.05; see Figure S9).”
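
      As a minimal sketch of the regression described here, the code below fits an ordinary least-squares line of Δd’ on distance to white matter and reports the slope, intercept, and R². It is illustrative only: the function name `ols_fit` and the toy distances and Δd’ values are hypothetical, and the p-value reported in the manuscript is not reproduced because it requires a t-distribution that the standard library does not provide.

```python
def ols_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared

# Hypothetical per-session values; NOT data from the study.
distance_mm = [1.5, 2.0, 3.5, 4.5, 4.5, 6.0, 7.0, 8.0]
delta_d_prime = [0.6, -0.2, 0.1, 0.7, -0.6, 0.3, -0.1, 0.4]

slope, intercept, r2 = ols_fit(distance_mm, delta_d_prime)
print(f"slope = {slope:.3f} per mm, intercept = {intercept:.3f}, R^2 = {r2:.3f}")
```

      With a distance range as narrow as the one reported (1.5-8.0 mm), a flat fit limits what can be concluded, which is consistent with the caution expressed in the response.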

      Comment 2: “It is a little confusing to interpret stimulation-induced modulation of neuronal spiking in the absence of stimulation-induced change in behavior. How do the authors' findings tell us anything about the neural mechanisms of stimulation-modulated memory if memory isn't altered? In line with point #1, I would suggest a deeper dive into behavior (e.g. reaction time? Or focus on individual sessions that do change in Figure 4A?) to make a stronger statement connecting the neural results to behavioral relevance.”

      We agree that the connection between the observed stimulation-induced neuronal modulation and effects on behavior is unclear and has proven challenging to elucidate. Per the reviewer’s suggestion, we further focused our analyses on the neuronal modulation effects in the individual sessions that resulted in a robust change in memory performance (stimulation vs. no-stimulation d’ difference threshold of ± 0.5, based on a moderate effect size for Cohen’s d); both a positive and negative threshold were used to capture robust changes in memory performance associated with firing rate modulation, whether enhancement or suppression. To this end, we contrasted the proportion of modulated neurons in the sessions where stimulation resulted in a robust behavioral change (Δd’) with those that did not (~d’). We did not observe a difference in the proportions between groups when collapsed across all sampled regions, or when separately evaluated (Fisher’s exact tests, p > 0.05; see Figure 5C).

      Given that this approach did not further clarify the connection between our neural and behavioral results, we believe it is most appropriate to deemphasize claims in the manuscript regarding the potential insights for behavioral modulation (e.g., memory enhancement), and have done so.

      Change to Text:

      Toned down reference to the memory-related effects of stimulation in the abstract by removing the following lines from the abstract: “Previously, we demonstrated that intracranial theta burst stimulation (TBS) of the basolateral amygdala (BLA) can enhance declarative memory, likely by modulating hippocampal-dependent memory consolidation…” and “…and motivate future neuromodulatory therapies that aim to recapitulate specific patterns of activity implicated in cognition and memory.”

      Changed Figure 4 to Figure 5

      Created Figure 5C (Interaction between behavioral effects and neuronal modulation)

      Caption: (C) Change in recognition memory performance was split into two categories using a d’ difference threshold of ± 0.5: responder (positive or negative; Δd’, pink) and non-responder (~d’, grey). Individual d’ scores are shown (left) with points colored by outcome category; dotted lines demarcate category boundaries, and the grey-shaded region represents negligible change. The number of sessions within each outcome category (middle) and the proportion of modulated units as a function of outcome category, separated by region (right). NS = not significant.

      The description of the behavioral results has been updated. Lines 394-403: “At the level of individual sessions, we observed enhanced memory (Δd’ > +0.5) in 36.7%, impaired memory (Δd’ < -0.5) in 20.0%, and negligible change (-0.5 ≤ Δd’ ≤ 0.5) in 43.3% when comparing performance between the stim and no-stim conditions; a threshold of Δd’ ± 0.5 was chosen for this classification based on the defined range of a “medium effect” for Cohen’s d. To test our hypothesis that neuronal modulation would be associated with changes in memory performance, we combined the sessions that resulted in either memory enhancement or impairment and contrasted the proportion of modulated units across regions sampled. We did not, however, observe a meaningful difference in the proportion of modulated units when grouped by behavioral outcome (all contrasts p > 0.05) (Figure 5C).”

      Lines 213-214 and 394-397 have been edited to reflect a change in the d’ threshold used for categorizing behavioral results (from Δd’ ± 0.2 to Δd’ ± 0.5).
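
      The session-level classification can be sketched as follows. The example assumes the standard signal-detection definition d’ = z(hit rate) − z(false-alarm rate), which the text above does not spell out, and the function names and toy rates are hypothetical. Only the ± 0.5 threshold and the three outcome categories are taken from the response.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Standard sensitivity index: z(hit rate) - z(false-alarm rate).

    Rates of exactly 0 or 1 have no finite z-score and would need a
    correction (not shown) before calling this function.
    """
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

def classify_session(delta_d, threshold=0.5):
    """Label a session by its stim minus no-stim d' difference."""
    if delta_d > threshold:
        return "enhanced"
    if delta_d < -threshold:
        return "impaired"
    return "negligible"  # includes the boundaries, i.e. -0.5 <= delta <= 0.5

# Hypothetical (hit rate, false-alarm rate) pairs per session:
# first the stimulation condition, then the no-stimulation condition.
sessions = [((0.90, 0.20), (0.75, 0.25)),
            ((0.70, 0.30), (0.85, 0.15)),
            ((0.80, 0.20), (0.80, 0.22))]

for stim, no_stim in sessions:
    delta = d_prime(*stim) - d_prime(*no_stim)
    print(f"delta d' = {delta:+.2f} -> {classify_session(delta)}")
```

      Pooling the "enhanced" and "impaired" labels into a single responder group, as done for Figure 5C, then reduces the comparison to responder versus non-responder sessions.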

      Comment 3: “It is not clear to me why the assessment of firing rates after image onset and after stim offset is limited to one second - this choice should be more theoretically justified, particularly for regions that spike as sparsely as these.”

      We thank the reviewer for this question and acknowledge that no clear justification was provided for this decision in the manuscript. We limited each of the analysis epochs to 1 s for two reasons. First, the maximum possible length of the during-stimulation epoch was 1 s (stim on for 1 s). Although the pre- and post-stimulation epochs could be extended without issue, we were concerned that variable time windows could introduce a bias, for instance, resulting in different variances between epochs. Second, we anticipated, both from empirical observations and prior literature, that the neural response following stimulation or task features (e.g., image onset/offset) was likely to be transient, rather than sustained for a period of many seconds. By keeping the windows short, we ensured that our approach to detecting modulation (i.e., contrasting trial-wise spike counts between each pair of epochs) captured the intended effect rather than random noise. We have incorporated a discussion of this rationale in the Peri-Stimulation Modulation Analyses section.

      Change to Text:

      Lines 156-158 have been added: “Each epoch was constrained to 1 s to ensure that subsequent firing rate contrasts were unbiased and to capture potential transient effects (e.g., image onset/offset).”

      Comment 4: “This work coincides with another example of human intracranial stimulation investigating the effect on firing rates (doi: https://doi.org/10.1101/2024.11.28.625915). Given how incredibly rare this type of work is, I think the authors should discuss how their work converges with this work (or doesn't).”

      Thank you for bringing this highly relevant work to our attention. We were unaware of this recent preprint and have incorporated a discussion of its main findings into the manuscript.

      Change to Text:

      New citations: van der Plas et al. 2024 (bioRxiv), Cowan et al. 2024 (bioRxiv)

      The discussion of related studies has been updated. Lines 447-457: “Few studies, however, have characterized the impact of electrical stimulation via macroelectrodes on the spiking activity of human cortical neurons, none of which involve intracranial theta burst stimulation. One study reported a long-lasting reduction in neural excitability among parietal neurons, with variable onset time and recovery following continuous transcranial TBS in non-human primates (Romero et al., 2022). In a similar vein, it was recently shown that human neurons are largely suppressed by single-pulse electrical stimulation (Cowan et al., 2024; van der Plas et al., 2024). Other emerging evidence suggests that transcranial alternating current stimulation may entrain the rhythm rather than rate of neuronal spiking (Krause et al., 2019) and that stimulation-evoked modulation of spiking may meaningfully impact behavioral performance on cognitive tasks (Fehring et al., 2024).”

      Comment 5: “What information does the pseudo-population analysis add? It's not totally clear to me.”

      We recognize the need to further contextualize the motivation for the exploratory pseudo-population analysis and appreciate the reviewer for bringing the lack of detail to our attention. In brief, the analysis allowed us to observe trends in activity across populations of neurons, which, in principle, are not visible by characterizing modulation solely in discrete neurons. Additional details have been incorporated into the manuscript, as suggested.

      Change to Text:

      Additional justification has been incorporated in the description of the methodology. Lines 185-187: “…This approach enables the identification of dominant patterns of coordinated neural activity that may not be apparent when examining individual neurons in isolation.”, lines 192-194: “…By collapsing across subjects into a common pseudo-population, this analysis provides a mesoscale view of how stimulation modulates shared activity patterns across anatomically distributed neural populations.”

      A summary interpretation has been added to the paragraph describing the results. Lines 326-328: “Taken together, these analyses reveal global structure in the state space of responses to BLA stimulation within hippocampal circuits.”

      Reviewer #2 (Public review):

      Comment 1: “Authors suggest that the units modulated by stimulation are largely distinct from those responsive to image offset during trials without stimulation. The subpopulation that responds strongly also tends to have a higher baseline of firing rate. It's important to add that the chosen modulation index is more likely to be significant in neurons with higher firing rates.”

      This is an important point that was not previously addressed in our manuscript. We suspect there are likely two factors at play worth considering with respect to our chosen nonparametric modulation index: neurons with lower activity require smaller changes in spike counts to be significantly modulated (easier to flip ranks), and neurons with higher activity empirically exhibit greater absolute shifts in the number of spikes. Our further use of permutation testing, while mitigating false positives, may also somewhat constrain the ability to detect modulation in sparsely active neurons. Nonetheless, given that many trials entailed few or no spikes, we believe this approach is preferable to alternatives that may be more susceptible to noise (e.g., percent change in trial-averaged firing rate from baseline).

      To better understand the tradeoffs with detection probability, we performed a sensitivity analysis. We generated synthetic data with different baseline firing rates (0.1-5.0 Hz) and effect sizes (± 0.1-0.7 Hz) and simulated the likelihood of detection with our given modulation index across neurons. The results of the simulation support the notion that the probability of detecting modulation is lower for sparsely active neurons (Figure S7C). Further discussion of this consideration for the chosen modulation index, as well as details regarding the sensitivity analysis, have been incorporated into the manuscript.

      Change to Text:

      Created Figure S7C (Detection probability analysis)

      Caption: The same permutation-based analyses reported in the manuscript were repeated under different control conditions… (C) Visualization of the predicted probability of detecting modulation across synthetic neurons with variable firing rates and modulation effect sizes; FR = firing rate.

      Lines 223-224 have been added to the Methods section titled “Firing Rate Control Analyses”: “We performed a series of control analyses to test whether our approach to firing rate detection was robust…”

      A description of the simulation has been incorporated into the same section as above. Lines 234-237: “Finally, to better understand the tradeoffs with our statistical approach, we generated synthetic data with different baseline firing rates (0.1-5.0 Hz) and effect sizes (± 0.1-0.7 Hz), then simulated the likelihood of detecting modulation across variable conditions (Figure S7C).”

      The description of the results from the control analyses has been updated. Lines 330-339: “Finally, we performed three supplementary analyses to evaluate the robustness of our approach to detecting firing rate modulation: a sensitivity analysis assessing the proportion of modulated units at different firing rate thresholds for inclusion/exclusion, a data dropout analysis designed to control for the possibility that non-physiological stimulation artifacts may preclude the detection of temporally adjacent spiking, and a synthetic detection probability analysis. These results recapitulate our observation that units with higher baseline firing are most likely to exhibit modulation (though the probability of detecting modulation is lower for sparsely active neurons) and suggest that suppression in firing rate is not solely attributable to amplifier saturation following stimulation (Figure S7).”

      Comment 2: “Readers can benefit from understanding with more details the locations chosen for stimulation - in light of previous studies that found differences between effects based on proximity to white matter (For example - PMID 32446925, Mohan et al, Brain Stimul. 2020 and PMID 33279717 Mankin et al Brain Stimul. 2021).”

      This has been addressed in the above response to Reviewer 1’s comment 1.1e.

      Change to Text:

      See changes related to Reviewer 1 comment 1.1e.

      Comment 3: “Missing information in the manuscript…”

      3a: “Images of stimulation anatomical locations for all subjects included in this study. Ideally information about the impedance of the contacts to be able to calculate the actual current used.”

      As requested, we have provided an image from the coronal T1 MRI sequence, which highlights the position of the stimulated contacts for each of the 16 patients. Though we did not measure the impedances directly, the stimulation was current-controlled, which ensured that the desired current and charge density were consistent regardless of the tissue or electrode impedance.

      Change to Text:

      Created Figure S1 (Anatomical location of stimulated electrodes).

      Caption: A coronal slice from the T1-weighted MRI scan is shown for each patient who participated in the study (n = 16). Electrode contacts within the same plane of the image are shown with blue circles, and the bipolar pair of stimulated contacts within the basolateral amygdala is highlighted in red.

      Lines 144-145 have been edited to reflect that the delivered stimulation was current-controlled: “Specifically, we administered current-controlled, charge-balanced, …”

      3b: “The studied population is epilepsy patients, and the manuscript lacks description of their condition, proximity to electrodes included in the study to pathological areas, and the number of units from each patient/hemisphere.”

      We agree that additional information regarding patient demographics, experimental details, and clinical characteristics would further contextualize this unique patient population. A new table has been included, which contains the following information: patient ID, sex, age, # experimental session, # SEEG leads (and # microelectrodes), # detected units (L vs. R hemisphere), and suspected seizure onset zone.

      Change to Text:

      Created Table S1 (Patient demographics and clinical characteristics).

      Lines 258-259 have been added: “…(see Table S1 for patient demographics).”

      3c: “I haven't seen any comments on code availability (calculating modulation indices and statistics) and data sharing.”

      For clarification, a section titled Resource Availability is already appended to the end of the manuscript following the Conclusion, which describes the data and code availability.

      Change to Text:

      None

      3d: “Small comment - Figure legend 3E - Define gray markers (non-modulated units?)”

      Thank you for highlighting this omission. We have updated the relevant figure caption.

      Change to Text:

      The following has been added to the Figure 3 caption: “…whereas units without a significant change in activity are shown in grey.”

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Amaral et al. present a study investigating the mesoscale modelling and dynamics of bolalipids.

      Strengths:

      The figures in this paper are exceptional. Both those to outline and introduce the lipid types, but also the quality and resolution of the plots. The data held within also appears to be outstanding and of significant (hopefully) general interest.

      We thank the reviewer for their kind words and the appreciation of our work.

      Weaknesses:

      In the introduction, I would like to have read more specifics on the biological role of bolalipids. Archaea are mentioned, but this kingdom is huge - there must be specific species that can be discussed where bolalipids are integral to archaeal life. The authors should go beyond ‘extremophiles’. In short, they should unpack why the general audience should be interested in these lipids, within a subset of organisms that are often forgotten about.

      Following the reviewer’s advice, we have revised the introduction of the manuscript, in which we now discuss specific species (Sulfolobus acidocaldarius and Thermococcus kodakarensis) and how bolalipids are integral to archaeal life in these species. We explain that the ratio between bilayer lipids and bolalipids, and the number of cyclopentane rings contained within bolalipids, can change to adapt to the environment. The revised parts of the introduction read (p. 1):

      “Like for bacteria and eukaryotes, archaea must keep their lipid membranes in a fluid state (homeoviscous adaptation). This is important even under extreme environmental conditions, such as hot and cold temperatures, or high and low pH values [7]. Because of this, many archaea adapt to changes in their environment by tuning the lipid composition of their membranes: altering the ratio between bola- and bilayer lipids in their membranes [8, 9] and/or by changing the number of cyclopentane rings in their lipid tails, which are believed to make lipid molecules more rigid [5]. For example, Thermococcus kodakarensis increases its tetraether bolalipid ratio from around 50% to over 80% when the temperature of the environment increases from 60 to 85 °C [10]. Along the same lines, the cell membrane of Sulfolobus acidocaldarius can contain over 90% of bolalipids with up to 8 cyclopentane rings at 70 °C and pH 2.5 [5, 11]. It is worth mentioning that in exceptional cases bacteria also synthesise bolalipids in response to high temperatures [12], highlighting that the study of bolalipid membranes is relevant not only for archaeal biology but also from a general membrane biophysics perspective.”

      Reviewer #2 (Public review):

      Summary:

      The authors aimed to understand the biophysical properties of archaeal membranes made of bolalipids. Bacterial and eukaryotic membranes are made of lipids that self-assemble into bilayers. Archaea, instead, use bolalipids, lipids that have two headgroups and can span the entire bilayer. The authors wanted to determine if the unique characteristics of archaea, which are often extremophiles, are in part due to the fact that their membranes contain bolalipids.

      The authors develop a minimal computational model to compare the biophysics of bilayers made of lipids, bolalipids, and mixtures of the two. Their model enables them to determine essential parameters such as bilayer phase diagrams, mechanical moduli, and the bilayer behaviour upon cargo inclusion and remodelling.

      The authors demonstrate that bolalipid bilayers behave as binary mixtures, containing bolalipids organized either in a straight conformation, spanning the entire bilayer, or in a U-shaped one, confined to a single leaflet. This dynamic mixture allows bolalipid bilayers to be very sturdy but also provides remodelling. However, remodelling is energetically more expensive than with standard lipids. The authors speculate that this might be why lipids were more abundant in the evolutionary process.

      Strengths:

      This is a wonderful paper, a very fine piece of scholarship. It is interesting from the point of view of biology, biophysics, and material science. The authors mastered the modelling and analysis of these complex systems. The evidence for their findings is really strong and complete. The paper is written superbly, the language is precise and the reading experience is very pleasant. The plots are very well-thought-out.

      Weaknesses:

      I would not talk about weaknesses, because this is really a nice paper. If I really had to find one, I would have liked to see some clear predictions of the model expressed in such a way that experimentalists could design validation experiments.

      We thank the reviewer for their very kind assessment. We incorporated their recommendations regarding experimental validation in the discussion section, as follows (p.14):

      “Our model makes a number of predictions that could be tested by experiment either in cells or in vitro. First, it predicts that a small increase in the fraction of archaeal bilayer lipids should be sufficient to soften a bolalipid-rich membrane. While this could be tested in the future, so far only very few studies have reported experimental analysis of archaeal membrane mixtures [18, 50]. Second, we observed that membranes with moderate bolalipid molecular rigidity k<sub>bola</sub> exhibit curvature-dependent bending rigidity. To experimentally verify this, one could extrude membrane tethers from cells while controlling for membrane tension. Finally, to get to the core mechanism underlying our findings, it will be important to develop experimental methods that will allow the fraction of U-shaped bolalipid conformers per leaflet to be imaged and measured.”

      Reviewer #3 (Public review):

      Summary:

      The authors have studied the mechanics of bolalipid and archaeal mixed-lipid membranes via comprehensive molecular dynamics simulations. The Cooke-Deserno 3-bead-per-lipid model is extended to bolalipids with 6 beads. Phase diagrams, bending rigidity, mechanical stability of curved membranes, and cargo uptake are studied. Effects such as the formation of U-shaped bolalipids, pore formation in highly curved regions, and changes in membrane rigidity are studied and discussed. The main aim has been to show how the mixture of bolalipids and regular bilayer lipids in archaeal membrane models enhances the fluidity and stability of these membranes.

      Strengths:

      The authors have presented a wide range of simulation results for different membrane conditions and conformations. For the most part, the analyses and their results are presented clearly and concisely. Figures, supplementary information, and movies very well present what has been studied. The manuscript is well-written and is easy to follow.

      We thank the reviewer for the detailed assessment of our work and their constructive feedback.

      Major issues

      R3.Q1: The Cooke-Deserno model, while very powerful for biophysical analysis of membranes at the mesoscale, is very much void of chemical information. It is parametrized such that it is good in producing fluid membranes and predicting values for bending rigidity, compressibility, and even thermal expansion coefficient falling in the accepted range of values for bilayer membranes. But it still represents a generic membrane. Now, the authors have suggested a similar model for the archaeal bolalipids, which have chemically different lipids (the presence of cyclopentane rings for one), and there is no good justification for using the same pairwise interactions between their representative beads in the coarse-grained model. This does not necessarily diminish the worth of all the authors’ analyses. What is at risk here is the confusion between “what we observe this model of bolalipid or mixed membranes do” and “how real bolalipid-containing archaeal membranes behave at these mechanical and thermal conditions.”

      As the reviewer correctly notes, Cooke and Deserno used a minimal model, devoid of chemical detail, to represent fluid lipid membranes composed of bilayer lipids. Indeed archaeal lipids are chemically different compared to non-archaeal lipids, but just like non-archaeal lipids, they can be very different from one another. Given the chemical diversity of bolalipids between each other, instead of representing their complexity in a complicated model with many experimentally unconstrained parameters, we here defined a minimal model for bolalipids. The power of this minimal model is to represent the key physical/geometrical characteristics of archaeal membranes, namely the fact that lipid heads on two sides of the membrane are often connected, that bolalipids can exhibit a conformational change, and that bolalipids mix with some percentage of bilayer molecules. We then ask a general question: how do these unique geometrical characteristics of archaeal membranes influence their mechanics and reshaping? The reviewer is however right in pointing out that a model, regardless of its level of details (atomistic, coarse-grained, minimal), is still a model.

      Our approach of extending an established coarse-grained model for bilayer lipids to bolalipids is further supported by experimental observations, which report that archaeal bilayer lipids can form membranes of comparable bending rigidity to those of non-archaeal bilayer membranes [53]. Hence, different lipid linkages (archaeal vs. non-archaeal) give rise to fluid, deformable membranes of not too dissimilar rigidities, suggesting that both archaeal and non-archaeal bilayer lipids can be represented by a similar minimal coarse-grained model for the purpose of mesoscopic biophysical investigations. Since archaeal bolalipids have the same core chemical structure as two archaeal bilayer lipids joined by their tail ends, similarly we model a bolalipid by joining two bilayer lipids. Such an approach also efficiently enables us to compare bolalipid with bilayer membranes, and connect to the large body of knowledge on the physics of bilayer membranes.

      To conclude, our coarse-grained model is indeed intended to capture the main physical properties of bolalipid membranes, and not their chemical diversity.

      R3.Q2: Another more specific, major issue has to do with using the Hamm-Kozlov model for fitting the power spectrum of thermal undulations. The 1/q<sup>2</sup> term can very well be attributed to membrane tension. While a barostat is indeed used, have the authors made absolutely sure that the deviation from 1/q<sup>4</sup> behaviour does not correspond to lateral tension?

      To the casual observer, any 1/q<sup>2</sup> trend might point at membrane tension. However, the precise functional form is relevant as it determines whether the 1/q<sup>2</sup> dominates the 1/q<sup>4</sup> trend for small or large values of the wave number q in the fitted power spectrum.

      The first model (including lipid tilt) exhibits the functional form 1/(κq<sup>4</sup>) + 1/(κ<sub>θ</sub>q<sup>2</sup>). In contrast, the second model (including membrane tension) exhibits the functional form 1/(κq<sup>4</sup> + Σq<sup>2</sup>). Importantly, the two models obey different functional forms. Here κ and κ<sub>θ</sub> are the bending and tilt moduli, which are assumed positive, and Σ is the membrane tension, which can be either positive or negative. For the first model (with tilt), the amplitude is proportional to q<sup>-4</sup> for small q and to q<sup>-2</sup> for large q. In contrast, for the second model (with positive tension), the amplitude is proportional to q<sup>-2</sup> for small q and to q<sup>-4</sup> for large q. If the membrane tension were negative in the second model, the slope would cross from negative infinity for small q to -4 for large q. The functional dependencies are summarized in Author response image 1A.
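
      The opposite ordering of the two regimes can be checked numerically. The following is a small sketch, not the authors' analysis code: it takes the tilt term to carry the tilt modulus, drops all common prefactors, and uses arbitrary illustrative values for the moduli and the tension. The function names are hypothetical.

```python
import math


def tilt_spectrum(q, kappa, kappa_theta):
    """Tilt model: 1/(kappa q^4) + 1/(kappa_theta q^2), prefactors dropped."""
    return 1.0 / (kappa * q**4) + 1.0 / (kappa_theta * q**2)


def tension_spectrum(q, kappa, sigma):
    """Tension model: 1/(kappa q^4 + sigma q^2), prefactors dropped."""
    return 1.0 / (kappa * q**4 + sigma * q**2)


def loglog_slope(f, q, rel_step=1e-4):
    """Central-difference estimate of d ln f / d ln q at wave number q."""
    hi, lo = q * (1 + rel_step), q * (1 - rel_step)
    return (math.log(f(hi)) - math.log(f(lo))) / (math.log(hi) - math.log(lo))


if __name__ == "__main__":
    # Illustrative parameters only (arbitrary units).
    kappa, kappa_theta, sigma = 10.0, 10.0, 1.0
    for q in (1e-2, 1.0, 1e2):
        s_tilt = loglog_slope(lambda x: tilt_spectrum(x, kappa, kappa_theta), q)
        s_tens = loglog_slope(lambda x: tension_spectrum(x, kappa, sigma), q)
        print(f"q={q:g}: tilt slope={s_tilt:.3f}, tension slope={s_tens:.3f}")
```

      With positive parameters, the tilt model steepens towards small q while the tension model steepens towards large q, so the direction in which the slope changes is what discriminates between the two.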

      For rigid bolalipid membranes, it is clearly visible that the magnitude of the log-log slope of the power spectrum decreases with increasing wave number q (Author response image 1B). While the slope is initially close to -4, it gradually approaches -2 for larger values of q. We conclude that only the model including lipid tilt can fit the power spectrum of membrane fluctuations appropriately (solid-dashed line), whereas the model with tension fails to fit the data (dashed line). We note that the combined model containing both lipid tilt and membrane tension does not give a better fit (dotted line).
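
      One simple way to see why the tilt form is easy to fit: multiplying 1/(κq<sup>4</sup>) + 1/(κ<sub>θ</sub>q<sup>2</sup>) by q<sup>4</sup> gives a straight line in q<sup>2</sup>, with intercept 1/κ and slope 1/κ<sub>θ</sub>. The sketch below illustrates this with an ordinary least-squares line; it is not the fitting procedure used in the manuscript, it assumes the spectrum has been normalized so that thermal and area prefactors are absorbed, and the function name is hypothetical.

```python
def fit_tilt_model(qs, spectrum):
    """Fit q^4 * S(q) = 1/kappa + q^2 / kappa_theta by least squares.

    Returns (kappa, kappa_theta). Assumes S(q) is normalized so that the
    tilt model reads S = 1/(kappa q^4) + 1/(kappa_theta q^2).
    """
    xs = [q**2 for q in qs]
    ys = [q**4 * s for q, s in zip(qs, spectrum)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return 1.0 / intercept, 1.0 / slope


if __name__ == "__main__":
    # Noise-free synthetic spectrum with illustrative moduli.
    qs = [0.1 * i for i in range(1, 21)]
    data = [1.0 / (12.0 * q**4) + 1.0 / (6.0 * q**2) for q in qs]
    print(fit_tilt_model(qs, data))
```

      Under the tension model the same rescaled quantity, q<sup>4</sup>/(κq<sup>4</sup> + Σq<sup>2</sup>), is not linear in q<sup>2</sup>, so systematic curvature in this plot would point away from a pure tilt description.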

      To demonstrate that the tension model cannot fit the data, we included the best fits for both models for rigid bolalipid membranes in the new SI section 16 (p. S22) and show that only the tilt model leads to acceptable fits. We also measured the projected membrane tension −L<sub>z</sub>(P<sub>x</sub> + P<sub>y</sub>)/2, where P<sub>x</sub> and P<sub>y</sub> are the pressures in the x and y directions, respectively, and L<sub>z</sub> is the dimension of the simulation box along the z axis. We found the projected membrane tension to be negligible, similar to the value that we indirectly measured by fitting a combined model with both tension and tilt, further confirming our conjecture.

      Author response image 1.

      (A) Schematic showing the decay of the power spectrum as a function of the wave number q in the tilt model (top), in the tension model with positive membrane tension (middle), and in the tension model with negative membrane tension (bottom). (B) Fitted power spectrum as a function of q for rigid bolalipid membranes (k<sub>bola</sub>=5k<sub>B</sub>T). The fit shows that while the model with tension (dashed line) cannot fit the data, the model with tilt nicely fits the spectrum (solid-dashed line). The combined model including both tension and tilt does not fit the spectrum any better (dotted line).

      R3.Q3: I got more worried when I noticed in the SI that the simulations had been done with combined “fix langevin” and “fix nph” LAMMPS commands. This combination does not result in a proper isothermal-isobaric ensemble. The importance of tilt terms for bolalipids is indeed very interesting, but I believe more care is needed to establish that.

      In what follows, we show that there is no reason to worry. First of all, we want to clarify that the physical setup we simulate is that of a membrane contained in a heat bath under negligible tension with correct diffusional dynamics. To achieve this physical setup, we use a Langevin thermostat combined with pressure control via an overdamped barostat, which we implement in LAMMPS by combining “fix langevin” and “fix nph”.

      In more detail: we simulated particles in an implicit solvent, for which we use a Langevin thermostat to get the right diffusional dynamics. To apply the theory of fitting fluctuation spectrums, the simulation box length needs to be (near) constant. However, simulating membranes at a fixed box size results in an average non-zero membrane tension, making it hard to measure bending rigidity. The reason is that membrane tension most strongly affects the largest wavelength modes, which are also the most decisive when determining mechanical membrane properties like membrane rigidity. To minimize the effect of tension, we perform our simulation with an overdamped barostat (𝜏<sub>baro</sub> = 10𝜏<sub>langevin</sub>), which keeps the membrane near tensionless, as also done before [32]. In the revised manuscript, we have clarified the statement on the physical ensemble used (p. S2):

      “For simulating flat membrane patches of bolalipids, we combined the previously used Langevin thermostat with relaxation time of 1𝜏 with a Nosé–Hoover barostat with relaxation time of 10𝜏. In LAMMPS this amounts to combining the commands ’fix langevin’ with ’fix nph’. We configured the barostat to set lateral pressure P<sub>xy</sub> to zero by re-scaling the simulation box in the x-y plane. We compare this setup to a fixed box length setup, and an NPT ensemble setup, in SI section 17.”

      To connect our results with statistical mechanics ensemble theory, we tested alternative setups. Similar setups, including the formal isothermal-isobaric ensemble that the reviewer refers to (in which N, P, and T are kept constant using Nosé–Hoover-style equations for thermostatting and barostatting with modern corrections [34]), result in very similar fluctuation spectrums. Consequently, our measurements of bending and tilt modulus hold true regardless of the integration scheme. However, such a setup does not correctly capture implicit solvent and diffusional dynamics.

      In even more detail: we tested our setup (implemented via “fix langevin” + “fix nph”) against an isothermal-isobaric ensemble (implemented via “fix npt”). We measured the mean and standard deviation of the volume, and found them to match for a reference LJ gas.

      To be completely sure, and to please the reviewer, we have performed additional verifications in the new SI section 17, which we summarize in the following. We simulated three representative membranes with different integration schemes: “fix npt”, “fix langevin” + “fix nph”, and “fix langevin” (Langevin dynamics with projected area fixed at the average value obtained from a “langevin+nph” run). We checked that the “fix nph” barostat is merely equilibrating the membrane to a tensionless configuration, after which the projected membrane area (A<sub>p</sub> = L<sub>x</sub>L<sub>y</sub>) is practically constant. Consequently, the different schemes resulted in minor changes in the longest wavelength modes that we tracked down to small changes in the negligible tension. The resulting measurements of bending modulus change by less than 10%, and our main text conclusions do not change. Author response image 2 compares the fluctuation spectrums for the different integration schemes.

      Author response image 2.

      Height fluctuation spectrum, for a bilayer membrane at T<sub>eff</sub> = 1.1, simulated with Langevin dynamics (pink, ‘langevin’), our setup (purple, ‘nph+langevin’), and under an isothermal-isobaric ensemble (blue, ‘npt’); fits are shown as dotted lines.

      R3.Q4: This issue is reinforced when considering Figure 3B. These results suggest that increasing the fraction of regular lipids increases the tilt modulus, with the maximum value achieved for a normal Cooke-Deserno bilayer void of bolalipids. But this is contradictory. For these bilayers, we don’t need the tilt modulus in the first place.

      We understand the concern why this might be counter-intuitive, and we thank the reviewer for pointing it out. We first want to stress that the tilt modulus can also be measured for bilayer membranes even if it is not needed to fit the fluctuation spectrum. If we measure the tilt modulus for a bilayer membrane, we obtain a value similar to the previously measured one [36]. Importantly, here we also report measurements for the tilt modulus for bolalipid membranes.

      To understand the seemingly contradictory behaviour of the tilt modulus, it is insightful to rewrite the expression for the fluctuation spectrum as done in Eq. (1). Up to a common prefactor, the spectrum reads:

      1/(𝜅q<sup>4</sup>) + 1/(𝜅<sub>𝜃</sub>q<sup>2</sup>) = [1 + (l<sub>𝜃</sub>q)<sup>2</sup>]/(𝜅q<sup>4</sup>),

      where l<sub>𝜃</sub> = √(𝜅/𝜅<sub>𝜃</sub>) is a characteristic length scale related to tilt, which we call the tilt persistence length. From the last equation it is easy to see that the tilt modulus 𝜅<sub>𝜃</sub> becomes relevant for the fluctuation spectrum if the tilt persistence length l<sub>𝜃</sub> is not negligible. In other words, we have to consider the tilt modulus 𝜅<sub>𝜃</sub> as relevant if it is sufficiently small compared to the bending rigidity 𝜅.

      However, this is not only counter-intuitive, but also difficult to communicate graphically. Per the reviewer’s excellent suggestion, to make the interpretation more accessible, we converted the tilt modulus in the main text and its figures to the more directly interpretable tilt persistence length l<sub>𝜃</sub>, which is small when tilt is irrelevant (for bilayer lipids and flexible bolalipids) and large otherwise (for rigid bolalipids). This includes changes to the main text on p. 6 and p. 8, and to the insets in Figs. 2C and 3B. We note that for completeness we also report the tilt modulus 𝜅<sub>𝜃</sub> in the SI.

      R3.Q5: Also, from the SI, I gathered that the authors have neglected the longest wavelength mode because it is not equilibrated. If this is indeed the case, it is a dangerous thing to do, because with a small membrane patch, this mode can very well change the general trend of the power spectrum. As a lot of other analyses in the manuscript rely on these measurements, I believe more elaboration is in order.

      We thank the reviewer for the careful examination of our supplementary material. For each fluctuation spectrum measurement, we ran multiple replicas. We observed that the largest wavelength modes were not fully equilibrated. In the simulations the first mode of the fluctuation spectrum is probed at different amplitudes and phases. We thus expected the potential systematic error would show up clearly when comparing spectrums of the different replicas. As we saw no correlation in these systematic offsets between replicas, we concluded that the simulations are sufficiently equilibrated and we could safely exclude the first mode of the fluctuation spectrum from our analysis.

      To show that this procedure does not bias our results, we also ran simulations for three representative membranes until all modes were equilibrated. On the modes previously equilibrated, the resulting spectrums agree with our previous shorter simulations. On the largest wavelength modes that were previously not fully equilibrated, we noticed a small deviation from theory, specifically for flexible membranes (small bending modulus). These small deviations can be explained by including a negligible negative tension. Importantly, however, the resulting bending modulus 𝜅 stays nearly the same. We note that the small negative tension disappears when we halve the timestep (see Author response image 3). This verification is shown in SI section 17.

      R3.Q6: The authors have found that “there is a strong dependency of the bending rigidity on the membrane mean curvature of stiffer bolalipids.” The effect is negative, with the membrane becoming less stiff at higher mean curvatures. Why is that? I would assume that with more flexible bolalipids, the possibility of reorganization into U-shaped chains should affect the bending rigidity more (as Figure 2E suggests). While for a stiff bolalipid, not much would change if you increase the mean curvature. This should be either a tilt effect, or have to do with asymmetry between the leaflets. But on the other hand, the tilt modulus is shown to decrease with increasing bolalipid rigidity. The authors get back to this issue only on page 10, when they consider U-shaped lipids in the inner and outer leaflets and write, “this suggested that an additional membrane-curving mechanism must be involved.” But then again, in the Discussion, the authors write, “It is striking that membranes made from stiffer bolalipids showed a curvature-dependent bending modulus, which is a clear signature that bolalipid membranes exhibit plastic behaviour during membrane reshaping,” adding to the confusion.

      Author response image 3.

      Height fluctuation spectrum, for a bilayer membrane at T<sub>eff</sub> = 1.1, as simulated in the main text (grey, for 60×10<sup>3</sup>τ), for longer duration (1.44×10<sup>6</sup>τ) (pink), and with the longer duration and halved timestep = 0.005τ (purple); fits are shown as dotted lines (tension and tilt) or dash-dot lines (tilt only).

      We thank the reviewer for asking this important question. Membrane bending rigidity in bolalipid membranes decreases dramatically once a small fraction of U-shapes is allowed to form, but then plateaus once this U-shape fraction reaches 20%. In a curved bolalipid membrane, U-shapes must accumulate in the outer leaflet to accommodate the area difference. Together, the non-linear dependence of the bending rigidity on U-shape fraction and the promotion of U-shapes by curvature explain why, in a membrane made of moderately stiff bolalipids (k<sub>bola</sub> = 1k<sub>B</sub>T), which contains very few U-shapes in the flat state, the bending rigidity of the membrane decreases as curvature increases. By contrast, in a membrane made of flexible bolalipid molecules (k<sub>bola</sub> = 0), where many U-shapes are present in the flat membrane, the bending rigidity does not change with curvature.

      Bending rigidity 𝜅 in flat membranes composed of bolalipids decreases dramatically once a small fraction of U-shapes is allowed to form, but plateaus once more than 20% of U-shaped bolalipids are present. In detail, our data show that with increasing bolalipid molecular rigidity k<sub>bola</sub>, the number of U-shaped bolalipids decreases (Fig. 2B) and the membrane rigidity 𝜅 increases (Fig. 2C). Thus, the correlation suggests that U-shaped bolalipids soften the membrane in a non-linear way, where most of the change in membrane bending rigidity happens for a U-shaped bolalipid fraction < 20% (Figure S11).

      Separately, membrane curvature affects the area difference between curved membrane leaflets and thus drives U-shape accumulation. To be specific, a cylindrical membrane with area A, mean curvature H and thickness h has the outer leaflet with area A(1 + Hh) and the inner leaflet with smaller area A(1 − Hh). This difference can be large, in our simulations up to an area change of Hh = 25%. For pure bolalipid membranes, straight bolalipids occupy the same space in each leaflet. Area difference can then be achieved only by having a different amount of U-shaped bolalipids in each leaflet, which can result in a different U-shape fraction between leaflets and thus ‘asymmetry between leaflets’. Figure S10 confirms U-shape head fraction asymmetry that increases with curvature, for both flexible (k<sub>bola</sub> = 0) and moderately stiff bolalipids (k<sub>bola</sub> = 1k<sub>B</sub>T).
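
      As a worked illustration of the leaflet area formulas above, with the outer leaflet at A(1 + Hh) and the inner leaflet at A(1 − Hh), the short sketch below evaluates both areas at Hh = 25%. The function name and the particular curvature and thickness values are hypothetical, chosen only so that their product equals 0.25.

```python
def leaflet_areas(area, curvature, thickness):
    """Return (outer, inner) leaflet areas A(1 + Hh) and A(1 - Hh)."""
    hh = curvature * thickness
    return area * (1.0 + hh), area * (1.0 - hh)


# Illustrative values: H = 0.05 and h = 5 (arbitrary inverse-length / length
# units) give Hh = 0.25, the largest area change quoted in the response.
outer, inner = leaflet_areas(area=1.0, curvature=0.05, thickness=5.0)
ratio = outer / inner
print(outer, inner, ratio)
```

      At Hh = 25% the outer leaflet is 25% larger and the inner leaflet 25% smaller than the mid-plane area, so the outer leaflet has to cover 5/3 of the area of the inner one.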

      Together, these two effects result in membrane softening under curvature for the moderately stiff bolalipids, but constant rigidity for flexible bolalipids (Fig. 2F). In detail: for membranes composed of moderately stiff bolalipid molecules (k<sub>bola</sub> = 1k<sub>B</sub>T), the U-shape bolalipid head fraction only increases in the outer leaflet, going from 10 to 20% (Figure S10). This is in the high sensitivity region where the bending rigidity is expected to change the most (Figure S11). We hypothesize that the molecular rigidity of a U-shaped bolalipid creates compression on the outer leaflet that stabilizes the membrane curvature and thus causes membrane softening. We suspect that for membranes composed of rigid bolalipids (k<sub>bola</sub> > 1k<sub>B</sub>T), the effect is likely not present due to the absence of U-shape formation even under strong bending.

      By contrast, for membranes composed of flexible bolalipids (k<sub>bola</sub> = 0), the U-shaped bolalipid head fraction changes relatively little from its value for flat membranes (from 50% to 60% and 40% for the outer and inner leaflet, respectively; Figure S10). This is in the region where the membrane bending rigidity is expected to respond weakly to U-shape fraction (Figure S11). Additionally, the change is symmetric, so presumably the outer leaflet becomes softer as the inner leaflet becomes stiffer, thus creating opposing effects and only weakly affecting the membrane bending rigidity as a whole. We note that the distinction between the U-shape head fraction that we plot (Figure S10) and U-shape fraction (Figure S11) matters little for this analysis.

      We have added this deduction and its plots to SI section 8, and revised the corresponding statement in the main text accordingly (p.7).

      “Changing membrane curvature alters the area differently in the two membrane leaflets. To adapt to the area difference, we thus expect the fraction of U-shaped bolalipids to change as the membrane curvature changes. Moreover, the results of Fig. 2B and Fig. 2C showed that the U-shaped bolalipid fraction and the membrane bending rigidity are correlated. As a result, we predict that the fraction of straight versus U-shaped bolalipids in a membrane will change in response to membrane bending, in a way that makes the bending rigidity of a bolalipid membrane curvature dependent.”

      R3.Q7: This issue is repeated when the authors study nanoparticle uptake. They write: ”to reconcile these seemingly conflicting observations we reason that the bending rigidity, similar to Figure 2F, is not constant but softens upon increasing membrane curvature, due to dynamic change in the ratio between bolalipids in straight and U-shaped conformation. Hence, bolalipid membranes show stroking plastic behaviour as they soften during reshaping.” But the softening effect that they refer to, as shown in Figure 4B, occurs for very stiff bolalipids, for which not much switching to U-shaped conformation should occur.

      We thank the reviewer for locating a particularly dense sentence. We changed the text to explicitly refer to the range k<sub>bola</sub> ∈ [0,2] k<sub>B</sub>T for which there is significant change in U-shape fraction (p.8):

      “To reconcile these seemingly conflicting observations we reason that the bending rigidity κ, similar to Fig. 2F, is not constant but softens in the range k<sub>bola</sub> ∈ [0,2] k<sub>B</sub>T, upon increasing membrane curvature. This is due to the dynamic change in the ratio between bolalipids in straight and U-shaped conformation.”

      As for Fig. 4B, for k<sub>bola</sub> > 2k<sub>B</sub>T, pores form, which explains the plateau in adsorption energy.

      R3.Q8: Another major issue is with what the authors refer to as the ”effective temperature”. While plotting phase diagrams for kT/eps value is absolutely valid, I’m not a fan of calling this effective temperature. It is a dimensionless quantity that scales linearly with temperature, but is not a temperature. It is usually called a ”reduced temperature”. Then the authors refer to their findings as studying the stability of archaeal membranes at high temperatures. I have to disagree because eps is not the only potential parameter in the simulations (there are at least space exclusion and angle-bending stiffnesses) so one cannot identify changing eps with changing the global simulation temperature. This only works when you have one potential parameter, like an LJ gas.

      We indeed thought about this before and found that it makes little difference in our set-up. To thoroughly show that the distinction matters very little, per reviewer’s question, we computed our phase diagrams by scaling temperature T explicitly (and not lipid tail interactions T<sub>eff</sub> = k<sub>B</sub>T /ϵ<sub>p</sub>). We added these results to the SI section 14 and found no significant difference when comparing scaling tail interactions (Figure S15A) with scaling temperature explicitly (Figure S15B).

      We also computed Fig. 2A-C for scaling interactions (Figure S17A) and scaling temperature explicitly (Figure S17B). We found a slightly increased U-shaped bolalipid fraction for low k<sub>bola</sub> when comparing scaling interactions (Figure S17A) with temperature scaling (Figure S17B). The reason is that the U-shaped fraction depends on temperature, as bolalipids can transition into the U-shape more easily at higher temperature. Most importantly, however, we found no qualitative changes in the liquid region or the mechanical membrane properties when we compared the different scaling variants.

      The reason why both scaling variants match so well can be understood easily. All pair potentials, including volume exclusion interactions between head beads and other membrane beads, were also scaled in the same manner as tail-to-tail interactions, as described in the SI. In contrast, the energy scales for maintaining the lipid bonds, the bilayer lipid angles and the bolalipid angles are relatively large compared to the energy scales involved in tail-to-tail interactions. This separation of energy scales guarantees that there will be little effect when increasing global temperature. Regarding nomenclature, we take the reviewer’s advice and have added ’reduced temperature’ as an alias for T<sub>eff</sub> in the main text.
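
      The equivalence argued above can be illustrated with a minimal sketch. It assumes that a pair interaction enters the Boltzmann weight only through the ratio ϵ<sub>p</sub>/k<sub>B</sub>T, so weakening ϵ<sub>p</sub> by a factor s and raising T by the same factor give the same weight, whereas a bonded term only feels the change in T. All numbers below (well depth, bond stiffness, stretch, scaling factor) are hypothetical and serve only to show the bookkeeping; they are not model parameters from the SI.

```python
import math


def boltzmann_weight(energy, kT):
    """Unnormalised Boltzmann weight exp(-E / kT)."""
    return math.exp(-energy / kT)


u = -0.7            # hypothetical dimensionless pair-potential value
eps, kT = 1.0, 1.0  # pair-interaction strength and thermal energy
s = 1.3             # factor by which the reduced temperature is raised

# Variant 1: weaken the pair interaction, keep the thermostat fixed.
w_pair_scaled_eps = boltzmann_weight((eps / s) * u, kT)
# Variant 2: keep the pair interaction, raise the thermostat temperature.
w_pair_scaled_T = boltzmann_weight(eps * u, s * kT)

# A stiff harmonic bond (hypothetical stiffness and stretch) is untouched
# by variant 1 but sees the higher temperature in variant 2.
k_bond, stretch = 30.0, 0.1
e_bond = 0.5 * k_bond * stretch ** 2
w_bond_scaled_eps = boltzmann_weight(e_bond, kT)
w_bond_scaled_T = boltzmann_weight(e_bond, s * kT)

print(w_pair_scaled_eps, w_pair_scaled_T)
print(w_bond_scaled_eps, w_bond_scaled_T)
```

      The pair weights agree between the two variants, while the bonded weights differ. This difference is the point raised by the reviewer; the response argues that it has little practical effect because the bonded energy scales are large compared with the tail-to-tail interactions.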

      In the revised version of the manuscript, we mention these observations in the SI section 14 and point towards these results in the main text (p.4):

      “This interaction strength governs the membrane phase behaviour and can be interpreted as the effective temperature or reduced temperature T<sub>eff</sub> = k<sub>B</sub>T /ϵ<sub>p</sub>. As the distinction between scaling interactions (T<sub>eff</sub>) or temperature (T) is not important for our analysis (see Supplemental Information (SI) section 14), for simplicity we refer to T<sub>eff</sub> as temperature in the following.”

      Minor issues

      R3.Q9: As the authors have noted, the fact that the membrane curvature can change the ratio of U-shaped to straight bolalipids would render the curvature elasticity non-linear (though the term ”plastic” should not be used, as this is still structurally reversible when the stress is removed. Technically, it is hypoelastic behaviour, possibly with hysteresis.) With this in mind, when the authors use essentially linear elastic models for fluctuation analysis, they should make a comparison of maximum curvatures occurring in simulations with a range that causes significant changes in bolalipid conformational ratios.

      We thank the reviewer for their suggestion on calling the non-linear behaviour of the curvature elasticity hypoelastic. We have edited the main text accordingly (p.8 ):

      “In an elastic material, the strain modulus holds constant and deformation is reversible. For bolalipid membranes at k<sub>bola</sub> = 1k<sub>B</sub>T, however, the bending modulus decreases when deformation increases, rendering bolalipid membranes hypoelastic.”

      Moreover, regarding the maximum curvatures occurring in the fluctuation simulations: We first note that the ensemble average of the mean curvature H from the fluctuation measurements is indicated as a vertical line in Fig. 2F. As the average value is nearly zero, the membrane can be considered flat to a good approximation. To investigate the question in more detail, we extended the SI with a careful analysis of the maximum membrane curvature and the validity of the Monge gauge approximation (SI section 15).

      In short, we found that the involved membrane curvatures are small and therefore are unlikely to trigger any significant changes of the bending modulus. Moreover, since we are dealing with two bolalipid conformations, we also tested the homogeneity of the membrane. In our simulations of flat membrane patches we did not observe clustering or phase separation between the two bolalipid conformations beyond the [2,3]σ range. Furthermore, we get good agreement between our fluctuation measurement and the cylinder simulations in Fig. 2F. We now mention this verification in the revised version of the manuscript (p.8 ):

      “Fortunately, this dependency on curvature does not invalidate our fluctuation results, where the curvature is small enough that its effect on the bending modulus is negligible (SI section 15).”

      Last but not least, simulating bending/unbending cycles of an arc-shaped membrane (frozen endpoints) shows agreement with cylinder membrane simulations, and no hysteresis at the rates of deformation employed (cf. M. Amaral’s thesis [54], soon to be out of the embargo period).

      R3.Q10: The Introduction section of the manuscript is written with a biochemical approach, with very minor attention to the simulation works on this system. Some molecular dynamics works are only cited as existing previous work, without mentioning what has already been studied in archaeal membranes. While some information, like the binding of ESCRT proteins to archaeal membranes, though interesting, helps little to place the study within the discipline. The Introduction should be revised to show what has already been studied with simulations (as the authors mention in the Discussion) and how the presented research complements it.

      The present research covers, for the first time, archaeal membranes with a single coarse-grained model that can adopt both in-membrane bolalipid conformations, and sweeps through temperature, membrane composition, and molecular rigidity. The work shows the first curvature-dependent bending modulus for pure bolalipid membranes. It also systematically investigates the bending modulus and the Gaussian modulus, and tests the model in an all-encompassing budding simulation that incorporates topology changes. Existing atomistic or coarse-grained MD simulations (MARTINI or similar force fields) are limited to small patches of membrane, with no study of large-scale deformations or topology changes; in addition, they rely on force fields that were parametrized for bilayer membranes.

      To give a comprehensive overview of the field, we revised the introduction section of the manuscript, in which we now discuss previous computational work investigating membrane diffusivity, U-shaped lipid fraction, and bending rigidity (p.3 ):

      “By contrast, only a few studies have investigated bolalipid membranes applying computational or theoretical tools [24, 25]. Specifically, the pore closure time in bolalipid membranes, and the role of cyclopentane rings for membrane properties has been investigated using all-atom simulations, showing decreased lateral mobility, reduced permeability to water, and increased lipid packing [26–28]. Moreover, using coarse-grained simulations, it was suggested that bolalipid membranes are thicker [29], exhibit a gel-to-liquid phase transition at higher temperature [30], and exhibit a reduced diffusivity [31]. However, little research has been devoted to investigating mechanics and reshaping of bolalipid membranes at the mesoscale despite the obvious importance of this question from evolutionary, biophysics, and biotechnological perspectives and although different membrane physics is expected to manifest.”

      Following the reviewer’s advice and to keep the introduction concise and focused on bolalipid membranes, we have removed the paragraph on ESCRT-III proteins in the revised manuscript.

      R3.Q11: The authors have been a bit loose with using the term ”stability”. I’d like to see the distinction in each case, as in ”chemical/thermal/mechanical/conformational stability”.

      We have clarified when applicable the type of stability throughout the manuscript. In all other instances, if not clear from context, we mean simply that the membrane persists being a membrane. At our coarse-grained level, this means the membrane does not disassemble into a gas phase.

      R3.Q12: In the original Cooke-Deserno model, a so-called ”poorman’s angle-bending term” is used, which is essentially a bond-stretching term between the first and third particle. However, I notice the authors using the full harmonic angle-bending potential. This should be mentioned.

      This is made clear in the SI (Eq. (S3)). Cooke and Deserno mention the harmonic angle potential as a valid alternative in their original publication. We now also added this detail to the main text (p.3 ):

      “The angle formed by the chain of three beads is kept near 180° via an angular potential with strength k<sub>0</sub>, instead of the approximation by a bond between end beads of the original model [32].”

      R3.Q13: The analysis of energy of U-shaped lipids with the linear model E = c<sub>0</sub> + c<sub>1</sub>k<sub>bola</sub> is indeed very interesting. I am curious, can this also be corroborated with mean energy measurements? The minor issue is calling the source of the favorability of U-shaped lipids ”entropic”, while clearly an energetic contribution is found. The two conformations, for example, might differ in the interactions with the neighbouring lipids.

      We were also curious and thank the reviewer for the suggestion of mean energy measurements. We concluded that there must be either an entropic contribution to the free energy or an intermolecular interaction energy favouring U-shaped bolalipids. We have now included these measurements in SI section 6 (p.S5 ):

      “By splitting the average potential energy between an internal contribution (bonds, angles and pair interactions between particles in the same molecule) and an external contribution (pair interactions between a molecule and its neighbours), we determined the transition energy from straight to U-shaped bolalipids in detail. We found that this transition lowers the internal potential energy of the bolalipid while increasing its interaction energy. In total, we obtained an energy barrier for the transition of ΔE<sub>s→u</sub> = 0.79±0.01k<sub>B</sub>T. Since the fit indicates, however, that the U-shaped bolalipid conformation is preferred over the straight conformation, we conclude that there must be either an entropic contribution to the free energy or an intermolecular interaction energy favouring U-shaped bolalipids.”

      We refer to these measurements in the main text (p.6):

      “For the fit it appears that c<sub>0</sub> < 0, which implies that bolalipids in U-shape conformation are slightly favoured over straight bolalipids at k<sub>bola</sub> = 0 (explored in SI section 6).”

      R3.Q14: The authors write in the Discussion, ”In any case, our results indicate that membrane remodelling, such as membrane fission during membrane traffic, is much more difficult in bolalipid membranes [34].” Firstly, I’m not sure if studying the dependence of budding behaviour on adhesion energy with nanoparticles is enough to make claims about membrane fission. Secondly, why is the 2015 paper by Markus Deserno cited here?

      We thank the reviewer for giving us the opportunity to clarify. We make an energetic argument on membrane fission based on the observed difference in the ratio of the Gaussian modulus κ̄ to the bending modulus 𝜅, |κ̄/𝜅|.

      Splitting a spherical membrane vesicle into two spherical vesicles (fission) increases the bending energy by 8𝜋𝜅 and decreases the energy related to the Gaussian bending modulus κ̄ by 4𝜋|κ̄|. The second part of the argument is given, for example, in the review by Markus Deserno (p.23, right column), which is why we cite the paper here. Together, this gives an energy barrier required for membrane fission in the considered geometry of ∆E<sub>fission</sub> = 8𝜋𝜅 − 4𝜋|κ̄|. We found that |κ̄/𝜅| is around 0.5 for bolalipid membranes and around 1 for bilayer membranes. Since 𝜅 was typically larger in bolalipid membranes, we thus expect the energy barrier for fission ∆E<sub>fission</sub> to be larger for bolalipid membranes. We therefore predict that membrane remodelling, such as membrane fission during membrane trafficking, is harder in bolalipid membranes. We explain our reasoning in the discussion of the revised manuscript (p.13):

      “Membrane remodelling, such as the fission of one spherical vesicle into two, increases the bending energy by 8πκ but decreases the energy related to the Gaussian modulus κ̄ by 4π|κ̄| [39], giving rise to a fission energy barrier of ∆E<sub>fission</sub> = 8πκ − 4π|κ̄|. Our results indicated that while in bolalipid membranes κ is larger, |κ̄/κ| is smaller compared to bilayer membranes. Our results thus predict a larger energy barrier for membrane fission ∆E<sub>fission</sub> in bolalipid membranes compared to bilayer membranes.”
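
      The energetic argument can be made concrete with a small sketch. It assumes the Helfrich energy of a closed sphere without spontaneous curvature, 8πκ + 4πκ̄ with κ̄ < 0, so that splitting one sphere into two costs 8πκ − 4π|κ̄| = 4πκ(2 − |κ̄/κ|). The function name and the value of κ are hypothetical; the two ratios are the approximate values quoted in the response (about 1 for bilayers, about 0.5 for bolalipid membranes).

```python
import math


def fission_barrier(kappa, ratio):
    """Energy cost of splitting one spherical vesicle into two.

    kappa : bending modulus (in units of k_B T)
    ratio : |kappa_bar / kappa|, magnitude of the Gaussian-to-bending ratio
    Returns 8*pi*kappa - 4*pi*|kappa_bar| in the same units as kappa.
    """
    return 8.0 * math.pi * kappa - 4.0 * math.pi * ratio * kappa


kappa = 10.0  # hypothetical bending modulus, kept equal for both cases
barrier_bilayer = fission_barrier(kappa, ratio=1.0)
barrier_bolalipid = fission_barrier(kappa, ratio=0.5)
print(barrier_bilayer, barrier_bolalipid)
```

      Even at equal κ, the smaller ratio raises the barrier by a factor of 1.5 in this sketch; a larger κ in bolalipid membranes would increase the difference further.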

      R3.Q15: In the SI, where the measurement of the diffusion coefficient is discussed, the expression for D is missing the power 2 of displacement.

      We thank the reviewer for spotting this oversight. We corrected it in the revised version of the SI (p.S5).
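
      For context, the role of the squared displacement can be sketched with the standard Einstein relation for lateral (two-dimensional) diffusion, MSD = 4Dt. This is an assumption about the form of the estimator; the exact expression in the SI is not reproduced here. The function names, step count, time step and the value of D are hypothetical.

```python
import random


def estimate_diffusion_2d(xs, ys, dt):
    """Estimate D from the one-step mean squared displacement in 2D.

    Uses MSD = 4 * D * dt, hence D = <dx**2 + dy**2> / (4 * dt).
    """
    n = len(xs) - 1
    msd = sum((xs[i + 1] - xs[i]) ** 2 + (ys[i + 1] - ys[i]) ** 2
              for i in range(n)) / n
    return msd / (4.0 * dt)


def random_walk_2d(d_true, dt, n_steps, seed=0):
    """Generate a 2D Brownian trajectory with diffusion coefficient d_true."""
    rng = random.Random(seed)
    sigma = (2.0 * d_true * dt) ** 0.5  # per-axis step standard deviation
    xs, ys = [0.0], [0.0]
    for _ in range(n_steps):
        xs.append(xs[-1] + rng.gauss(0.0, sigma))
        ys.append(ys[-1] + rng.gauss(0.0, sigma))
    return xs, ys


xs, ys = random_walk_2d(d_true=0.5, dt=0.01, n_steps=20000)
d_est = estimate_diffusion_2d(xs, ys, dt=0.01)
print(d_est)
```

      Without the square, positive and negative displacements would average towards zero, so the estimator would not recover D.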

      R3.Q16: Where cargo uptake is discussed, the term ”adsorption energy” is used. I think the more appropriate term would be ”adhesion energy”.

      For the sake of simplicity, we changed the term to adhesion energy (caption of Fig. 4, and p.10). We do not have a strong opinion on this, but we believe that adsorption energy would be equally correct as we describe the adsorption of many lipid head beads to a nanoparticle.

      R3.Q17: Typos:

      Page 1, paragraph 2: Adaption → Adaptation. Page 10, paragraph 1: Stroking → Striking.

      We thank the reviewer for spotting these typos which we have corrected in the revised version of the manuscript.

      Recommendations for the authors

      Reviewer #1 (Recommendations for the authors):

      A few thoughts (likely out of the scope of this paper but possibly to consider upon revision):

      R1.Q1: Do bolalipids always have the same headgroup? I don’t recall reading this in the introduction/discussion. R1 and R2 are in Figure 1, but I don’t know whether there are standard types. Could this be expanded upon? Is the model able to take these differences into account?

      We thank the reviewer for raising this important question. Similar to bacteria and eukaryotes, in archaea there is a huge variety in terms of the different head groups that lipids can contain and thus also lipid variety. Most archaeal lipids have head groups that contain either phosphate groups or sugar residues. Typically, archaeal bolalipids are asymmetric and contain a phosphatidyl and a sugar moiety at the two ends of the lipid molecule. Within the membrane the lipid is oriented such that the phosphatidyl moiety points towards the interior of the cell whereas the sugar moiety points towards the outside of the cell as it occupies more space [5].

      In our computational model, however, we consider symmetric bolalipids for the sake of simplicity and to decouple the role of ”connected geometry” from other effects. In principle, we could investigate the effect of lipid asymmetry by increasing the size of one of the lipid head beads. However, this investigation exceeds the scope of the present study and therefore requires future work.

      In the revised version of the manuscript, we now clarify that bolalipids can have different headgroups (p.1 and the caption of Fig. 1):

      “The hydrophilic heads can be composed of different functional groups with phosphatidyl and sugar being the most relevant moieties. For bolalipids the two head groups at either end of the molecule are typically distinct (Fig. 1A right) [5].”

      “The hydrophilic head of a bolalipid can be composed of different functional groups represented by R1 and R2 (right).”

      We also explicitly state that we neglect lipid head group asymmetry for the sake of simplicity (p.4 ):

      “To decouple the effect of the connected geometry of the bolalipids from that of lipid asymmetry, we assume both head beads of a bolalipid to share the same properties.”

      R1.Q2: Is it possible to compare the mesoscale models to either Coarse-grained or even all-atom lipid models? Have simulations previously been performed for bolalipids at those levels of description?

      A few studies have previously investigated bolalipid membranes in simulations. These studies used either all-atom or coarse-grained simulations. However, none of these studies investigated how bolalipids respond to membrane deformations. Therefore, it is currently not possible to directly compare our results to studies in the literature. However, recapitulating our predictions experimentally is certainly something that could and should be done in the future. As a reply to this reviewer and reviewer 3, we discuss the current state of modelling bolalipid membranes in simulations in the revised version of the manuscript (p.3):

      “By contrast, only a few studies have investigated bolalipid membranes applying computational or theoretical tools [24, 25]. Specifically, the pore closure time in bolalipid membranes, and the role of cyclopentane rings for membrane properties has been investigated using all-atom simulations, showing decreased lateral mobility, reduced permeability to water, and increased lipid packing [26–28]. Moreover, using coarse-grained simulations, it was suggested that bolalipid membranes are thicker [29], exhibit a gel-to-liquid phase transition at higher temperature [30], and exhibit a reduced diffusivity [31]. However, little research has been devoted to investigating mechanics and reshaping of bolalipid membranes at the mesoscale despite the obvious importance of this question from evolutionary, biophysics, and biotechnological perspectives and although different membrane physics is expected to manifest.”

      We want to mention, however, that we do compare membrane diffusivity, U-shaped lipid fraction, and bending rigidity to the behaviour and values that have been previously measured in simulations in the discussion section. In general, we find good agreement between our results and previously reported behaviour/values (p.13 ):

      “While flexible bolalipid membranes are liquid under the same conditions as bilayer membranes, we found that stiff bolalipids form membranes that operate in the liquid regime at higher temperatures. These results agree well with previous molecular dynamics simulations that suggested that bolalipid membranes are more ordered and have a reduced diffusivity compared to bilayer membranes [24, 29]. In our simulations, this is due to the fact that completely flexible bolalipid molecules adopt both the straight (transmembrane) and the U-shaped (loop) conformation with approximately the same frequency. In contrast, stiff bolalipids typically only take on the straight conformation when assembled in a membrane. These results agree with the previous coarse-grained molecular dynamics simulations using the MARTINI force field which showed that the ratio of straight to U-shaped bolalipids increased upon stiffening the linker between the lipid tails [29].

      [...]

      When we determined the bending rigidity of bolalipid membranes by measuring their response to thermal fluctuations, we found that membranes made from flexible bolalipids are only slightly more rigid than bilayer membranes. This result is consistent with previous atomistic simulations, which showed that the membrane rigidity was similar for membranes composed of bilayer lipids and flexible synthetic bolalipids [45].”

      R1.Q3: How would membrane proteins alter the behaviour of bolalipids? Either those integral to the membrane or those binding peripherally?

      The reviewer asks an important question. However, the question is difficult to answer due to its scope and the gaps in the current literature. Important examples of integral or peripheral membrane proteins that alter the behaviour of bolalipids and archaeal bolalipid membranes are involved in cell homeostasis, cell division, membrane trafficking, and lipid synthesis.

      The cells of many archaeal species are enclosed in a paracrystalline protein layer called the S-layer, which is attached to the lipid membrane [4, 55]. The main function of the S-layer is to keep the cell’s shape and to protect it against osmotic stress. Due to the embedding of the S-layer in the membrane at specific locations, it is to be expected that the membrane properties are influenced by the S-layer. Furthermore, archaea execute cell division by locally reshaping the membrane using FtsZ and ESCRT-III proteins [56]. While Asgard archaeal genomes encode proteins with homology to those regulating aspects of eukaryotic membrane remodelling and trafficking [57], they have yet to be observed undergoing a process like endocytosis [58]. In addition, it has been speculated that the proteins that drive the synthesis of two diether lipids into a tetraether lipid are either membrane associated or integral membrane proteins [59].

      However, to the best of our knowledge it is not known how membrane proteins specifically alter the behaviour of bolalipids. Future work will need to be executed to answer this question. Following the advice of reviewer 3 and to keep the introduction concise and focused on bolalipid membranes, we do not mention these observations in the revised manuscript.

      R1.Q4: Is there a mechanism in cells to convert or switch bolalipids from a straight to a u-shaped description? Does this happen spontaneously or are there enzymes responsible for this?

      We thank the reviewer for bringing up this important point. Despite the relevance of the question, little is currently known about the mechanisms that make bolalipids transition between a straight and a U-shaped configuration, mainly because there is to date no established experimental method.

      Besides our own results, most of what we know comes from coarse-grained molecular dynamics simulations, which showed that bolalipids can spontaneously transition between the straight and U-shaped configuration [29]. In addition, by using comparative genomic analysis, it has been predicted that many archaeal species contain flippases, i.e., membrane proteins that are able, upon the consumption of energy, to transfer (flip-flop) bilayer lipids between the two membrane leaflets [43]. Moreover, it has been shown that Halobacterium salinarum (an archaeon with a bilayer lipid membrane) [44] contains scramblases, which are membrane proteins that passively transfer bilayer lipids from one membrane leaflet to the other. It is therefore tempting to speculate that similar proteins might exist for bolalipids which could facilitate the straight to U-shaped transition.

      In addition, it has been reported that vesicles composed of bolalipid membranes can undergo fusion with enveloped influenza viruses [17]. In this context, it has been suggested that the influenza fusion protein hemagglutinin may locally induce U-shaped bolalipids to facilitate membrane fusion. However, these hints fall far short of proving a mechanism that can drive the straight to U-shaped bolalipid transition, and further work needs to be done to investigate this question in detail.

      In the revised version of the manuscript, we now discuss what is known about potential mechanisms to facilitate the straight to U-shaped transition in the discussion section (p.13 ):

      “While previous coarse-grained simulations predicted that bolalipids spontaneously transition between the straight and U-shaped conformations [29], how this happens in archaeal membranes and whether membrane proteins are involved in this conformational transition needs to be clarified in the future. Experimental studies suggest that archaeal membranes contain flippases and scramblases for the transitioning of bilayer lipids between membrane leaflets [43, 44], raising the possibility that similar proteins could also facilitate conformational transitions in bolalipids. In addition, it has been suggested that the viral fusion protein hemagglutinin could cause a transition from straight to U-shaped bolalipid conformation during the fusion of bolalipid vesicles with influenza viruses [17]. However, future investigation is required.”

      R1.Q5: Ideally, coordinates and any parameter files required to run the molecular simulations should be included for reproducibility.

      We absolutely share the reviewer’s concern with reproducibility and as such have included in the original submission as part of our data availability section a link to a code repository (available at: https://doi.org/10.5281/zenodo.13934991 [51]) that allows initializing and simulating flat membrane patches, with user control of the parameters explored in this paper (𝜔,T<sub>eff</sub>,k<sub>bola</sub>,f<sup>bi</sup>).

      Reviewer #2 (Recommendations for the authors):

      This is a great paper and I congratulate the authors for writing such a fine piece of scholarship. The only nitty-gritty feedback that I have is summarized in the following three points:

      R2.Q1: In the introduction the authors talk about archaea adapting their membrane to retain membrane fluidity. However, homeoviscous adaptation is also fundamental in bacteria and eukaryotes.

      The reviewer is correct: like those of archaea, the membranes of bacteria and eukaryotes must balance flexibility and stability. Moreover, the cell membranes in all three domains of life need to maintain membrane fluidity and provide mobility to the embedded lipids and membrane proteins (homeoviscous adaptation). The general idea is that these organisms change the ratio of different lipids to change membrane properties and thereby optimally adapt to their environments [10]. Importantly, however, there are differences in how homeoviscous adaptation is maintained across the different domains of life. As a reply to this reviewer and reviewer 3, we now discuss the underlying mechanisms in the revised parts of the introduction (p.1):

      “Like for bacteria and eukaryotes, archaea must keep their lipid membranes in a fluid state (homeoviscous adaptation). This is important even under extreme environmental conditions, such as hot and cold temperatures, or high and low pH values [7]. Because of this, many archaea adapt to changes in their environment by tuning the lipid composition of their membranes: altering the ratio between bola- and bilayer lipids in their membranes [8, 9] and/or by changing the number of cyclopentane rings in their lipid tails, which are believed to make lipid molecules more rigid [5]. For example, Thermococcus kodakarensis increases its tetraether bolalipid ratio from around 50% to over 80% when the temperature of the environment increases from 60 to 85 °C [10]. Along the same lines, the cell membrane of Sulfolobus acidocaldarius can contain over 90% of bolalipids with up to 8 cyclopentane rings at 70 °C and pH 2.5 [5, 11]. It is worth mentioning that in exceptional cases bacteria also synthesise bolalipids in response to high temperatures [12], highlighting that the study of bolalipid membranes is relevant not only for archaeal biology but also from a general membrane biophysics perspective.”

      R2.Q2: Uncertainties in Gaussian rigidity modulus estimates are not properly reported.

      The large uncertainties in the Gaussian rigidity modulus were due to how they were calculated. In short, the Gaussian rigidity modulus κ̄ is determined in cap folding simulations [41] (SI section 9), by using the measured values of the dimensionless parameter 𝜉, related to the folding probability, the bending modulus 𝜅, the membrane line tension, and the cap radius R. In our case, the main source of uncertainty for determining κ̄ comes from the uncertainty in the measurement of the bending rigidity 𝜅. To obtain 𝜅, previously, we fitted fluctuation spectra for different seeds and only then averaged the obtained values. In the revised version of the manuscript, we now first pool the fluctuation spectra of the different simulation seeds before we fit all spectra at the same time. This new approach results in smaller uncertainties for the bending rigidity 𝜅 and also the Gaussian rigidity modulus κ̄.

      As a consistency check, in addition to the simulations that we previously performed at T<sub>eff</sub> = 1.3, we have repeated the cap folding and line tension simulations at T<sub>eff</sub> = 1.2, resulting in similar values for κ̄. In the revised version of the manuscript, we report the newly calculated values and uncertainties for κ̄ at T<sub>eff</sub> = 1.2 in the main text (p.8):

      “At T<sub>eff</sub> = 1.2, we obtained κ̄ = 4.30 ± 0.22 k<sub>B</sub>T and thus a ratio of |κ̄/κ| = 0.89 ± 0.04 for bilayer membranes, similar to what has been reported previously [41]. For flexible bolalipid membranes, we got a slightly smaller value for κ̄ = 5.04 ± 0.37 k<sub>B</sub>T. Due to the larger bending modulus, however, flexible bolalipid membranes show a significantly smaller ratio |κ̄/κ| = 0.64 ± 0.04 (k<sub>bola</sub> = 0). At larger temperature (T<sub>eff</sub> = 1.3), the ratio can be even smaller, |κ̄/κ| = 0.45 ± 0.07 (see SI section 9).”

      In addition, we report the values at T<sub>eff</sub> = 1.3 and T<sub>eff</sub> = 1.2 in the SI (p. S15, Table S4).

      We have also adapted the discussion of the Gaussian bending modulus accordingly (p. 13):

      “Another marked difference between bilayer and flexible bolalipid membranes is the ratio of the Gaussian rigidity to the bending modulus. Instead of being around 1 as for bilayer membranes [41], it is around 1/2 and therefore only half of that of bilayer lipids.”

      Reviewer #3 (Recommendations for the authors):

      While I think the bulk of the work presented is useful, some of the issues that I raised in my review are indeed major. Without properly addressing them, it is hard to accept the conclusions of the manuscript. I hope the authors can address them by revising their analysis.

      We thank the reviewer for their constructive feedback, which helped us to improve the manuscript. We have addressed all points raised by the reviewer in our detailed point-by-point response to the reviewer (see above). We hope the reviewer will now find it easier to accept our conclusions.

      (1) R. Phillips, J. Kondev, J. Theriot, and H. Garcia, Physical biology of the cell (Garland Science, New York, 2012).

      (2) H. T. McMahon and J. L. Gallop, Membrane curvature and mechanisms of dynamic cell membrane remodelling, Nature 438, 590 (2005).

      (3) S. B. Gould, Membranes and evolution, Curr. Biol. 28, R381 (2018).

      (4) S.-V. Albers and B. H. Meyer, The archaeal cell envelope, Nat. Rev. Microbiol. 9, 414 (2011).

      (5) P. M. Oger and A. Cario, Adaptation of the membrane in Archaea, Biophys. Chem. 183, 42 (2013).

      (6) K. Rastädter, D. J. Wurm, O. Spadiut, and J. Quehenberger, The Cell Membrane of Sulfolobus spp.—Homeoviscous Adaption and Biotechnological Applications, International Journal of Molecular Sciences 21, 3935 (2020).

      (7) P. L.-G. Chong, Archaebacterial bipolar tetraether lipids: Physico-chemical and membrane properties, Chem. Phys. Lipids 163, 253 (2010).

      (8) M. Tourte, P. Schaeffer, V. Grossi, and P. M. Oger, Functionalized Membrane Domains: An Ancestral Feature of Archaea?, Front. Microbiol. 11, 526 (2020).

      (9) Y. H. Kim, G. Leriche, K. Diraviyam, T. Koyanagi, K. Gao, D. Onofrei, J. Patterson, A. Guha, N. Gianneschi, G. P. Holland, M. K. Gilson, M. Mayer, D. Sept, and J. Yang, Entropic effects enable life at extreme temperatures, Sci. Adv. 5, eaaw4783 (2019).

      (10) M. F. Siliakus, J. van der Oost, and S. W. M. Kengen, Adaptations of archaeal and bacterial membranes to variations in temperature, pH and pressure, Extremophiles 21, 651 (2017).

      (11) D. W. Grogan, Phenotypic characterization of the archaebacterial genus sulfolobus: comparison of five wild-type strains, J. Bacteriol. 171, 6710 (1989).

      (12) D. X. Sahonero-Canavesi, M. F. Siliakus, A. Abdala Asbun, M. Koenen, F. von Meijenfeldt, S. Boeren, N. J. Bale, J. C. Engelman, K. Fiege, L. Strack van Schijndel, J. S. Sinninghe Damsté, and L. Villanueva, Disentangling the lipid divide: Identification of key enzymes for the biosynthesis of membrane-spanning and ether lipids in Bacteria, Sci. Adv. 8, eabq8652 (2022).

      (13) M. van Wolferen, A. A. Pulschen, B. Baum, S. Gribaldo, and S.-V. Albers, The cell biology of archaea, Nat. Microbiol. 10.1038/s41564-022-01215-8 (2022).

      (14) U. Bakowsky, U. Rothe, E. Antonopoulos, T. Martini, L. Henkel, and H.-J. Freisleben, Monomolecular organization of the main tetraether lipid from Thermoplasma acidophilum at the water–air interface, Chem. Phys. Lipids 105, 31 (2000).

      (15) C. Jeworrek, F. Evers, M. Erlkamp, S. Grobelny, M. Tolan, P. L.-G. Chong, and R. Winter, Structure and Phase Behavior of Archaeal Lipid Monolayers, Langmuir 27, 13113 (2011).

      (16) D. P. Brownholland, G. S. Longo, A. V. Struts, M. J. Justice, I. Szleifer, H. I. Petrache, M. F. Brown, and D. H. Thompson, Phase Separation in Binary Mixtures of Bipolar and Monopolar Lipid Dispersions Revealed by 2H NMR Spectroscopy, Small Angle X-Ray Scattering, and Molecular Theory, Biophysical Journal 97, 2700 (2009).

      (17) A. Bhattacharya, I. D. Falk, F. R. Moss, T. M. Weiss, K. N. Tran, N. Z. Burns, and S. G. Boxer, Structure–function relationships in pure archaeal bipolar tetraether lipids, Chem. Sci. 15, 14273 (2024).

      (18) V. Vitkova, D. Mitkova, V. Yordanova, P. Pohl, U. Bakowsky, G. Staneva, and O. Batishchev, Elasticity and phase behaviour of biomimetic membrane systems containing tetraether archaeal lipids, Colloids Surf. A Physicochem. Eng. Asp. 601, 124974 (2020).

      (19) E. Chang, Unusual thermal stability of liposomes made from bipolar tetraether lipids, Biochem. Biophys. Res. Commun. 202, 673 (1994).

      (20) O. V. Batishchev, A. S. Alekseeva, D. S. Tretiakova, T. R. Galimzyanov, A. Y. Chernyadyev, N. R. Onishchenko, P. E. Volynsky, and I. A. Boldyrev, Cyclopentane rings in hydrophobic chains of a phospholipid enhance the bilayer stability to electric breakdown, Soft Matter 16, 3216 (2020).

      (21) U. Seifert, Configurations of fluid membranes and vesicles, Adv. Phys. 46, 13 (1997).

      (22) H. Noguchi, Membrane Simulation Models from Nanometer to Micrometer Scale, J. Phys. Soc. Jpn. 78, 041007 (2009).

      (23) F. Frey and T. Idema, More than just a barrier: using physical models to couple membrane shape to cell function, Soft Matter 17, 3533 (2021).

      (24) C. Huguet, S. Fietz, A. Rosell-Melé, X. Daura, and L. Costenaro, Molecular dynamics simulation study of the effect of glycerol dialkyl glycerol tetraether hydroxylation on membrane thermostability, Biochimica et Biophysica Acta (BBA) - Biomembranes 1859, 966 (2017).

      (25) T. R. Galimzyanov, P. I. Kuzmin, P. Pohl, and S. A. Akimov, Elastic deformations of bolalipid membranes, Soft Matter 12, 2357 (2016).

      (26) T. R. Galimzyanov, P. E. Volynsky, and O. V. Batishchev, Continuum elasticity and molecular dynamics of a pore in archaeal bolalipid membranes, Soft Matter 21, 687 (2025).

      (27) A. O. Chugunov, P. E. Volynsky, N. A. Krylov, I. A. Boldyrev, and R. G. Efremov, Liquid but Durable: Molecular Dynamics Simulations Explain the Unique Properties of Archaeal-Like Membranes, Sci. Rep. 4, 7462 (2015).

      (28) L. F. Pineda De Castro, M. Dopson, and R. Friedman, Biological Membranes in Extreme Conditions: Simulations of Anionic Archaeal, PLoS One 11, e0155287 (2016).

      (29) M. Bulacu, X. Périole, and S. J. Marrink, In Silico Design of Robust Bolalipid Membranes, Biomacromolecules 13, 196 (2012).

      (30) C. H. Davis, H. Nie, and N. V. Dokholyan, Insights into thermophilic archaebacterial membrane stability from simplified models of lipid membranes, Phys. Rev. E 75, 051921 (2007).

      (31) S. Dey and J. Saha, Minimal Coarse-Grained Modeling toward Implicit Solvent Simulation of Generic Bolaamphiphiles, J. Phys. Chem. B 124, 2938 (2020).

      (32) I. R. Cooke and M. Deserno, Solvent-free model for self-assembling fluid bilayer membranes: Stabilization of the fluid phase based on broad attractive tail potentials, J. Chem. Phys. 123, 224710 (2005).

      (33) P. L.-G. Chong, U. Ayesa, V. Prakash Daswani, and E. C. Hur, On Physical Properties of Tetraether Lipid Membranes: Effects of Cyclopentane Rings, Archaea 2012, 1 (2012).

      (34) A. P. Thompson, H. M. Aktulga, R. Berger, D. S. Bolintineanu, W. M. Brown, P. S. Crozier, P. J. in ’t Veld, A. Kohlmeyer, S. G. Moore, T. D. Nguyen, R. Shan, M. J. Stevens, J. Tranchida, C. Trott, and S. J. Plimpton, LAMMPS - a flexible simulation tool for particle-based materials modeling at the atomic, meso, and continuum scales, Comput. Phys. Commun. 271, 108171 (2022).

      (35) A. Stukowski, Visualization and analysis of atomistic simulation data with ovito–the open visualization tool, Modelling and Simulation in Materials Science and Engineering 18, 015012 (2009).

      (36) E. R. May, A. Narang, and D. I. Kopelevich, Role of molecular tilt in thermal fluctuations of lipid membranes, Physical Review E 76, 021913 (2007).

      (37) W. Helfrich, Elastic Properties of Lipid Bilayers: Theory and Possible Experiments, Z. Naturforsch. C 28, 693 (1973).

      (38) M. Hamm and M. Kozlov, Elastic energy of tilt and bending of fluid membranes, Eur. Phys. J. E 3, 323 (2000).

      (39) M. Deserno, Fluid lipid membranes: From differential geometry to curvature stresses, Chemistry and Physics of Lipids 185, 11 (2015).

      (40) V. A. Harmandaris and M. Deserno, A novel method for measuring the bending rigidity of model lipid membranes by simulating tethers, The Journal of Chemical Physics 125, 204905 (2006).

      (41) M. Hu, J. J. Briguglio, and M. Deserno, Determining the Gaussian Curvature Modulus of Lipid Membranes in Simulations, Biophys. J. 102, 1403 (2012).

      (42) M. Deserno, Elastic deformation of a fluid membrane upon colloid binding, Phys. Rev. E 69, 031903 (2004), arXiv: cond-mat/0303656.

      (43) K. S. Makarova, M. Y. Galperin, and E. V. Koonin, Comparative genomic analysis of evolutionarily conserved but functionally uncharacterized membrane proteins in archaea: Prediction of novel components of secretion, membrane remodeling and glycosylation systems, Biochimie 118, 302 (2015).

      (44) A. Verchère, W.-L. Ou, B. Ploier, T. Morizumi, M. A. Goren, P. Bütikofer, O. P. Ernst, G. Khelashvili, and A. K. Menon, Light-independent phospholipid scramblase activity of bacteriorhodopsin from Halobacterium salinarum, Sci. Rep. 7, 9522 (2017).

      (45) T. B. H. Schroeder, G. Leriche, T. Koyanagi, M. A. Johnson, K. N. Haengel, O. M. Eggenberger, C. L. Wang, Y. H. Kim, K. Diraviyam, D. Sept, J. Yang, and M. Mayer, Effects of lipid tethering in extremophile-inspired membranes on H(+)/OH(-) flux at room temperature, Biophys. J. 110, 2430 (2016).

      (46) R. Xu, A. Dehghan, A.-C. Shi, and J. Zhou, Elastic property of membranes self-assembled from diblock and triblock copolymers, Chem. Phys. Lipids 221, 83 (2019).

      (47) Z. Dogic and S. Fraden, Ordered phases of filamentous viruses, Curr. Opin. Colloid Interface Sci. 11, 47 (2006).

      (48) E. Barry and Z. Dogic, Entropy driven self-assembly of nonamphiphilic colloidal membranes, Proc. Natl. Acad. Sci. U.S.A. 107, 10348 (2010).

      (49) A. J. Balchunas, R. A. Cabanas, M. J. Zakhary, T. Gibaud, S. Fraden, P. Sharma, M. F. Hagan, and Z. Dogic, Equation of state of colloidal membranes, Soft Matter 15, 6791 (2019).

      (50) M. Saracco, P. Schaeffer, M. Tourte, S.-V. Albers, Y. Louis, J. Peters, B. Demé, S. Fontanay, and P. M. Oger, Bilayer-Forming Lipids Enhance Archaeal Monolayer Membrane Stability, Int. J. Mol. Sci. 26, 3045 (2025).

      (51) M. Amaral, archaeal_membranes : code and examples (2024), available at https://doi.org/10.5281/zenodo.13934991.

      (52) M. F. Ergüder and M. Deserno, Identifying systematic errors in a power spectral analysis of simulated lipid membranes, The Journal of Chemical Physics 154, 214103 (2021).

      (53) J. Genova, N. Ulrih, V. Kralj-Iglič, A. Iglič, and I. Bivas, Bending Elasticity Modulus of Giant Vesicles Composed of Aeropyrum Pernix K1 Archaeal Lipid, Life 5, 1101 (2015).

      (54) M. Amaral, Archaeal Membranes: In Silico Modelling and Design, Ph.D. thesis, Institute of Science and Technology Austria (2024).

      (55) M. Pohlschroder, F. Pfeiffer, S. Schulze, and M. F. A. Halim, Archaeal cell surface biogenesis, FEMS Microbiol. Rev. 42, 694 (2018).

      (56) K. S. Makarova, N. Yutin, S. D. Bell, and E. V. Koonin, Evolution of diverse cell division and vesicle formation systems in Archaea, Nat. Rev. Microbiol. 8, 731 (2010).

      (57) C. W. Stairs and T. J. Ettema, The Archaeal Roots of the Eukaryotic Dynamic Actin Cytoskeleton, Curr. Biol. 30, R521 (2020).

      (58) B. Baum and D. A. Baum, The merger that made us, BMC Biol. 18, 72 (2020).

      (59) Z. Zeng, H. Chen, H. Yang, Y. Chen, W. Yang, X. Feng, H. Pei, and P. V. Welander, Identification of a protein responsible for the synthesis of archaeal membrane-spanning GDGT lipids, Nat. Commun. 13, 1545 (2022).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      When you search for something, you need to maintain some representation (a "template") of that target in your mind/brain. Otherwise, how would you know what you were looking for? If your phone is in a shocking pink case, you can guide your attention to pink things based on a target template that includes the attribute 'pink'. That guidance should get you to the phone pretty effectively if it is in view. Most real-world searches are more complicated. If you are looking for the toaster, you will make use of your knowledge of where toasters can be. Thus, if you are asked to find a toaster, you might first activate a template of a kitchen or a kitchen counter. You might worry about pulling up the toaster template only after you are reasonably sure you have restricted your attention to a sensible part of the scene.

      Zhou and Geng are looking for evidence of this early stage of guidance by information about the surrounding scene in a search task. They train Os to associate four faces with four places. Then, with Os in the scanner, they show one face - the target for a subsequent search. After an 8 sec delay, they show a search display where the face is placed on the associated scene 75% of the time. Thus, attending to the associated scene is a good idea. The questions of interest are "When can the experimenters decode which face Os saw from fMRI recording?", "When can the experimenters decode the associated scene?", and "Where in the brain can the experimenters see evidence of this decoding?" The answer is that the face but not the scene can be read out during the face's initial presentation. The key finding is that the scene can be read out (imperfectly but above chance) during the subsequent delay when Os are looking at just a fixation point. Apparently, seeing the face conjures up the scene in the mind's eye.

      This is a solid and believable result. The only issue, for me, is whether it is telling us anything specifically about search. Suppose you trained Os on the face-scene pairing but never did anything connected to the search. If you presented the face, would you not see evidence of recall of the associated scene? Maybe you would see the activation of the scene in different areas and you could identify some areas as search specific. I don't think anything like that was discussed here.

      You might also expect this result to be asymmetric. The idea is that the big scene gives the search information about the little face. The face should activate the larger useful scene more than the scene should activate the more incidental face, if the task was reversed. That might be true if the finding is related to a search where the scene context is presumed to be the useful attention guiding stimulus. You might not expect an asymmetry if Os were just learning an association.

      It is clear in this study that the face and the scene have been associated and that this can be seen in the fMRI data. It is also clear that a valid scene background speeds the behavioral response in the search task. The linkage between these two results is not entirely clear but perhaps future research will shed more light.

      It is also possible that I missed the clear evidence of the search-specific nature of the activation by the scene during the delay period. If so, I apologize and suggest that the point be underlined for readers like me.

      We have added text related to this issue, particularly in the discussion (page 19, line 6), and have also added citations of studies in humans and non-human primates showing a causal relationship between preparatory activity in prefrontal and visual cortex and visual search performance (page 6, line 16).

      Reviewer #2 (Public review):

      Summary:

      This work is one of the best instances of a well-controlled experiment and theoretically impactful findings within the literature on templates guiding attentional selection. I am a fan of the work that comes out of this lab and this particular manuscript is an excellent example as to why that is the case. Here, the authors use fMRI (employing MVPA) to test whether during the preparatory search period, a search template is invoked within the corresponding sensory regions, in the absence of physical stimulation. By associating faces with scenes, a strong association was created between two types of stimuli that recruit very specific neural processing regions - FFA for faces and PPA for scenes. The critical results showed that scene information that was associated with a particular cue could be decoded from PPA during the delay period. This result strongly supports the invoking of a very specific attentional template.

      Strengths:

      There is so much to be impressed with in this report. The writing of the manuscript is incredibly clear. The experimental design is clever and innovative. The analysis is sophisticated and also innovative. The results are solid and convincing.

      Weaknesses:

      I only have a few weaknesses to point out. This point is not so much of a weakness, but a further test of the hypothesis put forward by the authors. The delay period was long - 8 seconds. It would be interesting to split the delay period into the first 4 seconds and the last 4 seconds and run the same decoding analyses. The hypothesis here is that semantic associations take time to evolve, and it would be great to show that decoding gets stronger in the second delay period as opposed to the period right after the cue. I don't think this is necessary for publication, but I think it would be a stronger test of the template hypothesis.

      We conducted the suggested analysis, and we did not find clear evidence of differences in decoding scene information between the earlier and later portions of the delay period. This may be due to insufficient power when the data are divided, individual differences in when preparatory activation is the strongest, or truly no difference in activation over the delay period. More details of this analysis can be found in the supplementary materials (page 12, line 16; Figure S1).
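
      A split-delay decoding analysis of this kind can be sketched as follows. This is an illustration only, not the pipeline used in the study: it swaps in a leave-one-run-out nearest-centroid (correlation) classifier for whatever MVPA classifier the authors used, and it assumes that delay-period data are available as an array of shape (trials, time points, voxels). All names and the array layout are hypothetical.

```python
import numpy as np


def loro_decoding_accuracy(patterns, labels, runs):
    """Leave-one-run-out decoding with a nearest-centroid classifier.

    Each held-out trial is assigned to the class whose mean training
    pattern it correlates with most strongly (a stand-in classifier).
    """
    patterns = np.asarray(patterns, dtype=float)
    labels = np.asarray(labels)
    runs = np.asarray(runs)
    correct = 0
    for held_out in np.unique(runs):
        train = runs != held_out
        test = ~train
        classes = np.unique(labels[train])
        centroids = np.stack(
            [patterns[train & (labels == c)].mean(axis=0) for c in classes]
        )
        for x, y in zip(patterns[test], labels[test]):
            r = [np.corrcoef(x, c)[0, 1] for c in centroids]
            correct += int(classes[int(np.argmax(r))] == y)
    return correct / len(labels)


def split_delay_accuracy(delay_data, labels, runs, split):
    """Decode separately from the early and late part of the delay.

    delay_data has shape (n_trials, n_timepoints, n_voxels); time points
    before `split` form the early window, the rest the late window.
    """
    delay_data = np.asarray(delay_data, dtype=float)
    early = delay_data[:, :split, :].mean(axis=1)
    late = delay_data[:, split:, :].mean(axis=1)
    return (
        loro_decoding_accuracy(early, labels, runs),
        loro_decoding_accuracy(late, labels, runs),
    )
```

      Comparing the two returned accuracies across participants (for example with a paired test) would then address whether decoding strengthens over the delay; halving the data per window also halves the samples per pattern, which is one way the power concern mentioned above can arise.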

      Typo in the abstract: "curing" vs "during."

      Fixed.

      It is hard to know what to do with significant results in ROIs that are not motivated by specific hypotheses. However, for Figure 3, what are the explanations for ROIs that show significant differences above and beyond the direct hypotheses set out by the authors?

      We added reasoning for the other a priori ROIs in the introduction (page 4, line 26). There is substantial evidence suggesting that frontoparietal areas are involved in cognitive control, attentional control, and working memory. The ROIs we selected from frontal and parietal cortex are based on parcels within resting state networks defined by the 17-network atlases (Schaefer et al., 2018). The IFJ was defined by the HCP-MMP1 (Glasser et al., 2016). These regions are commonly used in studies of attention and cognitive control, and the exact ROIs selected are described in the section on “Regions of interest (ROI) definition”. While we have the strongest hypothesis for IFJ based on relatively recent work from the Desimone lab, the other ROIs in lateral frontal cortex and parietal cortex are also well documented in similar studies, although the exact computation being done by these regions during tasks can be hard to differentiate with fMRI.

      Reviewer #3 (Public review):

      The manuscript contains a carefully designed fMRI study, using MVPA to investigate which high-level association cortices contain target-related information to guide visual search. A special focus is on so-called 'target-associated' information, which has previously been shown to help guide attention during visual search. For this purpose, the authors trained their participants to learn specific target associations, in order to then test which brain regions may contain neural representations of those learnt associations. They found that at least some of the associations tested were encoded in prefrontal cortex during the cue and delay period.

      The manuscript is very carefully prepared. As far as I can see, the statistical analyses are all sound and the results integrate well with previous findings.

      I have no strong objections against the presented results and their interpretation.

      Reviewer #1 (Recommendations for the authors):

      One bit of trivia. In the abstract, you should define IFJ on its first appearance in the text. You get to that a bit later.

      Fixed.

      Reviewer #2 (Recommendations for the authors):

      I really don't have much to suggest, as I thought that this was a clearly written report that offered a clever paradigm and data that supported the conclusions. My only suggestion would be to split the delay period activity and test whether the strength of the template evolves over time. Even though fMRI is not the best tool for this, you would still predict stronger decoding in the second half of the delay period.

      Please see above for our response to the same comment.

      Reviewer #3 (Recommendations for the authors):

      I would just like to point out some minor aspects that might be worth improving before publishing this work.

      Abstract: While in general, the writing is clear and concise, I felt that the abstract of the manuscript was particularly hard to follow, probably because the authors at some point re-arranged individual sentences. For example, they write in line 12 about 'the preparatory period', but explain only in the following sentence that the preparatory period ensues 'before search begins'. This made it a bit hard to follow the overall logic and I think could easily be fixed. 

      We have addressed this comment and updated the abstract.

      Also in the abstract: 'The CONTENTS of the template typically CONTAIN...' sounds weird, no? Also, 'information is used to modulate sensory processing in preparation for guiding attention during search' sounds like a very over-complicated description of attentional facilitation. I'm not convinced either whether the sequence is correct here. Is the information really used to (first) modulate sensory processing (which is a sort of definition of attention in itself) to (then) prepare the guidance of attention in visual search?

      We have addressed this comment and updated the abstract.

      The sentence in line 7, 'However, many behavioral studies have shown that target-associated information is used to guide attention,...' (and the following sentence) assumes that the reader is somewhat familiar with the term 'target-associations'. I'm afraid that, for a naive reader, this term may only become fully understandable once the idea is introduced a bit later when mentioning that participants of the study were trained on face-scene pairings. I think it could help to give some very short explanation of 'target-associations' already when it is first mentioned. The term 'statistically co-occurring object pairs', for example, could be of great help here.

      Thank you for the suggestion. We have added it to the abstract.

      page 2, line 22: 'prefrotnal'

      Fixed.

      page 2, line 24/25: 'information ... can SUPPLANT (?) ... information'. (That's also a somewhat unfortunate repetition of 'information')

      Fixed.

      page 4, line 23-25: 'Working memory representations in lateral prefrontal and parietal regions are engaged in cognitive control computations that ARE (?) task non-specific but essential to their functioning'

      Fixed.

      page 7, line 1: maybe a comma before 'suggesting'?

      Fixed.

      page 7, line 14-16: Something seems wrong with this sentence: 'The distractor face was a race-gender match, which we previously FOUND MADE (?) target discrimination difficult enough to make the scene useful for guiding attention'

      We have addressed this comment and rewritten this part (now on page 7, line 18).

      Results / Discussion sections:

      In several figures, such as Fig. 3A, the three different IFJ regions are grouped separately from the other frontal areas, which makes sense given the special role IFJ plays in representing task-related templates. However, IFJ is still part of PFC. I think it would be more correct to group the other frontal areas (like FEF, vLPFC, etc.) as 'Other Frontal' or even 'Other PFC'.

      We have made the changes based on the reviewer’s suggestion.

      In some of the Figures, e.g. Fig 3 and 5, I had the impression that the activation patterns of some conditions in vLPFC were rather close to the location of IFJ, which is just a bit posterior. I think I remember that functional localisers of IFJ can actually vary quite a bit in localisation (see e.g. the Baldauf/Desimone paper). Also, I think it has been shown in the context of other regions, like the human FEF, that its position when defined by localisation tasks is not always nicely and fully congruent with the respective labels in an atlas like the Glasser atlas. It might help to take this into consideration when discussing the results, particularly since the term vLPFC is a rather vague collection of several brain parcels and not a parcel name in the Glasser atlas. Some people might even argue that vLPFC in the broad sense contains IFJ, similar to how 'Frontal' contains IFJ (see above). How strong of a point do the authors want to make about activation in IFJ versus in vLPFC?

      We have now added text discussing the inability to truly differentiate between subregions of IFJ and other parts of vLPFC in the methods section on ROIs (page 25, line 13) and in the discussion (page 18, line 25). However, given the likely imprecision of ROI boundaries, it is arguably all the more surprising that we see distinct patterns between the IFJ subregions defined by the Glasser HCP-MMP1 atlas and the other vLPFC regions defined by the 17-network atlases. We do not wish to overstate the precision of the IFJ regions, but we interpret the ROI results within the context of the larger literature. We expect that our findings will have to be reinterpreted when newer methods allow for better localization of functional subregions of the vLPFC in individuals.

      Given that the authors nicely explain in the introduction how important templates are in visual search, and given that FEF has such an important role in serially guiding saccades through visual search templates, I think it would be worth discussing the finding that FEF did not hold representation of these targets. Of course, this could be in part due to the specific task at hand, but it may still be interesting to note in the Discussion section that here FEF, although important for some top-down attention signals, did not keep representations of the 'search' templates. Is it because there is no spatial component to the task at hand (like proposed in Bedini 2021)?

      We have now added text directly addressing this point and citing the Bedini et al. paper in the discussion (page 18, line 18). Besides our current findings, the relationship between IFJ and FEF is really interesting and will hopefully be investigated more in the future.

      Page 18, line 5: 'we the(N) associated...'

      Fixed.

    1. Resubmitting Essays and Late Work Resubmitting Essays 1-3 That's right! You can resubmit Essays 1-3 for a different grade. I will provide feedback on assignments and essays that you can then use to improve your understanding of the content, writing ability, or critical thinking about the text. Essay resubmits are usually due by Week 17, but more information will be provided within the module and assignment page. Late Work I will expect that you will strive to complete each assignment by the due date. But I recognize that you are juggling a lot and there may be days when completing coursework is not your top priority. If you anticipate the need for an extension, please send me a message in advance (as much as possible) of the due date so I am aware of your situation. Propose an alternative due date that you feel is reasonable (I advise no more than 48 hours to ensure you do not get behind). I will reply with an agreed-upon due date to support your success. Receiving an extension/late points on discussion boards or social annotations will not be permitted. The very nature of discussions is to have a conversation around/about the content that is interactive and timely. For an asynchronous class (like this class), it's important to have "due by dates" so everyone has time to plan and participate. If students are interacting on discussion boards or social annotations past the due by dates, there is a chance that students (and I) will miss the awesome things you want to share because we will be focused on the next discussion or assignments. We want to read and engage with people in this class. Please make every effort to participate and engage in the weekly discussions.

      I think everything stated in this section is very fair, especially because I also work 30-40 hours a week. The rule about not getting extensions on discussion boards or social annotations makes sense and will help us to participate and get the most out of learning. I also like how we can give an alternative due date as long as it is reasonable because it gives us flexibility for our other classes and also for work and things outside of school.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      In this study by Li et al., the authors re-investigated the role of cDC1 for atherosclerosis progression using the ApoE model. First, the authors confirmed the accumulation of cDC1 in atherosclerotic lesions in mice and humans. Then, in order to examine the functional relevance of this cell type, the authors developed a new mouse model to selectively target cDC1. Specifically, they inserted the Cre recombinase directly after the start codon of the endogenous XCR1 gene, thereby avoiding off-target activity. Following validation of this model, the authors crossed it with ApoE-deficient mice and found a striking reduction of aortic lesions (numbers and size) following a high-fat diet. The authors further characterized the impact of cDC1 depletion on lesional T cells and their activation state. Also, they provide in-depth transcriptomic analyses of lesional in comparison to splenic and nodal cDC1. These results imply cellular interactions between lesion T cells and cDC1. Finally, the authors show that the chemokine XCL1, which is produced by activated CD8 T cells (and NK cells), plays a key role in the interaction with XCR1-expressing cDC1 and particularly in the atherosclerotic disease progression.

      Strengths:

      The surprising results on XCL1 represent a very important gain in knowledge. The role of cDC1 is clarified with a new genetic mouse model.

      Thank you

      Weaknesses:

      My criticism is limited to the analysis of the scRNAseq data of the cDC1. I think it would be important to match these data with published data sets on cDC1. In particular, the data set by Sophie Janssen's group on splenic cDC1 might be helpful here (PMID: 37172103; https://www.single-cell.be/spleen_cDC_homeostatic_maturation/datasets/cdc1). It would be good to assign a cluster based on the categories used there (early/late, immature/mature, at least for splenic DC).

      Thank you very much for your help. Using the scRNA seq data of Xcr1<sup>+</sup> cDC1 sorted from ApoE<sup>–/–</sup> mice, we re-annotated the populations, following the methodology proposed by Sophie Janssen's group. These results are presented in Figure S9 and Figure S10 and described in detail in the Results and Discussion section.

      Please refer to the Results section from line 264 to 284: “Using the scRNA seq data of Xcr1<sup>+</sup> cDC1 sorted from hyperlipidemic mice, we annotated the 10 populations as shown in Figure S9A, following the methodology from a previous study [41]. Ccr7<sup>+</sup> mature cDC1s (Cluster 3, 7 and 9) and Ccr7- immature cDC1s (remaining clusters) were identified across cDC1 cells sorted from aorta, spleen and lymph nodes (Figure S9B). Further stratification based on marker genes reveals that Cluster 10 is the pre-cDC1, with high expression level of CD62L (Sell) and low expression level of CD8a (Figure S9C). Cluster 6 and 8 are the proliferating cDC1s, which express high level of cell cycling genes Stmn1 and Top2a (Figure S9D). Cluster 1 and 4 are early immature cDC1s, and cluster 2 and 5 are late immature cDC1s, according to the expression pattern of Itgae, Nr4a2 (Figure S9E). Cluster 9 cells are early mature cDC1s, with elevated expression of Cxcl9 and Cxcl10 (Figure S9F). Cluster 3 and 7 as late mature cDC1s, characterized by the expression of Cd63 and Fscn1 (Figure S9G). As shown in Figure 5C and Figure S9, the 10 populations displayed a major difference of aortic cDC1 cells that lack in pre-cDC1s (cluster 10) and mature cells (cluster 3, 7 and 9). Interestingly, in hyperlipidemic mice splenic cDC1 possess only Cluster 3 as the late mature cells while the lymph node cDC1 cells have two late mature populations namely Cluster 3 and Cluster 7. In further analysis, we also compared splenic cDC1 cells from HFD mice to those from ND mice. As shown in Figure S10, HFD appears to impact early immature cDC1-1 cells (Cluster 1) and increases the abundance of late immature cDC1 cells (Cluster 2 and 5), regardless of the fact that all 10 populations are present in two origins of samples. We also found that Tnfaip3 and Serinc3 are among the most upregulated genes, while Apol7c and Tifab are downregulated in splenic cDC1 cells sorted from HFD mice”.  

      Please refer to the Discussion section from line 380 to 385: “Based on the maturation analysis of the cDC1 scRNA seq data [41], our findings suggest that the aortic cDC1 cells display a major difference from those of spleen and lymph nodes by lacking the mature clusters, whereas lymph node cDC1 cells contain an additional Fabp5<sup>+</sup> S100a4<sup>+</sup> late mature Cluster. Our results also suggest that hyperlipidemia contributes to alteration in early immature cDC1 and in the abundance of late immature cDC1 cells, which was associated with dramatic change in gene expression of Tnfaip3, Serinc3, Apol7c and Tifab”.

      Reviewer #2 (Public review):

      This study investigates the role of cDC1 in atherosclerosis progression using Xcr1Cre-Gfp Rosa26LSL-DTA ApoE-/- mice. The authors demonstrate that selective depletion of cDC1 reduces atherosclerotic lesions in hyperlipidemic mice. While cDC1 depletion did not alter macrophage populations, it suppressed T cell activation (both CD4+ and CD8+ subsets) within aortic plaques. Further, targeting the chemokine Xcl1 (ligand of Xcr1) effectively inhibits atherosclerosis. The manuscript is well-written, and the data are clearly presented. However, several points require clarification:

      (1) In Figure 1C (upper plot), it is not clear what the Xcr1 single-positive region in the aortic root represents, or whether this is caused by unspecific staining. So I wonder whether Xcr1 single-positive staining can reliably represent cDC1. For accurate cDC1 gating in Figure 1E, Xcr1+CD11c+ co-staining should be used instead.

      The observed false-positive signal in the wavy structures within immunofluorescence Figure 1C (upper panel) results from the strong autofluorescence of elastic fibers, a major vascular wall component (alongside collagen). This intrinsic property of elastic fibers is a well-documented confounder in immunofluorescence studies [A, B].

      In contrast, immunohistochemistry (IHC) employs an enzymatic chromogenic reaction (HRP with DAB substrate) that generates a brown precipitate exclusively at antigen-antibody binding sites. Importantly, vascular elastic fibers lack endogenous enzymatic activity capable of catalyzing the DAB reaction, thereby preventing this source of false positivity in IHC.

      Given that Xcr1 is exclusively expressed on conventional type 1 dendritic cells [C], and considering that IHC lacks the multiplexing capability inherent to immunofluorescence for antigen co-localization, single-positive Xcr1 staining reliably identifies cDC1s in IHC results.

      [A] König, K et al. “Multiphoton autofluorescence imaging of intratissue elastic fibers.” Biomaterials vol. 26,5 (2005): 495-500. doi:10.1016/j.biomaterials.2004.02.059

      [B] Andreasson, Anne-Christine et al. “Confocal scanning laser microscopy measurements of atherosclerotic lesions in mice aorta. A fast evaluation method for volume determinations.” Atherosclerosis vol. 179,1 (2005): 35-42. doi:10.1016/j.atherosclerosis.2004.10.040

      [C] Dorner, Brigitte G et al. “Selective expression of the chemokine receptor XCR1 on cross-presenting dendritic cells determines cooperation with CD8+ T cells.” Immunity vol. 31,5 (2009): 823-33. doi:10.1016/j.immuni.2009.08.027

      (2) Figure 4D suggests that cDC1 depletion does not affect CD4+/CD8+ T cells. However, only the proportion of these subsets within total T cells is shown. To fully interpret effects, the authors should provide:

      (a) Absolute numbers of total T cells in aortas.

      (b) Absolute counts of CD4+ and CD8+ T cells.

      Thanks for your suggestions. We agree that assessing both proportions and absolute numbers in Figure 4 provides a more complete picture of the effects of cDC1 depletion on T cell populations. We have also added the absolute counts of cDC1 cells and total T cells, and the CD44 MFI (mean fluorescence intensity) in CD4<sup>+</sup> and CD8<sup>+</sup> T cells, to Figure 4, and have supplemented the corresponding textual descriptions in the revised manuscript.

      Please refer to the Results section from line 183 to 187: “Subsequently, we assessed T cell phenotype in the two groups of mice. While neither the frequencies nor absolute counts of aortic CD4<sup>+</sup> and CD8<sup>+</sup> T cells differed significantly between two groups of mice (Figure 4D-F), CD69 frequency and CD44 MFI (Mean Fluorescence Intensity), the T cell activation markers, were significantly reduced in both CD4<sup>+</sup> and CD8<sup>+</sup> T cells from Xcr1<sup>+</sup> cDC1 depleted mice compared to controls (Figure 4G and H)”.

      (3) How does T cell activation mechanistically influence atherosclerosis progression? Why was CD69 selected as the sole activation marker? Were other markers (e.g., KLRG1, ICOS, CD44) examined to confirm activation status?

      We sincerely appreciate these insightful comments. As extensively documented in the literature, activated effector T cells (both CD4+ and CD8+) critically promote plaque inflammation and instability through their production of pro-inflammatory cytokines (particularly IFN-γ and TNF-α), which drive endothelial activation, exacerbate macrophage inflammatory responses, and impair smooth muscle cell function [A].

      In our study, we specifically investigated the role of cDC1 cells in atherosclerosis progression. Our key findings demonstrate that cDC1 depletion attenuates T cell activation (as shown by reduced CD69/CD44 expression) and that this reduction in activation is functionally linked to the observed decrease in atherosclerosis burden in our model. 

      Regarding CD44 as an activation marker, we performed quantitative analyses of CD44 mean fluorescence intensity (MFI) in aortic T cells (Figure 4). Importantly, the MFI of CD44 was significantly lower on both CD4+ and CD8+ T cells from Xcr1<sup>Cre-Gfp</sup> Rosa26<sup>LSL-DTA</sup> ApoE<sup>–/–</sup> mice compared to the control ApoE<sup>–/–</sup> mice (data shown below), which is consistent with the result of CD69 in Figure 4. We added the related description in the Result section.

      Please refer to the Results section from line 185 to 187 “CD69 frequency and CD44 MFI (Mean Fluorescence Intensity), the T cell activation markers, were significantly reduced in both CD4+ and CD8+ T cells from Xcr1+ cDC1 depleted mice compared to controls (Figure 4G and H)”.

      Similarly, MFI of CD44 was significantly lower on both CD4<sup>+</sup> and CD8<sup>+</sup> T cells from Xcl1<sup>–/–</sup> ApoE<sup>–/–</sup> mice compared to the control ApoE<sup>–/–</sup> mice (data shown below), which is consistent with the result of CD69 in Figure 7. We also added the related description in the Result section.

      Please refer to the Results section from line 308 to 309 “Crucially, CD69<sup>+</sup> frequency and CD44 MFI remained comparable in both aortic CD4<sup>+</sup> and CD8<sup>+</sup> T cells between two groups (Figure 7D-F).”

      [A] Hansson, Göran K, and Andreas Hermansson. “The immune system in atherosclerosis.” Nature immunology vol. 12,3 (2011): 204-12. doi:10.1038/ni.2001

      (4) Figure 7B: Beyond cDC1/2 proportions within cDCs, please report absolute counts of: Total cDCs, cDC1, and cDC2 subsets. Figure 7D: In addition to CD4+/CD8+ T cell proportions, the following should be included:

      (a) Total T cell numbers in aortas

      (b) Absolute counts of CD4+ and CD8+ T cells.

      Thanks for your suggestions. We have now included in Figure 7 the absolute counts of cDC, cDC1, and cDC2 cells, along with CD4<sup>+</sup> and CD8<sup>+</sup> T cells in aortic tissues. Additionally, we provide the corresponding CD44 mean fluorescence intensity (MFI) measurements for both CD4<sup>+</sup> and CD8<sup>+</sup> T cell populations. We added the related description in the Result section.

      Please refer to the Results section from line 303 to 311: “The flow cytometric results illustrated that both frequencies and absolute counts of Xcr1<sup>+</sup> cDC1 cells in the aorta were significantly reduced, but cDCs and cDC2 cells from Xcl1<sup>–/–</sup> ApoE<sup>–/–</sup> were comparable with that from ApoE<sup>–/–</sup> (Figure 7A-C). Moreover, in both lymph node and spleen, the absolute numbers of pDC, cDC1 and cDC2 from Xcl1<sup>–/–</sup> ApoE<sup>–/–</sup> were comparable with that from ApoE<sup>–/–</sup> (Figure S11). Crucially, CD69<sup>+</sup> frequency and CD44 MFI remained comparable in both aortic CD4<sup>+</sup> and CD8<sup>+</sup> T cells between two groups (Figure 7D-F). However, aortic CD8<sup>+</sup> T cells exhibited reduced frequency and absolute count, while CD4<sup>+</sup> T cells showed increased frequency but unchanged counts in Xcl1<sup>–/–</sup> ApoE<sup>–/–</sup> mouse versus controls (Figure 7G and H).”

      (5) cDC1 depletion reduced CD69+CD4+ and CD69+CD8+ T cells, whereas Xcl1 depletion decreased Xcr1+ cDC1 cells without altering activated T cells. How do the authors explain these different results? This discrepancy needs explanation.

      We sincerely appreciate your professional and insightful comments regarding the mechanistic relationship between cDC1 depletion and T cell activation. Direct cDC1 depletion in the Xcr1<sup>Cre-Gfp</sup> Rosa26<sup>LSL-DTA</sup> ApoE<sup>–/–</sup> mouse model removes both recruited and tissue-resident cDC1s, eliminating their multifunctional roles in antigen presentation, co-stimulation and cytokine secretion essential for T cell activation. In contrast, Xcl1 depletion reduces, but does not eliminate, cDC1 migration into plaques. Furthermore, alternative chemokine axes (e.g., CCL5/CCR5, CXCL9/CXCR3, BCL9/BCL9L) may partially rescue cDC1 recruitment [13, 68, 69], and non-cDC1 APCs (e.g., monocytes, cDC2s) may compensate for T cell activation [55, 70]. We emphasize that Xcl1 depletion specifically failed to alter T cell activation in hyperlipidemic ApoE<sup>–/–</sup> mice. However, its impact may differ in other pathophysiological contexts due to compensatory mechanisms. We thank you again for highlighting this nuance, which strengthens our mechanistic interpretation. We have added these points to the discussion section and included new references.

      Please refer to the Discussion section from line 407 to 413: “Notably, while complete ablation of Xcr1<sup>+</sup> cDC1s impaired T cell activation, reduction of Xcr1<sup>+</sup> cDC1 recruitment via Xcl1 deletion did not significantly compromise this process. This discrepancy may arise through compensatory mechanisms: alternative chemokine axes (e.g., CCL5/CCR5, CXCL9/CXCR3, BCL9/BCL9L) may partially rescue Xcr1<sup>+</sup> cDC1 homing [13, 68, 69], while non-cDC1 antigen-presenting cells (e.g., monocytes, cDC2s) may sustain T cell activation [55, 70]. Furthermore, tissue-specific microenvironment factors could potentially modulate its role in other diseases.”

      [13] Eisenbarth, S C. “Dendritic cell subsets in T cell programming: location dictates function.” Nature reviews. Immunology vol. 19,2 (2019): 89-103. doi:10.1038/s41577-018-0088-1

      [55] Brewitz, Anna et al. “CD8+ T Cells Orchestrate pDC-XCR1+ Dendritic Cell Spatial and Functional Cooperativity to Optimize Priming.” Immunity vol. 46,2 (2017): 205-219. doi:10.1016/j.immuni.2017.01.003

      [68] de Oliveira, Carine Ervolino et al. “CCR5-Dependent Homing of T Regulatory Cells to the Tumor Microenvironment Contributes to Skin Squamous Cell Carcinoma Development.” Molecular cancer therapeutics vol. 16,12 (2017): 2871-2880. doi:10.1158/1535-7163.MCT-17-0341

      [69] He F, Wu Z, Liu C, Zhu Y, Zhou Y, Tian E, et al. Targeting BCL9/BCL9L enhances antigen presentation by promoting conventional type 1 dendritic cell (cDC1) activation and tumor infiltration. Signal Transduct Target Ther. 2024;9(1):139. Epub 2024/05/30. doi: 10.1038/s41392-024-01838-9. PubMed PMID: 38811552; PubMed Central PMCID: PMCPMC11137111.

      [70] Böttcher, Jan P et al. “Functional classification of memory CD8(+) T cells by CX3CR1 expression.” Nature communications vol. 6 8306. 25 Sep. 2015, doi:10.1038/ncomms9306.

      Reviewer #1 (Recommendations for the authors):

      (1) Line 32 - The authors might want to add that the mouse model leads to a "constitutive" depletion of cDC1.

      Thanks for your advice, we have revised the sentence as follows.

      Please refer to the Results section from line 31 to 33: “we established Xcr1<sup>Cre-Gfp</sup> Rosa26<sup>LSL-DTA</sup> ApoE<sup>–/–</sup> mice, a novel and complex genetic model, in which cDC1 was constitutively depleted in vivo during atherosclerosis development”.

      (2) Line 187-188: The authors claim that T cell activation was "inhibited" if cDC1 was depleted. The data shows that the T cells were less activated, but there is no indication of any kind of inhibition; this should be corrected.

      Thanks for your advice, we have revised the sentence as follows.

      Please refer to the Results section from line 183 to 187: “Subsequently, we assessed T cell phenotype in the two groups of mice. While neither the frequencies nor absolute counts of aortic CD4<sup>+</sup> and CD8<sup>+</sup> T cells differed significantly between two groups of mice (Figure 4D-F), CD69 frequency and CD44 MFI (Mean Fluorescence Intensity), the T cell activation markers, were significantly reduced in both CD4<sup>+</sup> and CD8<sup>+</sup> T cells from Xcr1<sup>+</sup> cDC1 depleted mice compared to controls (Figure 4G and H)”.

      (3) Why are some splenic DC clusters absent in LNs and vice versa? This is not obvious to this reviewer and should at least be discussed.

      We appreciate the insightful question regarding the absence of certain splenic DC clusters in LNs. This phenomenon in Figure 5 aligns with the 'division of labor' paradigm in dendritic cell biology: tissue microenvironments evolve specialized DC subsets to address local immunological challenges. The absence of universal clusters reflects functional adaptation, not technical artifacts. We acknowledge that this tissue-specific heterogeneity warrants further discussion and have expanded our analysis to address this point in the discussion part of our manuscript.

      Please refer to the Discussion section from line 375 to 385: “This pronounced tissue-specific compartmentalization of Xcr1<sup>+</sup> cDC1 subsets may related to multiple mechanisms including developmental imprinting that instructs precursor differentiation into transcriptionally distinct subpopulations [62], and microenvironmental filtering through organ-specific chemokine axes (e.g., CCL2/CCR2 in spleen) selectively recruits receptor-matched subsets [63, 64]. This spatial specialization optimizes pathogen surveillance for local immunological challenges. Based on the maturation analysis of the cDC1 scRNA seq data [41], our findings suggest that the aortic cDC1 cells display a major difference from those of spleen and lymph nodes by lacking the mature clusters, whereas lymph node cDC1 cells contain an additional Fabp5<sup>+</sup> S100a4<sup>+</sup> late mature Cluster. Our results also suggest that hyperlipidemia contributes to alteration in early immature cDC1 and in the abundance of late immature cDC1 cells, which was associated with dramatic change in gene expression of Tnfaip3, Serinc3, Apol7c and Tifab”.

      [62]. Liu Z, Gu Y, Chakarov S, Bleriot C, Kwok I, Chen X, et al. Fate Mapping via Ms4a3-Expression History Traces Monocyte-Derived Cells. Cell. 2019;178(6):1509-25 e19. Epub 2019/09/07. doi: 10.1016/j.cell.2019.08.009. PubMed PMID: 31491389.

      [63]. Bosmans LA, van Tiel CM, Aarts S, Willemsen L, Baardman J, van Os BW, et al. Myeloid CD40 deficiency reduces atherosclerosis by impairing macrophages' transition into a pro-inflammatory state. Cardiovasc Res. 2023;119(5):1146-60. Epub 2022/05/20. doi: 10.1093/cvr/cvac084. PubMed PMID: 35587037; PubMed Central PMCID: PMCPMC10202633.

      [64]. Mildner A, Schonheit J, Giladi A, David E, Lara-Astiaso D, Lorenzo-Vivas E, et al. Genomic Characterization of Murine Monocytes Reveals C/EBPbeta Transcription Factor Dependence of Ly6C(-) Cells. Immunity. 2017;46(5):849-62 e7. Epub 2017/05/18. doi: 10.1016/j.immuni.2017.04.018. PubMed PMID: 28514690.

      [41]. Bosteels V, Marechal S, De Nolf C, Rennen S, Maelfait J, Tavernier SJ, et al. LXR signaling controls homeostatic dendritic cell maturation. Sci Immunol. 2023;8(83):eadd3955. Epub 2023/05/12. doi: 10.1126/sciimmunol.add3955. PubMed PMID: 37172103.

      (4) The authors should discuss how XCL1 could impact lesional cDC1 and T cell abundance. Notably, preDCs do not express XCR1, and T cells express XCL1 following TCR activation. Is there a recruitment or local proliferation defect of cDC1 in the absence of XCL1? Could there also be a role for NK cells as a potential source of XCL1?

      We appreciate your insightful questions regarding the differential effects of Xcl1 on cDC1s and T cells. Xcl1 primarily mediates the recruitment of mature cDC1s. Our data demonstrate that Xcl1 deletion significantly reduces aortic cDC1 abundance, which correlates with a concomitant decrease in CD8<sup>+</sup> T cell numbers within the aorta. These findings strongly suggest that the Xcl1-Xcr1 axis plays a regulatory role in T cell accumulation in aortic plaques.

      Consistent with prior studies [A, B], cDC1 recruitment can occur in the absence of Xcl1, which echoes our finding that cDC1 cells were still present in Xcl1-knockout aortic plaques, albeit at lower abundance. Further studies are required to address how Xcl1-dependent and Xcl1-independent cDC1 cells activate T cells and whether they differ in their capacity to proliferate in tissue. We have added these points to the Discussion section.

      Please refer to the Discussion section from line 407 to 415: “Notably, while complete ablation of Xcr1<sup>+</sup> cDC1s impaired T cell activation, reduction of Xcr1<sup>+</sup> cDC1 recruitment via Xcl1 deletion did not significantly compromise this process. This discrepancy may arise through compensatory mechanisms: alternative chemokine axes (e.g., CCL5/CCR5, CXCL9/CXCR3, BCL9/BCL9L) may partially rescue Xcr1<sup>+</sup> cDC1 homing [13, 68, 69], while non-cDC1 antigen-presenting cells (e.g., monocytes, cDC2s) may sustain T cell activation [55, 70]. Furthermore, tissue-specific microenvironment factors could potentially modulate its role in other diseases. In summary, our findings identify Xcl1 as a potential therapeutic target for atherosclerosis therapy, though its cellular origins and regulation of lesional Xcr1<sup>+</sup> cDC1 and T cells dynamics require further studies”.

      In the literature, Xcl1 is expressed in NK cells and subsets of T cells, and NK cells may be a potential source of Xcl1 during atherosclerosis, which deserves further investigation [A, C, D].

      [A] Böttcher, Jan P et al. “NK Cells Stimulate Recruitment of cDC1 into the Tumor Microenvironment Promoting Cancer Immune Control.” Cell vol. 172,5 (2018): 1022-1037.e14. doi:10.1016/j.cell.2018.01.004

      [B] He, Fenglian et al. “Targeting BCL9/BCL9L enhances antigen presentation by promoting conventional type 1 dendritic cell (cDC1) activation and tumor infiltration.” Signal transduction and targeted therapy vol. 9,1 139. 29 May. 2024, doi:10.1038/s41392-024-01838-9

      [C] Woo, Yeon Duk et al. “The invariant natural killer T cell-mediated chemokine X-C motif chemokine ligand 1-X-C motif chemokine receptor 1 axis promotes allergic airway hyperresponsiveness by recruiting CD103+ dendritic cells.” The Journal of allergy and clinical immunology vol. 142,6 (2018): 1781-1792.e12. doi:10.1016/j.jaci.2017.12.1005

      [D] Winkels, Holger et al. “Atlas of the Immune Cell Repertoire in Mouse Atherosclerosis Defined by Single-Cell RNA-Sequencing and Mass Cytometry.” Circulation research vol. 122,12 (2018): 1675-1688. doi:10.1161/CIRCRESAHA.117.312513

      Reviewer #2 (Recommendations for the authors):

      There is a logical error in line 298. I suggest revising to: "Collectively, these data suggest that Xcl1 promotes atherosclerosis by recruiting Xcr1+ cDC1 cells, which subsequently drive T cell activation in lesions."

      Thanks for your advice. Since Xcl1 deficiency reduced both the frequencies and absolute counts of Xcr1+ cDC1 and CD8+ T cells in lesions without affecting T cell activation, we revised the sentence as you suggested.

      Please refer to the Results section from line 314 to 315: “Collectively, these data suggest that Xcl1 promotes atherosclerosis by recruiting Xcr1<sup>+</sup> cDC1 cells, and facilitating CD8<sup>+</sup> T cell accumulation in lesions”.

    1. Author response:

      We thank the reviewers for their thorough evaluation and constructive feedback on our manuscript.

      We think that their valuable suggestions will strengthen the manuscript and help us clarify several important points.

      All reviewers acknowledged the importance of our theoretical results and network classification in making pattern formation analysis a more tractable problem. At the same time, they have also raised a number of important concerns that we shall carefully consider.

      A. A major clarification that the reviewers found important concerns the definition of non-trivial pattern transformations and its generalization to higher dimensions. In this regard, the reviewers’ comments are:

      Reviewer #1:

      (on non-trivial pattern transformations):

      (3) All modelling is confined to one spatial dimension, and the very definition of a "non-trivial" transformation is framed in terms of peak positions along a line, which clearly must be reformulated for higher dimensions. It's well-known that diffusions in 1, 2, and 3 dimensions are also dramatically different, so the relevance of the three-class taxonomy to real multicellular tissues remains unclear, or at least should be explained in more detail.

      Reviewer #2:

      (on non-trivial pattern transformations):

      (5) The definition of non-trivial pattern formation is provided only in the Supplementary Information, despite its central importance for interpreting the main results. It would significantly improve clarity if this definition were included and explained in the main text. Additionally, it remains unclear how the definition is consistently applied across the different initial conditions. In particular, the authors should clarify how slope-based measures are determined for both the random noise and sharp peak/step function initial states. Furthermore, the authors do not specify how the sign function is evaluated at zero. If the standard mathematical definition sgn(0)=0 is used, then even a simple widening of a peak could fulfill the criterion for nontrivial pattern transformation.

      We agree with Reviewer #2 that including a more detailed definition of non-trivial pattern transformation in the main text would enhance the clarity of the paper. The one-dimensional (1D) definition currently provided in the Supplementary Information was chosen because all computations presented therein involve exclusively one-dimensional patterns. However, we acknowledge that this definition, as it was, did not have a totally unambiguous generalization to higher dimensions. Therefore, in a revised version of the manuscript, we will incorporate an expanded definition applicable to higher-dimensional cases.

      This general definition of a non-trivial pattern transformation should make no reference to the sign of spatial derivatives of either the initial or resulting patterns. Specifically, a pattern transformation is considered non-trivial if it satisfies the following criteria:

      - It is heterogeneous: The resulting pattern is heterogeneous in space.

      - It is rearranging: The arrangement of critical points (i.e. peaks, valleys and saddle points in a gene product concentration) along the domain in the resulting pattern of a gene product is different to the arrangement of critical points in its initial pattern. This includes the emergence of new critical points, the disappearance of existing ones, or the spatial displacement of critical points from one location to another.

      - It is non-replicating: The spatial arrangement of critical points in the pattern of one gene product must differ from that of any other upstream gene product.

      Nonetheless, our two initial patterns are spatially discontinuous functions: in homogeneous initial patterns, the white noise is discontinuous by definition; and for the spike and spike+homogeneous initial patterns, we use sharp spikes defined by the rectangular function, which is discontinuous at the spike boundaries. Therefore, the aforementioned definition should be supplemented with the following two ad hoc assumptions:

      - Homogeneous initial patterns do not comprise any critical point. White noise in this type of initial patterns represents small thermodynamic fluctuations around the steady state and, for the purpose of pattern transformation, this is equivalent to a constant concentration along the domain.

      - Spike and spike+homogeneous initial patterns each contain a single critical point located at the center of the spike. The sharp spikes, modeled using the rectangular function, serve as a theoretical idealization to facilitate mathematical analysis. Once diffusion begins to act, these sharp boundaries are smoothed into differentiable gradients, maintaining a unique critical point at the center of the initial spike, which is the most relevant information for pattern transformation.
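
      The three criteria and the two conventions above can be sketched for the 1D case as follows. This is a minimal illustration, not the authors' implementation: the function names `critical_points` and `is_non_trivial`, the tolerance `tol`, and the exact-index comparison of critical points are all assumptions made here. In 1D only peaks and valleys occur, so saddle points are not handled.

```python
from typing import List, Sequence, Tuple

CriticalPoint = Tuple[int, str]  # (grid index, "peak" or "valley")


def critical_points(profile: Sequence[float], tol: float = 1e-9) -> List[CriticalPoint]:
    """Locate interior peaks and valleys of a 1D concentration profile.

    Steps smaller than `tol` are treated as flat. This mirrors the two
    conventions in the text (assumed encoding): small noise around a
    homogeneous state yields no critical point, and a rectangular spike
    yields a single peak reported at the centre of its flat top.
    """
    points: List[CriticalPoint] = []
    last_sign = 0  # sign of the most recent non-flat step (0 = none yet)
    last_idx = 0   # index at which that step ended
    for i in range(1, len(profile)):
        step = profile[i] - profile[i - 1]
        if abs(step) <= tol:
            continue  # flat stretch: keep scanning
        sign = 1 if step > 0 else -1
        if last_sign != 0 and sign != last_sign:
            # The slope changed sign somewhere in [last_idx, i - 1];
            # report the centre of that (possibly flat) stretch.
            centre = (last_idx + i - 1) // 2
            points.append((centre, "peak" if last_sign > 0 else "valley"))
        last_sign = sign
        last_idx = i
    return points


def is_non_trivial(
    initial: Sequence[float],
    result: Sequence[float],
    upstream: Sequence[Sequence[float]],
    tol: float = 1e-9,
) -> bool:
    """Check the three criteria: heterogeneous, rearranging, non-replicating."""
    heterogeneous = max(result) - min(result) > tol
    result_cps = critical_points(result, tol)
    rearranging = result_cps != critical_points(initial, tol)
    non_replicating = all(result_cps != critical_points(u, tol) for u in upstream)
    return heterogeneous and rearranging and non_replicating


if __name__ == "__main__":
    spike = [0, 0, 0, 1, 0, 0, 0]
    widened = [0, 0.2, 0.6, 1, 0.6, 0.2, 0]   # same peak, only broader
    stripes = [0, 1, 0, 0.5, 0, 1, 0]          # new peaks and valleys appear
    print(is_non_trivial(spike, widened, []))  # widening alone is trivial
    print(is_non_trivial(spike, stripes, []))  # new critical points: non-trivial
```

      Under these assumptions, a peak that merely widens keeps its single critical point at the same position and is classified as trivial, while the appearance of additional peaks or valleys is classified as non-trivial unless an upstream gene product already shows the same arrangement.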

      Finally, it is worth recalling that our gene network classification is fundamentally based on an analysis of the dispersion relation associated with the gene network, and the construction of this dispersion relation is independent of the spatial dimensionality of the domain (i.e. it does not require assuming any specific number of dimensions). Having the description of this dispersion relation only in the SI may have hindered the readability of the article; it will, consequently, be moved to the main text in an upcoming version. Thus, the gene networks that can lead to pattern transformation are the same in 1D, 2D or 3D. As for the resulting patterns, the broad description we provide also applies to any number of dimensions: these would be periodic, non-periodic as in the amplified noise patterns, or non-periodic as in the hierarchic networks. For the latter, notice that, except for boundary effects that we later discuss, the spike initial condition is radially symmetric and thus the patterns resulting from it will also be radially symmetric. We will make this point more explicit in a revised version of the article, especially since, as suggested, this important portion of the Supplementary Information will be incorporated into the main text.
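
      The dimension-independence argument can be made concrete with a generic sketch. It assumes a continuous reaction-diffusion form linearized around a homogeneous steady state, on an unbounded or periodic domain; the symbols (the Jacobian J, the diagonal diffusion matrix D, the perturbation δg and the wavevector k) are illustrative and may differ from the notation in the manuscript's SI.

```latex
\begin{aligned}
\partial_t\, \delta\mathbf{g} &= J\,\delta\mathbf{g} + D\,\nabla^{2}\delta\mathbf{g}, \\
\delta\mathbf{g}(\mathbf{x},t) &= \hat{\mathbf{g}}\,e^{\lambda t + i\,\mathbf{k}\cdot\mathbf{x}}
\;\;\Longrightarrow\;\;
\nabla^{2}\delta\mathbf{g} = -\lvert\mathbf{k}\rvert^{2}\,\delta\mathbf{g}, \\
\det\!\bigl(J - \lvert\mathbf{k}\rvert^{2} D - \lambda I\bigr) &= 0 .
\end{aligned}
```

      In this form the growth rate λ depends on the wavevector only through |k|², so the same characteristic equation is obtained whether x lies in one, two or three dimensions.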

      Reviewer 2 suggests that with our definition of non-trivial pattern transformation, the simple widening of a concentration peak would constitute a non-trivial pattern transformation. This is not the case, as already shown by example in the figures, since in a widening there is no change in the position of the critical point. A different situation applies if a wide and completely flat concentration peak (i.e. a plateau) forms. As we will explain in the coming version, this is not possible because of requirement R5.

      We think that this clarification of the definition of non-trivial pattern transformation will also help clarify the next point (B below) since it would make it clearer that this article does not intend to explain which specific resulting pattern would arise from any given gene network.

      B. The main concern among these relates to the validity of our linearization of the model equations and the extension of the results obtained for the linear system to the fully nonlinear system. In this regard, the reviewers’ comments are:

      Reviewer #1:

      (on linearization):

      (2) A central step in the model formulation is the linearisation of the reaction term around a homogeneous steady state; higher-order kinetics, including ubiquitous bimolecular sinks such as A + B → AB, are simply collapsed into the Jacobian without any stated amplitude bound on the perturbations. Because the manuscript never analyses how far this assumption can be relaxed, the robustness of the three-class taxonomy under realistic nonlinear reactions or large spike amplitudes remains uncertain.

      Reviewer #2:

      (on linearization):

      (2) Most of the proofs presented in the Supplementary Information rely on linearized versions of the governing equations, and it remains unclear how these results extend to the fully nonlinear system. We are concerned that the generality of the conclusions drawn from the linear analysis may be overstated in the main text. For example, in Section S3, the authors introduce the concept of dynamic equivalence of transitive chains (Proposition S3.1) and intracellular transitive M-branching (Proposition S3.2), which pertains to the system's steady-state behavior. However, the proof is based solely on the linearized equations, without additional justification for why the result should hold in the presence of nonlinearities. Moreover, the linearized system is used to analyze the response to a "spike initial pattern of arbitrary height C" (SI Chapter S5.1), yet it is not clear how conclusions derived from the linear regime can be valid for large perturbations, where nonlinear effects are expected to play a significant role. We encourage the authors to clarify the assumptions under which the linearized analysis remains valid and to discuss the potential limitations of applying these results to the nonlinear regime.

      In this article, we address two main questions: first, which gene network topologies can give rise to non-trivial pattern transformations; and second, which broad types of resulting patterns these gene network topologies can give rise to. Thus, we do not intend to explain which exact resulting patterns would arise from any given gene network (i.e. a gene network topology with specific functions and interaction strengths or weights), a question for which non-linearities do indeed matter.

      For most known gene regulatory networks, available empirical information is typically limited to the nature of gene product regulations (indicating whether they act as activators or inhibitors), while details about the specific functional form of these regulations are rare. For instance, given two gene products, i and j, the network may indicate that i acts as an activator of j, implying that the concentration of j increases with that of i. However, this increase could follow a variety of functional forms: it may be quadratic, cubic, or any other function f_j(g_i). As we explain in the description of our model, we restrict our study to functions with a monotonicity constraint: higher concentrations of i lead to increased production of j (i.e., f_j is increasing in g_i). In other words, a given gene interaction is always inhibitory or always activatory; it does not change sign. This monotonicity constraint corresponds to requirement (R5) in our main text. This requirement is based on the biologically plausible idea that the complexity of gene regulation in development stems more from the topology of gene networks than from the complexity of the regulation by which a gene product may regulate another (i.e., we use simple monotonic functions).

      Question 1: A critical element for understanding question 1 is the dispersion relation, which was explained in the SI. From the reviewers’ comments it is clear that having this crucial part in the main text of an upcoming version of the article would improve understandability, especially for question 1.

      In brief, any pattern transformation requires the initial pattern to change. The trigger of such change is a change in the concentration of some gene product, either conceptualized as a noise fluctuation (in the homogeneous initial pattern) or a regulated change at a specific point (in the spike initial pattern). Mathematically, both can be conceptualized as perturbations and, for pattern transformation to be possible, such a perturbation must grow so that the initial pattern becomes unstable and can change into another resulting pattern.

      If the perturbation is small, one can use the standard linear perturbation analysis in S6.2 of our Supplementary Information. In other words, the linear analysis is enough to ascertain whether a small perturbation would grow or not. A gene network in which this does not happen would be unable to lead to pattern transformation, whatever the nonlinear part of f(g). In that sense, the linear approximation provides a necessary condition that any gene network needs to fulfill to lead to pattern transformation.
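
      As a minimal sketch of this necessary condition, assuming the linearized system has the usual reaction-diffusion form (Jacobian J plus a diagonal diffusion matrix), the growth rate of a small perturbation with wavenumber k is the largest real part among the eigenvalues of J - k^2 D. The function name `growth_rate` and the two-gene activator-inhibitor values below are hypothetical and chosen only for illustration; they are not taken from the manuscript.

```python
import numpy as np

def growth_rate(J, D, k):
    """Largest real part among the eigenvalues of J - k^2 * diag(D)."""
    A = np.asarray(J, dtype=float) - (k ** 2) * np.diag(D)
    return np.linalg.eigvals(A).real.max()

# Hypothetical two-gene activator-inhibitor network (illustrative values only).
J = np.array([[1.0, -1.0],
              [2.0, -1.5]])
D = np.array([1.0, 20.0])  # the inhibitor diffuses faster than the activator

# Scan wavenumbers and keep those for which a small perturbation grows.
ks = np.linspace(0.0, 2.0, 201)
rates = np.array([growth_rate(J, D, k) for k in ks])
unstable_ks = ks[rates > 0]
print("perturbations grow for some wavenumber:", unstable_ks.size > 0)
```

      If `unstable_ks` were empty for every admissible choice of parameters, the network could not destabilize the initial pattern, which is the sense in which the linear analysis gives a necessary but not sufficient condition.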

      However, the linear analysis cannot ascertain whether a specific gene network will actually lead to pattern transformation (i.e., the condition is not sufficient). This, as well as the shape of the specific resulting pattern, may depend on the non-linear parts too. As we discuss, based on the dispersion relation and other complementary arguments throughout the article, we can also gain some insight into the possible patterns from the linear approximation alone (question 2). These arguments hold thanks to the imposition of requirements (R1-R5) on the function f(g), which prevent strange behaviors stemming from the nonlinear part of the equation.

      The amplitude bound on perturbations mentioned by Reviewer #1 is addressed by requirements (R2) and (R4). Although the solution to the linear system predicts unbounded growth of unstable eigenmodes, we assume functions f(g) whose nonlinear terms eventually halt this growth, thereby ensuring the boundedness of solutions as imposed by (R4). This assumption on the nonlinear part is precisely requirement (R2) on f(g) in the main text.

      The transitive chains and branchings in Section S3 of the Supplementary Information mentioned by Reviewer #2 are topological properties of gene networks and therefore influence only the linear part of the reaction-diffusion equations. This is why the proofs in that section are based on the linearized equations. We agree that clarifying this point in the text, as suggested by the reviewer, would improve the reader’s understanding of the section.

      Regarding Reviewer #2’s concerns about large perturbations, we acknowledge that the phrasing “arbitrary height” may be confusing. For the homogeneous initial conditions these perturbations are assumed to be small because they are actually molecular noise (otherwise the initial condition could not be considered homogeneous in the classical sense of developmental biology models). In the spike initial conditions in hierarchic networks the perturbation is not necessarily small. For the analysis provided in the SI we indeed assume that the perturbations are small enough for the linear approximation to be possible. Notice, however, that since these networks require an intracellular self-activating loop upstream of the first extracellular signal, the effective perturbation would rapidly grow to a value determined by that loop.

      In general, the height of the initial spike does not affect the fact that hierarchic networks can lead to non-trivial pattern transformation. By definition these networks require the secretion of an extracellular signal from the cells in the spike (otherwise no change in gene product concentrations can occur over space). By definition this signal is not produced by any other cells and, thus, its concentration is governed by diffusion from the spike and its production in the cells in the spike. Thus, whatever the initial height of the spike and whatever the non-linearities in f(g), the signal’s concentration would decrease with the distance from the spike. As explained in the main text, this would lead to non-trivial pattern transformations if other general conditions are met. In general, the height of the initial perturbation can affect which specific pattern transformation would arise from a specific gene network but not which gene network topologies can lead to pattern transformation. This will be more clearly stated in an upcoming version of the article.

      C. In the following, we respond to the remaining concerns raised by the reviewers:

      Reviewer #1:

      (1) The Results section is difficult to follow. Key logical steps and network configurations are described shortly in prose, which constantly require the reader to address either SI or other parts of the text (see numerous links on the requirements R1-R5 listed at the beginning of the paper) to gain minimal understanding. As a result, a scientifically literate but non-specialist reader may struggle to grasp the argument with a reasonable time invested.

      We acknowledge that the current version of the main text may not be as clear as we intended. Initially, we believed that placing the more technical mathematical passages in the Supplementary Information would make the main text more accessible to readers. However, we agree with the reviewer that including some of these computations in the main text could improve clarity. We also believe that adding a summary table outlining all the model’s requirements would further contribute to that goal.

      Reviewer #2:

      (1) We have serious concerns regarding the validity of the simulation results presented in the manuscript. Rather than simulating the full nonlinear system described by Equation (1), the authors base their results on a truncated expansion (Equation S.8.2) that captures only the time evolution of small deviations around a spatially homogeneous steady state. However, it remains unclear how this reduced system is derived from the full equations (specifically, which terms are retained or neglected and why) and how the expansion of the nonlinear function can be steady-state independent, as claimed. Additionally, in simulations involving the spike plus homogeneous initial condition, it is not evident (or, where equations are provided, it is not correct) that the assumed global homogeneous background actually corresponds to a steady state of the full dynamics. We elaborate on these concerns in the following:

      We believe there has been a misunderstanding regarding the presentation of the model equations (S8.2) used throughout our simulations. Accordingly, we agree that this relevant section of the Supplementary Information should be rewritten in a revised version of the manuscript to clarify this issue. Below, we address all the concerns raised by the reviewer.

      Equation (S8.2) represents the full nonlinear system described in Equation (1). While we recognize that the model may oversimplify real biological processes, its purpose is to illustrate our general statements about pattern formation rather than to capture any specific or detailed mechanism. In this context, model (S8.2) offers three key advantages for our goals: it allows rapid manipulation of gene network topology simply by modifying the matrix J, making it ideal for illustrating pattern formation across different network classes; it accommodates gene networks of arbitrary size, unlike other models, such as the classical Gierer-Meinhardt model, which are limited to two-element Turing or noise-amplifying networks; and, due to the simplicity of its nonlinear terms, it involves relatively few free parameters, facilitating the fine-tuning needed to identify parameter regions where non-trivial pattern transformations occur.

      Indeed, we find that the ability of model (S8.2) to illustrate our results despite having such simple nonlinear terms (bearing in mind that at least some nonlinearity is always necessary for self-organization) strongly supports the claim that the capacity of a gene network to produce pattern transformations is fully determined by the linear part of Equation (1). In this sense, nonlinear terms primarily influence the precise parameter values at which these transformations occur and contribute to shaping specific features of the resulting patterns.

      Model (S8.2) has been successfully employed in pattern formation studies elsewhere in the literature; accordingly, we provide relevant bibliographic references to support its widespread use.

      We believe the misunderstanding arises from our explanation of the biological interpretation of the model. As noted in the accompanying bibliography, the model is based on a general reaction-diffusion mechanism assuming the existence of a steady state. However, this conceptual reaction-diffusion framework is not the same as our Equation (1); rather, it was introduced by the original proponents of the model in the seminal paper cited in our text. In this context, Equation (S8.2) describes small concentration perturbations around that steady state, where the variables represent deviations in concentration relative to the general steady state.

      The aforementioned general steady state corresponds to the trivial equilibrium point g≡0 in equations (S8.2). Consequently, all our simulations based on model (S8.2) start from this steady state, to which we add white noise to generate homogeneous initial patterns or a sharp spike for the two types of spike initial patterns.

      It is also worth noting that Equations (S8.2) represent a non-dimensional model.
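
      The simulation procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the exact form of (S8.2) is not reproduced in this response, so the code assumes linear terms given by J, a cubic degradation term for each species, and Fickian diffusion on a periodic 1D lattice. The function name `simulate`, the parameter values, and the explicit Euler scheme are hypothetical choices, not the implementation used for the figures.

```python
import numpy as np

def simulate(J, D, mu, g0, dx=1.0, dt=0.01, steps=1000):
    """Explicit Euler integration of the ASSUMED form
        dg_i/dt = sum_j J_ij g_j - mu_i * g_i**3 + D_i * d2g_i/dx2
    on a periodic 1D lattice. g0 has shape (n_genes, n_cells)."""
    g = np.array(g0, dtype=float)
    J = np.asarray(J, dtype=float)
    D = np.asarray(D, dtype=float)[:, None]
    mu = np.asarray(mu, dtype=float)[:, None]
    for _ in range(steps):
        # Discrete Laplacian with periodic boundaries.
        lap = (np.roll(g, 1, axis=1) - 2.0 * g + np.roll(g, -1, axis=1)) / dx**2
        g = g + dt * (J @ g - mu * g**3 + D * lap)
    return g

# Hypothetical two-gene network; g = 0 is the steady state of the assumed model.
J = [[1.0, -1.0], [2.0, -1.5]]
D = [1.0, 20.0]
mu = [1.0, 1.0]

# Homogeneous initial pattern: the steady state g = 0 plus small white noise.
rng = np.random.default_rng(0)
g0 = 0.01 * rng.standard_normal((2, 64))
pattern = simulate(J, D, mu, g0, steps=2000)
```

      A spike initial pattern would replace the noise with a single non-zero lattice site. The time step is kept small relative to dx^2 / D so that the explicit scheme remains stable.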

      It is assumed that the homogeneous steady states are given by g_i=0 and g_i=c_i, where 1/c_i = \mu_i or \hat{\mu}_i, independently of the specific network structure. However, the basis for this assumption is unclear, especially since some of the functions do not satisfy this condition -for example, f5 as defined below Eq. S8.10.5. Moreover, if g_i=c_i does not correspond to a true steady state, then the time evolution of deviations from this state is not correctly described by Eq. S8.2, as the zeroth-order terms do not vanish in that case.

      From the explanations above, it is important to distinguish two scales in the process: the scale of small perturbations, where equations (S8.2) apply; and the global scale, where the conceptual general reaction-diffusion system operates. Since the specific form of this general system does not affect equations (S8.2), we assume that it follows any of the models cited in the text, which yield a non-zero steady state.

      In this sense, Equations (S8.2) represent a small concentration deviation of such a global system, and g(t, x) is a relative concentration where g≡0 represents that steady state, g>0 are concentrations above it, and g<0 are concentrations below it.

      As previously mentioned, simulations are performed using Equations (S8.2) starting from the equilibrium point g≡0. The result of these simulations is then superimposed on the non-zero steady state and presented in the figures throughout the article.

      Using the full model instead of the simplified Equations (S8.2) may yield slightly different patterns, but it does not affect the gene network’s ability to produce pattern transformations, nor does it alter the main structural properties of the patterns (for example, the periodic nature of patterns generated by Turing networks).

      Additionally, the equations used contain only linear terms and a cubic degradation term for each species g_i, while neglecting all quadratic terms and cubic terms involving cross-species interactions (i≠j). An explanation for this selective truncation is not provided, and without knowledge of the full equation (f), it is impossible to assess whether this expansion is mathematically justified. If, as suggested in the Supplementary Information, the linear and cubic terms are derived from f, then at the very least, the Jacobian matrix should depend on the background steady-state concentration. However, the equations for the small deviation around a steady state (including the Jacobian matrix) used in the simulations appear to be independent of the particular steady state concentration.

      The Jacobian of Equation (S8.2) is independent of g because g represents a small perturbation around a steady state of a general reaction-diffusion system. Consequently, the matrix J corresponds to the Jacobian of the general system evaluated at that steady state. Evaluating the Jacobian of equations (S8.2) at the equilibrium point g≡0 -which represents the general steady state- recovers the matrix J.
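
      As a short worked check of this statement, assume (as the reviewer describes) that the reaction term in (S8.2) consists of linear terms plus a cubic degradation term for each species. The symbol \mu_i for the cubic coefficient is an assumption made here for illustration. Differentiating and evaluating at g≡0 then recovers J:

```latex
f_i(g) = \sum_j J_{ij}\, g_j - \mu_i\, g_i^{3}
\quad\Longrightarrow\quad
\frac{\partial f_i}{\partial g_j} = J_{ij} - 3\,\mu_i\, g_i^{2}\,\delta_{ij},
\qquad
\left.\frac{\partial f_i}{\partial g_j}\right|_{g \equiv 0} = J_{ij}.
```

      Under this assumed form, the cubic term contributes nothing to the Jacobian at g≡0, so the linearization there is governed by J alone.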

      This is why we believe that the differences observed between the spike-only initial condition and the spike superimposed on a homogeneous background are not due to the initial conditions themselves, but rather result from a modified reaction scheme introduced through a questionable cutoff.

      "In simulations with spike initial patterns, the reference value g≡0 represents an actual concentration of 0 and therefore, we must add to (S8.2) a Heaviside function Φ acting of f (i.e., Φ(f(g))=f(g) if f(g)>0 , Φ(f(g))=0 if f(g)≤0 ) to prevent the existence of negative concentrations for any gene product (i.e., g_i<0 for some i )." (SI chapter S8).

      This cutoff alters the dynamics (no inhibition) and introduces a different reaction scheme between the two simulations. The need for this correction may itself reflect either a problem in the original equations (which should fulfill the necessary conditions and prevent negative concentrations (R4 in main text)) or the inappropriateness of using an expanded approximation which assumes independence on the steady state concentration. It is already questionable if the linearized equations with a cubic degradation term are valid for the spike initial conditions (with different background concentration values), as the amplitude of this perturbation seems rather large.

      For homogeneous and spike+homogeneous initial conditions, we interpret equations (S8.2) as small perturbations around a non-zero steady state of a general reaction-diffusion system. For spike-only initial conditions, that steady state is zero. As mentioned above, g≡0 then represents this steady state of zero concentration, g>0 are positive concentrations of the general system, and g<0 would represent unfeasible negative concentrations of the general system. Therefore, the use of a cutoff function to handle such initial conditions is justified. Moreover, this cutoff function is the same as the one employed in the reference general system cited in our paper.
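
      The cutoff quoted from SI chapter S8 can be written down directly from its definition: Φ(f) equals f where f is positive and 0 otherwise. The sketch below only transcribes that definition; the function name `cutoff` and the sample values are hypothetical, and how Φ is wired into the full simulation code is not specified here.

```python
import numpy as np

def cutoff(f_values):
    """Phi(f) = f where f > 0, and 0 where f <= 0 (applied elementwise)."""
    return np.maximum(f_values, 0.0)

# Illustrative reaction-term values for three gene products.
f_values = np.array([-1.0, 0.0, 2.5])
clipped = cutoff(f_values)

# With the cutoff, a reaction-only Euler step can never lower a concentration,
# so a non-negative concentration stays non-negative.
g = np.array([0.0, 0.3, 1.0])
dt = 0.1
g_next = g + dt * clipped
```

      Because the reaction term can no longer be negative, inhibition cannot push a concentration below its current value, which is the change in dynamics the reviewer points to.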

      We acknowledge that the cutoff influences the simulations and accounts for the differences observed between spike and spike+homogeneous initial conditions. However, this distinction reflects what occurs in real biological systems, which is precisely why we differentiate these two types of initial states. For instance, the emergence of a periodic pattern in a noise-amplifying network depends critically on the formation of regions with concentrations below the steady state near the initial spike. Such regions can form in spike-plus-homogeneous initial patterns but not in spike-only initial patterns, where concentrations below the steady state would correspond to biologically unfeasible negative values.

      Lastly, we note that under the current simulation scheme, it is not possible to meaningfully assess criteria RH2a and RH2b, as they rely on nonlinear interactions that are absent from the implemented dynamics.

      It is explicitly stated in the relevant subsections of Section S7 in the Supplementary Information that, for the simulations involving RH2a and RH2b, the function f(g) in equation (S8.2) is modified by adding an ad hoc quadratic term to enable the assessment of these criteria.

      (3) Several statements in the main text are presented without accompanying proof or sufficient explanation, which makes it difficult to assess their validity. In some cases, the lack of justification raises serious doubts about whether the claims are generally true. Examples are:

      "For the purpose of clarity we will explain our results as if these cells have a simple arrangement in space (e.g., a 1D line or a 2D square lattice) but, as we will discuss, our results shall apply with the same logic to any distribution of cells in space." (Main text l.145-l.148).

      We believe that the confusion in this statement arises from the ambiguous use of the phrase “our results”. We will revise the text to provide a more precise description. Specifically, by “our results,” we refer to the conclusion that it is possible to determine whether a gene network leads to non-trivial pattern transformations based solely on its topology. This conclusion is independent of the dimensionality of space, as none of our arguments rely on assumptions specific to spatial dimensions. While one-dimensional examples are used for clarity and illustration, the underlying reasoning applies generally. In an improved version of the article, we will clarify this point explicitly and move relevant arguments from the Supplementary Information into the main text.

      Critically, our classification of gene networks is ultimately based on an argument concerning the dispersion relation associated with the network, and the construction of this dispersion relation is independent of the spatial dimensionality of the domain. In this sense, the networks identified in the text as capable of producing pattern transformations will be able to generate non-trivial pattern transformations in any spatial domain and in any number of dimensions. While the specific parameter values that permit such transformations may vary depending on the geometry and dimensionality of the domain, the existence of at least one such parameter set remains unaffected.

      The geometry of the domain can influence the specific form of the resulting patterns, but it does not alter the broader class of patterns (e.g., periodic patterns, peaks emerging around a spike, etc.) that a given gene network topology can produce. One such geometric influence, commonly observed in simulations, involves boundary effects. For example, structures such as peaks or rings forming near the boundaries may appear higher, broader, or spatially shifted compared to those arising in the central regions of the domain. However, we think a pattern consisting of a periodic train of peaks where only those near the boundary are slightly different can still be classified as a periodic pattern.

      "For any non-trivial pattern transformation (as long as it is symmetric around the initial spike), there exists an H gene network capable of producing it from a spike initial pattern." (Main text l.366f).

      A justification for this statement is provided shortly after the claim, although we acknowledge that the current explanation is somewhat cumbersome and would benefit from a clearer presentation in a revised version of the main text.

      A more detailed justification is provided in the Supplementary Information, based on three key ideas. First, any pattern (provided it is symmetric with respect to the initial spike) can be described as an arrangement of peaks with varying heights and spatial positions along a one-dimensional domain. Second, there exists a simple gene network—the diamond network—that, through parameter tuning, can produce two peaks of arbitrary height and symmetric position relative to the initial spike. Third, by placing multiple diamond networks positively upstream of a common gene product, that gene product can express peaks at each location where the upstream diamond networks induce them. Under mild additional conditions, this mechanism allows the formation of essentially any symmetric pattern. These mild conditions, along with a detailed analysis of the diamond network’s ability to generate peaks with controllable height and position, are discussed in the Supplementary Information.

      "In 2D there are no peaks but concentric rings of high gene product concentration centered around the spike, while in 3D there are concentric spherical shells." (Main text l. 447ff).

      This result pertains specifically to pattern transformations arising from spike initial patterns. As defined in the text, spike initial patterns are radially symmetric. Since diffusion preserves radial symmetry, pattern transformations from spike initial patterns in two or three dimensions reduce to effectively one-dimensional transformations along each radial direction. In this framework, each pair of concentration peaks symmetric with respect to the spike in one dimension corresponds to a ring surrounding the spike in two dimensions, and each ring in two dimensions becomes a hollow spherical shell around the spike in three dimensions.
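
      The reduction to a radial problem can be made explicit. Assuming Equation (1) has the standard form of a reaction term plus Fickian diffusion with coefficient D_i (the notation below is an assumption for illustration), a radially symmetric solution in d spatial dimensions depends on x only through r, and the Laplacian reduces to its radial part:

```latex
\frac{\partial g_i}{\partial t}
  = f_i(g) + D_i \left( \frac{\partial^{2} g_i}{\partial r^{2}}
  + \frac{d-1}{r}\,\frac{\partial g_i}{\partial r} \right),
\qquad r = \lVert x \rVert, \quad d \in \{1, 2, 3\}.
```

      For d = 1 the extra term vanishes and the one-dimensional equation is recovered. For d = 2 and d = 3 a maximum of g_i at radius r_0 corresponds to a ring or a spherical shell of radius r_0, respectively, although the additional first-derivative term means its exact position and height need not match the one-dimensional case.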

      We agree that including a brief section in the Supplementary Information to clarify these subtleties would be helpful for readers to better understand the generalization of certain patterns to higher dimensions.

      (4) The study identifies one-signal networks and examines how combinations of these structures can give rise to minimal pattern-forming subnetworks. However, the analysis of the combinations of these minimal pattern-forming subnetworks remains relatively brief, and the manuscript does not explore how the results might change if the subnetworks were combined in upstream and downstream configurations. In our view, it is not evident that all possible gene regulatory networks can be fully characterized by these categories, nor that the resulting patterns can be reliably predicted. Rather, the approach appears more suited to identifying which known subnetworks are present within a larger network, without necessarily capturing the full dynamics of more complex configurations.

      We acknowledge that our explanation regarding the combination of sub-networks was relatively brief, and we intend to address this in a revised version. Our argument that combining sub-networks does not produce qualitatively new types of pattern transformations, beyond those already described, is based on the dispersion relation. Although this relation was only detailed in the Supplementary Information, it is central to our argument and will therefore be moved to the main text. Below, we provide an outline of this argument:

      Our study identifies two distinct behaviors of the principal branch of the dispersion relation at large wavenumbers. Based on this, gene networks capable of pattern formation can be classified into two categories: networks of the first kind, where the real part of the principal branch diverges to infinity as the wavenumber increases; and networks of the second kind, where the real part of the principal branch converges to a positive finite value for large wavenumbers. Naturally, this argument applies to any gene network irrespective of which, or how many, sub-networks are used to build it.

      Any gene regulatory network capable of pattern formation falls into one of these two categories. We identified that networks of the first kind contain at least one Turing sub-network, whereas networks of the second kind include either an H sub-network or a noise-amplifying sub-network. In this way, the primary objective of our study, namely achieving a topological classification of gene regulatory networks capable of pattern formation, is fulfilled. It is important to note that while the dispersion relation provides broad information about the possible resulting patterns a gene network topology can produce (e.g., periodic versus noisy), it does not specify the exact patterns that emerge for each particular set of parameter values.
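
      The two large-wavenumber behaviors can be illustrated numerically. The sketch below assumes the principal branch is the largest real part among the eigenvalues of J - k^2 D, and it reads "diverges" in the first case as the branch decreasing without bound; that reading, the function name `principal_branch`, and both example networks are assumptions made for illustration only.

```python
import numpy as np

def principal_branch(J, D, k):
    """Largest real part among the eigenvalues of J - k^2 * diag(D)."""
    A = np.asarray(J, dtype=float) - (k ** 2) * np.diag(D)
    return np.linalg.eigvals(A).real.max()

# Case A (hypothetical): both gene products diffuse.
J_a = [[1.0, -1.0], [2.0, -1.5]]
D_a = [1.0, 20.0]

# Case B (hypothetical): gene 0 is intracellular (D = 0) and self-activating,
# gene 1 is a diffusing extracellular signal.
J_b = [[0.5, -1.0], [1.0, -1.0]]
D_b = [0.0, 1.0]

for k in (1.0, 10.0, 100.0):
    print(k, principal_branch(J_a, D_a, k), principal_branch(J_b, D_b, k))
```

      In case B the branch approaches the self-activation strength of the non-diffusing gene product (0.5 here) as k grows, so perturbations of arbitrarily short wavelength keep growing, whereas in case A short wavelengths are damped ever more strongly.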

      Finally, regarding the shape of the resulting patterns, Figure S10 in the Supplementary Information exemplifies the notion that the behavior of combined networks can be understood as a combination of the individual behaviors of each constituent sub-network (note that the contribution of each type of sub-network in the resulting pattern is readily distinguishable). Consequently, we focus our detailed analysis on the patterning properties of the fundamental classes.

      (6) The manuscript lacks a clear and detailed explanation of the underlying model and its assumptions. In particular, it is not well-defined what constitutes a "cell" in the context of the model, nor is it justified why spatial features of cells -such as their size or boundaries- can be neglected. Furthermore, the concept of the extracellular space in the one-dimensional model remains ambiguous, making it unclear which gene products are assumed to diffuse.

      The size of cells is ignored in our model because we assume that they are small enough with respect to the total size of the domain that the space-continuous reaction-diffusion equation (equation (1) in the main text) holds. Conceptually, one could understand cells in our model as the pieces of an even partition of the domain into small subdomains surrounding each position x. This is, in any case, the standard procedure in most models of pattern formation by reaction-diffusion in embryonic development.

      For extracellular signals, we assume that g(t, x) corresponds to the concentration of the signal in the extracellular space surrounding the cell located at position x. The extracellular space is any fluid medium for which Fick's laws apply and, therefore, the Fickian diffusion term in equation (1) is valid.

      For intracellular gene products, we assume that g(t, x) corresponds to the concentration of such a gene product within the cell at position x (if the gene product in question is a transcription factor, for example), or on its surface (if it is a membrane-bound receptor). Once collapsed into the continuous equations, there is no difference between being strictly within the cell or on its boundary. The only important fact is that these gene products cannot diffuse.

      Regarding cell boundaries, let us consider an extracellular signal s that regulates a transcription factor i within cells (in our model, i is an intracellular gene product). Such regulation must be mediated by a membrane-bound receptor, which corresponds to an intracellular gene product j. In terms of the gene regulatory network this is s → j → i. The cell boundary effects mentioned by the reviewer should be encapsulated in the specific functional form of the regulation function f(g), but they have no effect on the actual topology of the network. Consequently, they are outside the scope of this study: as we mentioned before, considering different non-linear terms for f(g) will affect the parameter range for which a gene network is capable of producing non-trivial pattern transformations, but not its overall ability to produce them (i.e., the existence of at least one choice of model parameters for which such transformations take place).

      Finally, we would like to once again express our sincere gratitude to all reviewers for their insightful and constructive feedback. We are confident that the thorough peer review process will significantly enhance both the clarity and depth of our work. We greatly value the detailed comments provided and will carefully incorporate them in the preparation of a revised manuscript, which we intend to submit in the coming months.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      The authors developed a sequence-based method to predict drug-interacting residues in IDP, based on their recent work, to predict the transverse relaxation rates (R2) of IDP trained on 45 IDP sequences and their corresponding R2 values. The discovery is that the IDPs interact with drugs mostly using aromatic residues that are easy to understand, as most drugs contain aromatic rings. They validated the method using several case studies, and the predictions are in accordance with chemical shift perturbations and MD simulations. The location of the predicted residues serves as a starting point for ligand optimization.

      Strengths:

      This work provides the first sequence-based prediction method to identify potential drug-interacting residues in IDP. The validity of the method is supported by case studies. It is easy to use, and no time-consuming MD simulations and NMR studies are needed.

      Weaknesses:

      The method does not depend on the information of binding compounds, which may give general features of IDP-drug binding. However, due to the size and chemical structures of the compounds (for example, how many aromatic rings), the number of interacting residues varies, which is not considered in this work. Lacking specific information may restrict its application in compound optimization, aiming to derive specific and potent binding compounds.

      We fully recognize that different compounds may have different interaction propensity profiles along the IDP sequence. In future studies, we will investigate compound-specific parameter values. The limiting factor is training data, but such data are becoming available.

      Reviewer #2 (Public review):

      Summary:

      In this work, the authors introduce DIRseq, a fast, sequence-based method that predicts drug-interacting residues (DIRs) in IDPs without requiring structural or drug information. DIRseq builds on the authors' prior work looking at NMR relaxation rates, and presumes that those residues that show enhanced R2 values are the residues that will interact with drugs, allowing these residues to be nominated from the sequence directly. By making small modifications to their prior tool, DIRseq enables the prediction of residues seen to interact with small molecules in vivo.

      Strengths:

      The preprint is well written and easy to follow.

      Weaknesses:

      (1) The DIRseq method is based on SeqDYN, which itself is a simple (which I do not mean as a negative - simple is good!) statistical predictor for R2 relaxation rates. The challenge here is that R2 rates cover a range of timescales, so the physical intuition as to what exactly elevated R2 values mean is not necessarily consistent with "drug interacting". Presumably, the authors are not using the helix boost component of SeqDYN here (it would be good to explicitly state this). This is not necessarily a weakness, but I think it would behove the authors to compare a few alternative models before settling on the DIRseq method, given the somewhat ad hoc modifications to SeqDYN to get DIRseq.

      Actually, the factors that elevate R2 are well established. These are local interactions and residual secondary structures (if any). The basic assumption of our method is that intra-IDP interactions that elevate R2 convert to IDP-drug interactions. This assumption was supported by our initial observation that the drug interaction propensity profiles predicted using the original SeqDYN parameters already showed good agreement with CSP profiles. We only made relatively small adjustments to the parameters to improve the agreement. Indeed, we did not apply the helix boost portion of SeqDYN to DIRseq, and will state this explicitly. We will also compare DIRseq with several alternative models.

      Specifically, the authors previously showed good correlation between the stickiness parameter of Tesei et al and the inferred "q" parameter for SeqDYN; as such, I am left wondering if comparable accuracy would be obtained simply by taking the stickiness parameters directly and using these to predict "drug interacting residues", at which point I'd argue we're not really predicting "drug interacting residues" as much as we're predicting "sticky" residues, using the stickiness parameters. It would, I think, be worth the authors comparing the predictive power obtained from DIRseq with the predictive power obtained by using the lambda coefficients from Tesei et al in the model, local density of aromatic residues, local hydrophobicity (note that Tesei et al have tabulated a large set of hydrophobicity scores!) and the raw SeqDYN predictions. In the absence of lots of data to compare against, this is another way to convince readers that DIRseq offers reasonable predictive power.

      We will compare predictions of these various parameter sets, and summarize the results in a table.

      (2) Second, the DIRseq is essentially SeqDYN with some changes to it, but those changes appear somewhat ad hoc. I recognize that there is very limited data, but the tweaking of parameters based on physical intuition feels a bit stochastic in developing a method; presumably (while not explicitly spelt out) those tweaks were chosen to give better agreement with the very limited experimental data (otherwise why make the changes?), which does raise the question of if the DIRseq implementation of SeqDYN is rather over-parameterized to the (very limited) data available now? I want to be clear, the authors should not be critiqued for attempting to develop a model despite a paucity of data, and I'm not necessarily saying this is a problem, but I think it would be really important for the authors to acknowledge to the reader the fact that with such limited data it's possible the model is over-fit to specific sequences studied previously, and generalization will be seen as more data are collected.

      We have explained the rationale for the parameter tweaks, which were limited to q values for four amino-acid types, i.e., to deemphasize hydrophobic interactions and slightly enhance electrostatic interactions (p. 4-5). We will add that these tweaks were motivated by observations from MD simulations of drug interactions with a-syn (ref 20). As already noted in the response to the preceding comment, we will also present results for the original parameter values as well as for when the four q values are changed one at a time.

      (3) Third, perhaps my biggest concern here is that - implicit in the author's assumptions - is that all "drugs" interact with IDPs in the same way and all drugs are "small" (motivating the change in correlation length). Prescribing a specific lengthscale and chemistry to all drugs seems broadly inconsistent with a world in which we presume drugs offer some degree of specificity. While it is perhaps not unexpected that aromatic-rich small molecules tend to interact with aromatic residues, the logical conclusion from this work, if one assumes DIRseq has utility, is that all IDRs bind drugs with similar chemical biases. This, at the very least, deserves some discussion.

      The reviewer raises a very important point. In Discussion, we will add that it is important to further develop DIRseq to include drug-specific parameters when data for training become available.

      (4) Fourth, the authors make some general claims in the introduction regarding the state of the art, which appear to lack sufficient data to be made. I don't necessarily disagree with the author's points, but I'm not sure the claims (as stated) can be made absent strong data to support them. For example, the authors state: "Although an IDP can be locked into a specific conformation by a drug molecule in rare cases, the prevailing scenario is that the protein remains disordered upon drug binding." But is this true? The authors should provide evidence to support this assertion, both examples in which this happens, and evidence to support the idea that it's the "prevailing view" and specific examples where these types of interactions have been biophysically characterized.

      We will cite several studies showing that IDPs remain disordered upon drug binding.

      Similarly, they go on to say:

      "Consequently, the IDP-drug complex typically samples a vast conformational space, and the drug molecule only exhibits preferences, rather than exclusiveness, for interacting with subsets of residues." But again, where is the data to support this assertion? I don't necessarily disagree, but we need specific empirical studies to justify declarative claims like this; otherwise, we propagate lore into the scientific literature. The use of "typically" here is a strong claim, implying most IDP complexes behave in a certain way, yet how can the authors make such a claim? 

      Here again we will add citations to support the statement.

      Finally, they continue to claim:

      "Such drug interacting residues (DIRs), akin to binding pockets in structured proteins, are key to optimizing compounds and elucidating the mechanism of action." But again, is this a fact or a hypothesis? If the latter, it must be stated as such; if the former, we need data and evidence to support the claim. 

      We will add citations to both compound optimization and mechanism of action.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Major concerns:

      (1) Is the direct binding of MCAK to the microtubule cap important for its in vivo function?

      a. The authors claim that their "study provides mechanistic insights into understanding the end-binding mechanism of MCAK". I respectfully disagree. My concern is that the paper offers limited insights into the physiological significance of direct end-binding for MCAK activity, even in vitro. The authors estimate that in the absence of other proteins in vitro, ~95% of MCAK molecules arrive at the tip by direct binding in the presence of ~ physiological ATP concentration (1 mM). In cells, however, the major end-binding pathway may be mediated by EB, with the direct binding pathway contributing little to none. This is a reasonable concern because the apparent dissociation constant measured by the authors shows that MCAK binding to microtubules in the presence of ATP is very weak (69 µM). This concern should be addressed by 1) calculating relative contributions of direct and EB-dependent pathways based on the affinities measured in this and other published papers and estimated intracellular concentrations. Although there are many unknowns about these interactions in cells, a modeling-based analysis may be revealing. 2) the recapitulation of these pathways using purified proteins in vitro is also feasible. Ideally, some direct evidence should be provided, e.g. based on MCAK function-separating mutants (GDP-Pi tubulin binding vs. catalytic activity at the curled protofilaments) that contribution from the direct binding of MCAK to microtubule cap in EB presence is significant.

      We thank the reviewer for the thoughtful comments.

      (1) We think that the end-binding affinity of MCAK makes a significant contribution to its cellular functions. To elucidate this concept, we now use a simple model shown in Supplementary Appendix-2 (see pages 49-51, lines 1246-1316). In this model, we simplified MCAK and EB1 binding to microtubule ends by considering only these two proteins while neglecting other factors (e.g. XMAP215). Specifically, we considered two scenarios: one in which both proteins freely diffuse in the cytoplasm and another where MCAK is localized to specific cellular structures, such as the centrosome or centromere. Based on the modeling results, we argue that MCAK's functional impact at microtubule ends derives both from its intrinsic end-binding capacity and its ability to strengthen the EB1-mediated end association pathway.

      (2) We agree with the reviewer that MCAK exhibiting a lower end-binding affinity (69 µM) is indeed intriguing, as one might intuitively expect a stronger affinity, e.g. in the nanomolar range. Several factors may contribute to this observation. First, this could be partly due to the in vitro system employed, which may not perfectly replicate in vivo conditions, especially when considering cellular processes quantitatively. Variations in medium composition can significantly influence the binding state. For example, reducing salt concentration leads to a marked increase in MCAK’s binding affinity (Helenius et al., 2006; Maurer et al., 2011; McHugh et al., 2019). Additionally, while numerous binding events with short durations were detected, we excluded transient interactions from our analysis to facilitate quantification. This likely leads to an underestimation of the on-rate and, consequently, the binding affinity. Moreover, to minimize the interference of purification tags (His-tag), we ensured their complete removal during protein sample preparation. Previous studies reported that retaining the His-tag of MAPs affects the binding affinity to microtubules (Maurer et al., 2011; Zhu et al., 2009). Finally, a low affinity is not necessarily unexpected. Considering the microtubule end as a receptor with multiple binding sites for MCAK, the overall binding affinity is in the nanomolar range (260 nM). This does not necessarily contradict MCAK being a microtubule dynamics regulator as only a few MCAK molecules may suffice to induce microtubule catastrophe (as discussed on page 13, lines 408-441).
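      The link between the on-rate and the apparent affinity noted above can be made explicit. The relation below is a sketch that assumes a simple one-step reversible binding scheme, which the response does not specify; the symbols k_on and k_off are introduced here for illustration only.

```latex
K_d = \frac{k_{\mathrm{off}}}{k_{\mathrm{on}}}
```

      Under this assumption, and holding k_off fixed, an underestimated k_on inflates K_d, so the measured binding appears weaker than it is.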

      (3) Ideally, we would search for mutants that specifically interfere with the binding of GDP-Pi-tubulin or the curled protofilaments. However, the mutant we tested significantly impacts the overall affinity of MCAK to microtubules (both end and lattice), making it challenging to isolate and discuss the function of MCAK with respect to the binding to GDP-Pi-tubulin alone. Additionally, we also think that the GDP-Pi-tubulin in the EB cap and the tubulin in the curved protofilaments may share structural similarities. For instance, the tubulin dimers in both states may be less compact compared to those in the lattice, which could explain why MCAK recognizes both simultaneously (Manka and Moores, 2018). However, this remains a conjecture, as there is currently no direct evidence to support it.

      b. As mentioned in the Discussion, preferential MCAK binding to tubulins near the MT tip may enhance MCAK targeting of terminal tubulins AFTER the MCAK has been "delivered" to the distal cap via the EB-dependent mechanism. This is a different targeting mechanism than the direct MCAK-binding. However, the measured binding affinity between MCAK and GMPCPP tubulins is so weak (69 µM), that this effect is also unlikely to have any impact because the binding events between MCAK and microtubule should be extremely rare. Without hard evidence, the arguments for this enhancement are very speculative.

      Please see our response to the comment No. 1. Additionally, we have revised our discussion to discuss the end-binding affinity of MCAK as well as its physiological relevance (please see page 13, lines 408-441; and see Supplementary Appendix-2 in pages 49-51, lines 1246-1316).

      (2) The authors do not provide sufficient justification and explanation for their investigation of the effects of different nucleotides in MCAK binding affinity. A clear summary of the nucleotide-dependent function of MCAK (introduction with references to prior affinity measurements and corresponding MCAK affinities), the justifications for this investigation, and what has been learned from using different nucleotides (discussion) should be provided. My take on these results is that by far the strongest effect on microtubule wall and tip binding is achieved by adding any adenosine, whereas differences between different nucleotides are relatively minor. Was this expected? What can be learned from the apparent similarity between ATP and AMPPNP effects in some assays (Fig 1E, 4C, etc) but not others (Fig 1D,F, etc)?

      We thank the reviewer for this suggestion. We have revised the manuscript accordingly, and the main points of our response are given below.

      (1) The experiment investigating the effects of different nucleotides on MCAK binding affinity was inspired by the previous studies demonstrating that kinesin-13 interactions with microtubules are highly dependent on their adenosine-bound states. For example, kinesin-13s tightly bind microtubules and prefer to form protofilament curls or rings with tubulin in the AMPPNP state, whereas kinesin-13s are considered to move along the microtubule lattice via one-dimensional diffusion in the ADP·Pi state (Asenjo et al., 2013; Benoit et al., 2018; Friel and Howard, 2011; Helenius et al., 2006). Based on these observations, we wondered whether MCAK's adenosine-bound states might similarly affect its binding preference for growing microtubule ends. We have made the motivation clear in the revised manuscript (please see page 7, lines 199-209).

      (2) Our main finding regarding the effects of nucleotides is that MCAK shows differential end-binding affinity and preference based on its nucleotide state. First, MCAK shows the greatest preference for growing microtubule ends in the ATP state, supporting the idea that diffusive MCAK (MCAK·ATP) can directly bind to growing microtubule ends. Second, MCAK·ATP also demonstrates a binding preference for GTPγS microtubules and the ends of GMPCPP microtubules. The similar trends in binding preference suggest that the affinity for GDP·Pi-tubulin and GTP-tubulin likely underpins MCAK’s preference for growing microtubule ends. To clarify these points, we have added further discussions in the manuscript (please see page 8, lines 230-233; page 9, lines 258-270 and pages 13-14, lines 443-458).

      (3) It is not clear why the authors decided to use these specific mutant MCAK proteins to advance their arguments about the importance of direct tip binding. Both mutants are enzymatically inactive. Both show roughly similar tip interactions, with some (minor) differences. Without a clear understanding of what these mutants represent, the provided interpretations of the corresponding results are not convincing.

      We thank the reviewer for this comment. In the revised manuscript, we no longer draw conclusions about the importance of end-binding based on the mutant data. Instead, we think that the mutant data provide insights into the structural basis of the end-binding preference. Therefore, we have rewritten the results in this section to more accurately reflect these findings (please see page 10, lines 295-327).

      (4) GMPCPP microtubules are used in the current study to represent normal dynamic microtubule ends, based on some published studies. However, there is no consensus in the field regarding the structure of growing vs. GMPCPP-stabilized microtubule ends, which additionally may be sensitive to specific experimental conditions (buffers, temperature, age of microtubules, etc). To strengthen the authors' argument, Taxol-stabilized microtubules should be used as a control to test if the effects are specific. Additionally, the authors should consider the possibility that stronger MCAK binding to the ends of different types of microtubules may reflect MCAK-dependent depolymerization events on a very small scale (several tubulin rows). These nano-scale changes to tubulins and the microtubule end may lead to the accumulation of small tubulin-MCAK aggregates, as is seen with other MAPs and slowly depolymerizing microtubules. These effects for MCAK may also depend on specific nucleotides, further complicating the interpretation. This possibility should be addressed because it provides a different interpretation than presented in the manuscript.

      Regarding the two points raised here, our thoughts are as follows:

      (1) The end of GMPCPP-stabilized microtubules differs from that of growing microtubules, with the most obvious known difference being the absence of the region enriched in GDP-Pi-tubulin. We consider the end of GMPCPP microtubules as an analogue of the distal tip of growing microtubules, based on two key features: (1) curled protofilaments and (2) GMPCPP-tubulin, a close analogue of GTP-tubulin. Notably, both features are present at the ends of both GMPCPP-stabilized and growing microtubules. Moreover, we agree with the suggestion to use taxol-stabilized microtubules as a control. This would eliminate the second feature (absence of GTP-tubulin), allowing us to isolate the effect of the first feature. Therefore, we conducted this experiment, and our data showed that MCAK exhibits only a mild binding preference for the ends of taxol-stabilized microtubules, which is much less pronounced than for the ends of GMPCPP microtubules. This observation supports the idea that GMPCPP-stabilized ends closely resemble the growing ends of microtubules.

      (2) The reviewer suggested that stronger MCAK binding to the ends of different types of microtubules might reflect MCAK-dependent depolymerization events on a very small scale. This is an insightful possibility, which we had overlooked in the original manuscript. Fortunately, we performed the experiments at single-molecule concentrations. Upon reviewing the raw data, we found that under ATP conditions, the binding events of MCAK were not cumulative (see Author response image 1 below) and showed no evidence of local accumulation of MCAK-tubulin aggregates.

      Author response image 1.

      A representative kymograph showing GFP-MCAK binding at the ends and lattice of GMPCPP microtubules in the presence of 1 mM ATP (10 nM GFP-MCAK), corresponding to Fig. 5A. Arrow: end-binding of MCAK. Vertical bar: 1 s; horizontal bar: 2 µm.

      (5) It would be helpful if the authors provided microtubule polymerization rates and catastrophe frequencies for assays with dynamic microtubules and MCAK in the presence of different nucleotides. The video recordings of microtubules under these conditions are already available to the authors, so it should not be difficult to provide these quantifications. They may reveal that microtubule ends are different (or not) under the examined conditions. It would also help to increase the overall credibility of this study by providing data that are easy to compare between different labs.

      We thank the reviewer for this suggestion. In the revised manuscript, we have provided data on the growth rates, which are similar across the different nucleotide states (Fig. s1). However, due to the short duration of our recordings (usually 5 minutes, but with a high frame rate, 10 fps), we did not observe many catastrophe events, which prevented us from quantifying catastrophe frequency using the current dataset. Since we measured the binding kinetics of MCAK during the growing phase of microtubules, the similar growth rates and microtubule end morphologies suggest that the microtubule ends are comparable across the different conditions.

      Reviewer #1 (Recommendations For The Authors):

      a. Please provide more details about how the microtubule-bound molecules were selected for analysis (include a description of scripts, selection criteria, and filters, if any). Fig 1A arrows do not provide sufficient information.

      We first measured the fluorescence intensity of each binding event. A probability distribution of these intensities was then constructed and fitted with a Gaussian function. A binding event was considered to correspond to a single molecule if its intensity fell within μ±2σ of the distribution. The details of the single-molecule screening process are now provided in the revised manuscript (see page 17, lines 574-583).
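      The μ±2σ selection described above can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes the Gaussian parameters are estimated from the sample mean and standard deviation rather than from a histogram fit, and the function name `select_single_molecules` and the synthetic intensities are hypothetical.

```python
import random
import statistics


def select_single_molecules(intensities, n_sigma=2.0):
    """Keep events whose intensity lies within mu +/- n_sigma * sigma.

    Assumption: mu and sigma are the sample mean and standard deviation,
    standing in for the Gaussian fit described in the response.
    """
    mu = statistics.fmean(intensities)
    sigma = statistics.stdev(intensities)
    lower, upper = mu - n_sigma * sigma, mu + n_sigma * sigma
    kept = [x for x in intensities if lower <= x <= upper]
    return kept, mu, sigma


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic intensities (A.U.): mostly single molecules plus a few bright outliers.
    data = [rng.gauss(300, 40) for _ in range(500)] + [900.0, 950.0, 1000.0]
    kept, mu, sigma = select_single_molecules(data)
    print(f"{len(kept)} of {len(data)} events kept (mu={mu:.1f}, sigma={sigma:.1f})")
```

      A histogram-based fit would be less sensitive to bright outliers than the moment estimate used here, which is one reason to treat this only as a sketch.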

      b. Evidence that MCAK is dimeric in solution should be provided (gel filtration results, controls for Figs1A - bleaching, or comparison with single GFP fluorophore).

      In the revised manuscript, we provide the gel filtration results of purified MCAK and other proteins used in this study. The elution volume of the peak for GFP-MCAK corresponded to a molecular weight range between 120 kDa (EB1-GFP dimer) and 260 kDa (XMAP215-GFP-his6), suggesting that GFP-MCAK exists as a dimer (~220 kDa) under the experimental conditions (please see Fig. s1 and page 5, lines 104-105). In addition, we also measured the fluorescence intensity of both MCAK<sup>sN+M</sup> and MCAK. MCAK<sup>sN+M</sup> is a monomeric mutant that contains the neck domain and motor domain (Wang et al., 2012). The average intensity of MCAK<sup>sN+M</sup> is 196 A.U., about 65% of that of MCAK (300 A.U.). These two measurements suggest that the purified MCAK used in this study exists as dimers (see Fig. s1).

      c. Evidence that MCAK on microtubules represents single molecules should be provided (distribution of GFP brightness with controls - GFP imaged under identical conditions). Since assay buffers include detergent, which is not desirable, all controls should be done using the same assay conditions. The authors should rule out that their main results are detergent-sensitive.

      (1) Regarding whether MCAK on microtubules represents single molecules: please refer to our responses to the two points above.

      (2) To rule out the effect of Tween-20 (0.0001%, v/v), we performed additional control experiments. The results showed that it has no significant effect on the microtubule-binding affinity of MCAK (see Author response image 2 below).

      Author response image 2.

      Tween-20 (0.0001%, v/v) has no significant effect on the microtubule-binding affinity of MCAK. (A) Representative projection images of GFP-MCAK (5 nM) binding to taxol-stabilized GDP microtubules in the presence of 1 mM AMPPNP with or without Tween-20. The upper panel shows the results of the control experiments performed without MCAK. Scale bar: 5 µm. (B) Statistical quantification of the binding intensity of GFP-MCAK binding to GDP microtubules with or without Tween-20 (53 microtubules from 3 assays and 70 microtubules from 3 assays, respectively). Data are presented as mean ± SEM. Statistical comparisons were performed using the two-tailed Mann-Whitney U test with Bonferroni correction; n.s., no significance.

      d. How did the authors plot single-molecule intensity distributions? I am confused as to why the intensity distribution for single molecules in Fig 1D and 2A looks so perfectly smooth, non-pixelated, and broader than expected for GFP wavelength. Please provide unprocessed original distributions, pixel size, and more details about how the distributions were processed.

      In the revised manuscript, we provided unprocessed original data in Fig. 1B and Fig. 2A. We thank the reviewer for pointing out this problem.

      e. Many quantifications are based on a limited number of microtubules and the number of molecules is not provided, starting from Fig 1D and down. Please provide detailed statistics and explain what is plotted (mean with SEM?) on each graph.

      We performed a thorough inspection of the manuscript and corrected the identified issues.

      f. Plots with averaged data should be supplemented with error bars and N should be provided in the legend. E.g. Fig 1C - average position of MT and peak positions.

      We agree with the reviewer. In the revised manuscript, we have made the changes accordingly (e.g. Fig. 2C).

      g. Detailed information should be provided about protein constructs used in this work including all tags. The use of truncated proteins or charged/bulky tags can modify protein-microtubule interactions.

      We agree with the reviewer. In the revised manuscript, we provide information on all constructs (see Fig. s1 and the related descriptions in Methods, pages 15-16, lines 476-534).

      h. Line 515: We estimated that the accuracy of microtubule end tracking was ~6 nm by measuring the standard error of the distribution of the estimated error in the microtubule end position. - evidence should be provided using the conditions of this study, not the reference to the prior work by others.

      i. Line 520: We estimated that the accuracy of the measured position was ~2 nm by measuring the standard error of the fitting peak location". Please provide evidence.

      Point h-i: we now provide detailed descriptions of how to estimate tracking and measurement accuracy and error in our work. Please see pages 18-19, lines 626-645.

      j. Kymographs in Fig 5G are barely visible. Please provide single-channel greyscale images. What are the dim molecules diffusing on this microtubule?

      We have incorporated the changes suggested by the reviewer. We think that some of the dim signals may result from stochastic background noise, while others likely represent transient binding of MCAK. The exposure time in our experiments was approximately 0.05 seconds; if the binding duration were shorter than this, the signal would be lower (i.e. the “dim” signals). It is important to note that in this study, we selected binding events lasting at least 2 consecutive frames, meaning transient binding events were not included. This point has been clarified in the Methods section (see page 17, lines 573-583).
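      The "at least 2 consecutive frames" criterion described above can be sketched as a run-length filter over a per-frame bound/unbound trace. This is an illustrative assumption about how such a filter might be written, not the authors' analysis code; the function name `binding_events` and the example trace are hypothetical.

```python
from itertools import groupby


def binding_events(bound, min_frames=2):
    """Return (start_frame, n_frames) for each run of bound frames
    that lasts at least `min_frames` consecutive frames."""
    events = []
    frame = 0
    for is_bound, run in groupby(bound):
        n = sum(1 for _ in run)  # length of this run of identical values
        if is_bound and n >= min_frames:
            events.append((frame, n))
        frame += n
    return events


if __name__ == "__main__":
    # Hypothetical trace: 1 = signal detected in that frame, 0 = no signal.
    trace = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
    for start, n in binding_events(trace):
        print(f"event starting at frame {start}, lasting {n} frames")
```

      Single-frame detections, such as the one at frame 1 in the example trace, are discarded, which is why very short-lived binding does not enter the quantification.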

      k. Please provide a methods description for Fig 6. Did the buffer include 1 mM ATP? The presence of ATP would make these conditions more physiological. ATP concentration should be stated clearly in the main text or figure legend.

      The buffer contains ATP. In the revised manuscript, we have provided the methods for the experiments of microtubule dynamics assay, as well as the analysis of microtubule lifetimes and catastrophe frequency (see page 17, lines 561-572 and page 20, lines 685-690).

      l. Line 104: experiment was performed in BRB80 supplemented with 50 mM KCl and 1 mM ATP, providing a nearly physiological ion strength. Please provide a reference or add your calculations in Methods.

      We have provided references on page 5, lines 101-104 of our manuscript.

      m. What was the MCAK concentration in Figure 4? Did the microtubule shorten under any of these conditions?

      In these experiments, we used a very low concentration of MCAK and taxol-stabilized microtubules, so there’s no microtubule shortening observed here. ATP: 10 nM GFP-MCAK; AMPPNP: 1 nM GFP-MCAK; ADP: 10 nM GFP-MCAK; APO state: 0.1 nM GFP-MCAK.

      Other criticism:

      Text improvements are recommended in the Discussion. For example, line 348: Fourth, the loss of the binding preference.. suggests that the binding preference .. is required for the optimal .. preference.

      We thank the reviewer for pointing this out. In the revised manuscript, we conducted a thorough revision and review of the text.

      Reviewer #2 (Public Review):

      Summary:

      In this manuscript, Chen et al. investigate the localization of microtubule kinesin-13 MCAK to the microtubule ends. MCAK is a prominent microtubule depolymerase whose molecular mechanisms of action have been extensively studied by a number of labs over the last ~twenty years. Here, the authors use single-molecule approaches to investigate the precise localization of MCAK on growing microtubules and conclude that MCAK preferentially binds to a GDP-Pi-tubulin portion of the microtubule end. The conclusions are speculative and not well substantiated by the data, making the impact of the study in its current form rather limited. Specifically, greater effort should be made to define the region of MCAK binding on microtubule ends, as well as its structural characteristics. Given that MCAK has been previously shown to effectively tip-track growing microtubule ends through an established interaction with EB proteins, the physiological relevance of the present study is unclear. Finally, the manuscript does not cite or properly discuss a number of relevant literature references, the results of which should be directly compared and contrasted to those presented here.

      We thank the reviewer for the comments. As these suggestions are more thoroughly expressed in the following comments for authors, we will provide the responses in the corresponding sections, as shown below.

      Reviewer #2 (Recommendations For The Authors):

      Significant concerns:

      (1) Establishing the precise localization of MCAK wrt microtubule end is highly non-trivial. More details should be provided, including substantial supplementary data. In particular, the authors claim ~6 nm accuracy in microtubule end positioning - this should be substantiated by data showing individual overlaid microtubule end intensity profiles as well as fits with standard deviations etc. Furthermore, to conclude that MCAK binds behind XMAP215, the authors should look at the localization of the two proteins simultaneously, on the same microtubule end. Notably, EB binding profiles are well known to exponentially decay along the microtubule lattice - this is not very apparent from the presented data. If MCAK's autonomous binding pattern matches that of EB, we should be seeing an exponentially-decaying localization for MCAK as well? However, averaged MCAK signals seem to only be fitted to Gaussian. Note that the EB binding region (i.e. position and size of the EB comet) can be substantially modulated by increasing the microtubule growth rate - this can be easily accomplished by increasing tubulin concentrations or the addition of XMAP215 (e.g. see Maurer et al. Cur Bio 2014). Thus to establish that MCAK on its own binds the same region as EB, experiments that directly modulate the size and the position of this region should be added.

      (1) We thank the reviewer for this comment. Regarding the accuracy in microtubule end positioning, we now provide more details, and please see pages 18-19, lines 625-645 in the revised manuscript.

      (2) Regarding the relative localization of XMAP215 and MCAK, we performed additional experiments to record their localization simultaneously, on the same microtubule end. Our results showed that MCAK predominantly binds behind XMAP215, with 14.5% appearing within XMAP215’s binding region. Please see Fig. 2D-E and lines 184-197 in the revised manuscript.

      (3) Regarding the exponential decay of the EB1 signal along microtubules, we observed that the position probability distribution measured in the present study follows a Gaussian distribution, and the expected exponential decay was not apparent. Since the exponential decay is thought to result from the time delay between tubulin polymerization and GTP hydrolysis, slower polymerization is expected to reduce this latency (Maurer et al., 2014). In our experiments, the growth rate was relatively low (~0.7 µm/min), much slower than the rate observed in cells, where the comet-shaped EB1 signal is most pronounced. A previous study has shown that the exponential decay of EB1 is more pronounced at growth rates exceeding 3 µm/min in vitro (Maurer et al., 2014). Therefore, we think that the relatively slow growth may account for the observed non-exponential distribution of the EB1 signals. The same reason may also explain the distribution of MCAK.
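      The dependence of the comet shape on growth rate argued above can be illustrated with a simple scaling relation. This is a sketch under a first-order maturation model, which is an assumption not stated in the response: v_g denotes the growth rate, k a growth-rate-independent rate constant for loss of the EB-binding state, and λ the resulting decay length; all three symbols are introduced here for illustration.

```latex
\lambda = \frac{v_g}{k}
\qquad\Longrightarrow\qquad
\frac{\lambda_{3}}{\lambda_{0.7}} = \frac{3}{0.7} \approx 4.3
```

      Under these assumptions, the decay length at the slower of the two growth rates quoted above would be roughly four times shorter, which could make an exponential tail hard to distinguish from a Gaussian profile once localization error is included.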

      (4) We agree with the reviewer’s suggestion that altering microtubule growth rate is a valid and effective approach to regulate the EB cap length. However, the conclusion that MCAK binds to the EB region is supported by three lines of evidence: (1) the localization of MCAK at the ends of microtubules, (2) new experimental data showing that MCAK binds to the proximal end of the XMAP215 site, and (3) the tendency of MCAK to bind GTPγS microtubules, similar to EB1. Based on these findings, we did not pursue additional experiments to modify the length of the EB cap.

      (2) Even if MCAK indeed binds behind XMAP215, there is no evidence that this region is defined by the GDP-Pi nucleotide state; it could still be curved protofilaments. GTPyS is an analogue of GTP - to what extent GTPyS microtubules exactly mimic the GDP-Pi-tubulin state remains controversial. Furthermore, nucleotide sensing for EB is thought to be achieved through its binding at the interface of four tubulin dimers. However MCAK's binding site is distinct, and it has been shown to recognize intradimer tubulin curvature. Thus it is not clear how MCAK would sense the nucleotide state. On the other hand, there is mounting evidence that the morphology of the growing microtubule end can be highly variable, and that curved protofilaments may be protruding off the growing ends for tens of nanometers or more, previously observed both by EM as well as by fluorescence (e.g. Mcintosh, Moores, Chretien, Odde, Gardner, Akhmanova, Hancock, Zanic labs). Thus, to establish that MCAK indeed localizes along the closed lattice, EM approaches should be used.

      First, we conducted additional experiments that demonstrate MCAK indeed binds behind XMAP215, supporting the conclusion that MCAK interacts with the EB cap (please see Fig. 2 in the revised manuscript). Second, our argument that MCAK preferentially binds to GDP-Pi tubulin is based on two observations: (1) the binding regions of MCAK overlap with those of EB1, and (2) MCAK preferentially binds to GTPγS microtubules, which are considered a close analogue of GDP-Pi tubulin. Third, understanding the structural basis of how MCAK senses the nucleotide state of tubulin is beyond the scope of the present study. However, inspired by the reviewer’s suggestion, we looked into the structure of the MCAK-tubulin complex. The L2 loop of MCAK makes direct contact with the interdimer interface (Trofimova et al., 2018; Wang et al., 2017), which could provide a structural basis for recognizing the changes induced by GTP hydrolysis. While this remains a hypothesis, it is certainly a promising direction for future research. Fourth, we agree with the reviewer that an EM approach would be ideal for establishing that MCAK localizes along the closed lattice. However, this is not the focus of the current study. Instead, we argue that MCAK binds to the EB cap, where at least some lateral interactions are likely to have formed.

      (3) The physiological relevance of the study is rather questionable: MCAK has been previously established to be able to both diffuse along the microtubule lattice (e.g. Helenius et al.) as well as hitchhike on EBs (Gouveia et al.). Given the established localization of EBs to growing microtubule ends in cells, and apparently higher affinity of MCAK for EB vs. the microtubule end itself (although direct comparisons with the literature have not been reported here), the relevance of MCAK's autonomous binding to dynamic microtubule ends is dubious.

      We thank the reviewer for raising the importance of physiological relevance. Please refer to our response to comment No. 1 of Reviewer 1. Briefly, we think that the end-binding affinity of MCAK makes a significant contribution to its cellular functions. To illustrate this point, we now use a simple model shown in Supplementary Appendix-2 (see pages 49-51, lines 1246-1316). In this model, we simplified MCAK and EB1 binding to microtubule ends by considering only these two proteins while neglecting other factors (e.g. XMAP215). Specifically, we considered two scenarios: one in which both proteins freely diffuse in the cytoplasm and another where MCAK is localized to specific cellular structures, such as the centrosome or centromere. Based on the modeling results, we argue that MCAK's functional impact at microtubule ends derives both from its intrinsic end-binding capacity and its ability to strengthen the EB1-mediated end association pathway.

      (4) Finally, the study seriously lacks discussion of and comparison with the existing literature on this topic. There are major omissions in citing relevant literature, such as e.g. landmark study by Kinoshita et al. Science 2001. Several findings reported here directly contradict previous findings in the literature. Direct comparison with e.g. Gouveia et al findings, Helenius et al. findings, and others need to be included. For example, Gouveia et al reported that EB is necessary for MCAK plus-end-tracking in vitro (please see Figure 1 of their manuscript). The authors should discuss how they reconcile the differences in their findings when compared to this earlier study.

      We thank the reviewer for this helpful suggestion. In the revised manuscript, we have updated the text and included comparative discussions with other relevant studies in the Discussion section. Specifically, we added comparisons with the research on XMAP215 on page 14, lines 459-472 (Barr and Gergely, 2008; Kinoshita et al., 2001; Tournebize et al., 2000). Additionally, we have compared our findings with those of Gouveia et al. and Helenius et al. regarding MCAK's preference for binding microtubule ends on page 6, lines 145-157 and page 13, lines 408-441, respectively (Gouveia et al., 2010; Helenius et al., 2006).

      Additional specific comments:

      Figure 1

      Gouveia et al. (Figure 1) reported that MCAK does not autonomously preferentially localize to growing tips. Specifically, Gouveia et al. found equal association rates of MCAK to both the lattice and the tip in the presence of EB3delT, an EB3 construct that does not directly interact with MCAK. How can these findings be reconciled with the results presented here?

      We are uncertain why there was no observed difference in the on-rates to the lattice and the end in the study by Gouveia et al. Even when considering only the known affinity of MCAK for curved protofilaments at the distal tip of growing microtubules, we would still expect to observe an end-binding preference. After carefully comparing the experimental conditions, we nevertheless identified some differences. First, we used a 160 nm tip size to calculate the on-rate (k<sub>on</sub>), whereas Gouveia et al. used a 450 nm tip. Using a longer tip size would naturally lead to a smaller k<sub>on</sub> value. Note that we chose 160 nm for several reasons: (i) a previous cryo-electron tomography study has shown that the sheet structures of dynamic microtubule ends have an average length of around 180 nm (Guesdon et al., 2016); (ii) analysis of fluorescence signals at dynamic microtubule ends has demonstrated that the taper length at the microtubule end is less than 180 nm (Maurer et al., 2014); (iii) in the present study, we estimated that the length of MCAK's end-binding region is approximately 160 nm. Second, in Gouveia et al., single-molecule binding events were recorded in the presence of 75 nM EB3ΔT, which could potentially create a crowded environment at the tip, reducing MCAK binding. Third, as mentioned in our response to Reviewer 1, we took great care to minimize the interference from purification tags (e.g., His-tag) by ensuring their complete removal during protein preparation. Previous studies reported that retaining the His-tag of MAPs led to a significant increase in binding to microtubules (Maurer et al., 2011; Zhu et al., 2009). We believe that some of the factors mentioned above, or their combined effects, may account for the differences between these two observations.

      1C shows the decay of tubulin signal over several hundred nm - should show individual traces? How aligned? Doesn't this long decay suggest protruding protofilaments? (E.g. Odde/Gardner work).

      (1) In the revised manuscript, we now show individual traces (e.g. in Fig. 1B and Fig. 2A). The average trace for the tubulin signal with standard deviation is shown in Fig. 2C.

      (2) In every frame, the microtubule lattice was modeled as a Gaussian wall and its end as a half-Gaussian. The peak position of the half-Gaussian in each frame was used to align and average the microtubule end signals over the dwell time. The peak of the averaged half-Gaussian then served as the reference for measuring the intensity profile of each single-molecule binding event in every frame (see page 18, lines 607-624).

      (3) We think that the decay of the tubulin signal results from the convolution of the tapered end structure and the point spread function. In the revised manuscript, we have updated the figures to provide unprocessed original data in Fig. 1B and Fig. 2A.
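
      The alignment step described in point (2) can be sketched as follows. This is only a minimal illustration, not the analysis code used in the study: the function name `align_and_average` is hypothetical, the fitted end positions are assumed to be already available as integer pixel indices (the half-Gaussian fit itself and any sub-pixel interpolation are omitted), and positions outside a given profile are simply skipped.

```python
# Minimal sketch: align 1-D intensity profiles on a per-frame reference
# position (e.g. the fitted end peak) and average them.
# All names are hypothetical; end positions are assumed to be integer pixels.

def align_and_average(profiles, peak_indices, half_window):
    """Average profiles after shifting each so its peak sits at the centre.

    profiles     : list of lists of intensities, one list per frame
    peak_indices : list of integer end positions, one per frame
    half_window  : number of pixels kept on each side of the peak
    Returns a list of length 2*half_window + 1; entries with no data are None.
    """
    width = 2 * half_window + 1
    sums = [0.0] * width
    counts = [0] * width
    for profile, peak in zip(profiles, peak_indices):
        for offset in range(-half_window, half_window + 1):
            idx = peak + offset
            if 0 <= idx < len(profile):  # skip pixels outside this frame
                sums[offset + half_window] += profile[idx]
                counts[offset + half_window] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


if __name__ == "__main__":
    frames = [[0, 0, 1, 5, 1], [0, 1, 5, 1, 0]]
    print(align_and_average(frames, peak_indices=[3, 2], half_window=1))
```

      Averaging after alignment, rather than before, keeps frame-to-frame drift of the end position from broadening the averaged profile.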

      Please show absolute numbers of measurements in 1C (rather than normalized distribution only).

      In the revised manuscript, we have included the raw data for both tubulin and MCAK signals as part of the methods description. In Fig. 1, using normalized values allows for the simultaneous representation of microtubule and protein signals on a unified graph.

      How do the results in 1D-G compare with the previous literature? Particularly comparison of on-rates between this study and the Gouveia et al? Assuming 1 um = 1625 dimers, it appears that in the presence of EB3, the on-rate of MCAK to the tips reported in Gouveia et al. is an order of magnitude higher than reported here in the absence of EB3 (4.3 x 10E-4 vs. 2 x 10E-5). If so, and given the robust presence of EB proteins at growing microtubule ends in cells, this would invalidate the potential physiological relevance of the current study. Note that the dwell times measured in Gouveia et al. are also longer than those measured here.

      Note that in Gouveia et al., the concentration of mCherry-EB3 was 75 nM, about 187.5 times higher than that of MCAK (0.4 nM). Such a large excess of EB over MCAK does not necessarily hold in cells. Regarding the physiological relevance of the end-binding affinity of MCAK itself, please refer to our response to point No. 1 of Reviewer 1.
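
      The arithmetic in this exchange can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the lattice parameters (13 protofilaments, 8 nm axial spacing per tubulin dimer) are assumptions chosen because they reproduce the reviewer's figure of 1625 dimers per µm. The two on-rates are taken as quoted in the reviewer's comment, in whatever per-dimer units were used there.

```python
# Illustrative unit conversion for comparing on-rates; names are hypothetical.
PROTOFILAMENTS = 13      # assumed canonical 13-protofilament lattice
DIMER_LENGTH_NM = 8.0    # assumed axial spacing of tubulin dimers


def dimers_per_micron(protofilaments=PROTOFILAMENTS,
                      dimer_length_nm=DIMER_LENGTH_NM):
    """Number of tubulin dimers in 1 um of microtubule lattice."""
    return protofilaments * 1000.0 / dimer_length_nm


def per_micron_to_per_dimer(rate_per_micron):
    """Convert a rate expressed per um of lattice to a rate per dimer."""
    return rate_per_micron / dimers_per_micron()


def fold_difference(value_a, value_b):
    """How many times larger value_a is than value_b."""
    return value_a / value_b


if __name__ == "__main__":
    print(dimers_per_micron())              # dimers per um of lattice
    print(fold_difference(4.3e-4, 2e-5))    # on-rates quoted by the reviewer
    print(fold_difference(75.0, 0.4))       # EB3 (nM) over MCAK (nM)
```

      Under these assumptions, the lattice contains 1625 dimers per µm, the two quoted on-rates differ by roughly 20-fold, and the EB3 concentration exceeds that of MCAK by 187.5-fold.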

      Notably, Helenius et al reported a diffusion constant for MCAK of 0.38 um^2/s, which is more than an order of magnitude higher than reported here. The authors should comment on this!

      In the revised manuscript, we have provided an explanation for the difference in diffusion coefficient. Please see page 6, lines 142-157. In short, low-salt conditions facilitate rapid diffusion of MCAK.

      Figure 2:

      This figure is critical and really depends on the analysis of the tubulin signal. Note significant variability in tubulin signal between presented examples in 2A. Also, while 2C looks qualitatively similar, there appears to be significant variability over the several hundred nm from the tip along the lattice. This is the crucial region; statistical significance testing should be presented. More detailed info, including SDs etc. is necessary.

      In the revised manuscript, we have provided raw data in Fig. 1B and Fig. 2A. Additionally, we have provided statistical analysis of the tubulin signals (Fig. 2C) and performed a significance test. Please see page 5, lines 111-116 and page 7, lines 179-183 for detailed descriptions.

      Insights into the morphology of microtubule ends based on TIRF imaging have been previously gained in the literature, with reports of extended tip structures/protruding protofilaments (see e.g. Coombes et al. Cur Bio 2013, based on the methods of Demchouk et al. 2011). Such analysis should be performed here as well, if we are to conclude that nucleotide state alone, as opposed to the end morphology, specifies MCAK's tip localization.

      We appreciate the reviewer’s suggestion and agree that it provides a valid optical microscopy-based approach for estimating microtubule end morphology. However, this method did not establish a direct correlation between microtubule end morphology and tubulin nucleotide status. Therefore, we think that refining the measurement of microtubule end morphology will not necessarily improve our understanding of tubulin nucleotide status at MCAK binding sites. Based on the available data in the present study, there are two main pieces of evidence supporting the idea that MCAK can sense tubulin nucleotide status: (1) the binding regions of MCAK and EB overlap significantly, and (2) MCAK shows a clear preference for binding to GTPγS microtubules, similar to EB1 (we provide a new control to support this, Fig. s4). Of course, we do not consider this to be a perfect set of evidence. As the reviewer has pointed out here and in other suggestions, future work should aim to further distinguish the nucleotide status of tubulin in the dynamic versus non-dynamic regions at the ends of microtubules, and to investigate the structural basis by which MCAK recognizes tubulin nucleotide status.

      EB comet profile should be clearly reproduced. MCAK should follow the comet profile.

      Please see our third response to point 1 of this reviewer.

      The conclusion that the MCAK binding region is larger than XMAP215 is not firm, based on the data presented. The authors state that 'the binding region of MCAK was longer than that of XMAP215'. What is the exact width of the region of the XMAP215 localization and how much longer is the MCAK end-binding region? Is this statistically significant?

      We have revised this part of the manuscript (page 6, lines 167-172). The position probability distributions of MCAK and XMAP215 were significantly different (K-S test, p < 10<sup>-5</sup>), and the binding region of MCAK (FWHM = 185 nm) was significantly longer than that of XMAP215 (FWHM = 123 nm).
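
      For readers who want to relate these numbers, the sketch below shows how a FWHM maps to the standard deviation of a Gaussian and how the two-sample Kolmogorov-Smirnov statistic is computed from two sets of positions. It is an illustration under stated assumptions, not the analysis used in the study: the function names are hypothetical, the FWHM conversion assumes the distributions are Gaussian, and only the K-S statistic D is computed (the p-value is not).

```python
# Hypothetical helpers; assumes Gaussian-shaped position distributions.
import math
from bisect import bisect_right


def fwhm_to_sigma(fwhm):
    """Standard deviation of a Gaussian with the given full width at half maximum."""
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


def ks_two_sample_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D.

    D is the largest absolute difference between the two empirical
    cumulative distribution functions, evaluated at every data point.
    """
    a_sorted = sorted(sample_a)
    b_sorted = sorted(sample_b)
    d = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


if __name__ == "__main__":
    print(fwhm_to_sigma(185.0))   # sigma (nm) for the MCAK FWHM
    print(fwhm_to_sigma(123.0))   # sigma (nm) for the XMAP215 FWHM
    print(ks_two_sample_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

      Under the Gaussian assumption, the reported FWHM values correspond to standard deviations of roughly 79 nm for MCAK and 52 nm for XMAP215.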

      MCAK localization with AMPPNP should also be performed here. Even low concentrations of MCAK have been shown to induce microtubule catastrophe/end depolymerization. This will dramatically affect microtubule end morphology, and thus apparent positioning of MCAK at the end.

      In the end positioning experiment, we used a low concentration of MCAK (1 nM). Under this condition, microtubule dynamics remained unchanged, and the morphology of the microtubule ends was comparable across different conditions (with EB1, MCAK or XMAP215). Additionally, in the revised manuscript, we present a new experiment in which we recorded the localization of both MCAK and XMAP215 on the same microtubule. The results support the conclusion regarding their relative localization: most MCAK is found at the proximal end of the XMAP215 binding region, while approximately 15% of MCAK is located within the XMAP215 binding region. Please see Fig. 2D-E and page 7, lines 184-197 for the corresponding descriptions.

      Figure 3:

      For clearer presentation, projections showing two microtubule lattice types on the same image (in e.g. two different colors) should be shown first without MCAK, and then with MCAK.

      We thank the reviewer for this suggestion. We have adjusted the figure accordingly. Please see Fig. 4 in the revised manuscript.

      Please comment on absolute intensity values - scales seem to be incredibly variable.

      The fluorescence value presented here is the result of multiple images being summed. Therefore, the difference in absolute values is influenced not only by the binding affinity of MCAK in different states to microtubules, but also by the number of images used. In this analysis, we are not comparing MCAK in different states, but rather evaluating the binding ability of MCAK in the same state on different types of microtubules.

      Given that the authors conclude that MCAK binding mimics that of EB, EB intensity measurements and ratios on different lattice substrates should be performed as a positive control.

      We performed additional experiments with EB1. In the revised manuscript, we provide these data as a positive control (please see Fig. s4).

      Figure 4:

      MCAK-nucleotide dependence of GMPCPP microtubule-end binding has been previously established (see e.g. Helenius et al, others?) - what is new here? Need to discuss the literature. This would be more appropriate as a supplemental figure?

      In the present study, we reproduced the GMPCPP microtubule-end binding of MCAK in the AMPPNP state, as shown in several previous reports (Desai et al., 1999; Hertzer et al., 2006). Here, we also quantified the end to lattice binding preference, and our results showed that the nucleotide state-dependence shows the same trend as the binding preference of MCAK to the growing microtubule ends. Therefore, we prefer to keep this figure in the main text (Fig. 5).

      Figure 5:

      Please note that both MCAK mutants show an additional two orders of magnitude lower microtubule binding on-rates when compared to wt MCAK. This makes the analysis of preferential binding substrate for these mutants dubious.

      We agree with this point and have rewritten this part. Please see page 10, lines 295-327, in the revised manuscript.

      Figure 6:

      Combined effects of XMAP215 and XKCM1 (MCAK) have been previously explored in the landmark study by Kinoshita et al. Science 2001, which should be cited and discussed. Also note that Moriwaki et al. JCB 2016 explored the combined effects of XMAP215 and MCAK - which should be discussed here and compared to the current results.

      We agree with the reviewer. We have revised the discussion on this part. Please see page 11, lines 329-342 and page 14, lines 459-472 in the revised manuscript.

      Please report quantification for growth rate and lifetime.

      In the revised manuscript, we provide all these data. Please see pages 11-12, lines 343-374.

      To obtain any new quantitative information on the combined effects of the two proteins, at the very minimum, the authors should perform a titration in protein concentration.

      We agree with the reviewer on this point. In our pilot experiments, we performed titration experiments to determine the appropriate concentrations of MCAK and XMAP215, respectively. We selected 50 nM for XMAP215, as it clearly enhances the growth rate and exhibits a mild promoting effect on catastrophe, two key effects of XMAP215 reported in previous studies (Brouhard et al., 2008; Farmer et al., 2021). Reducing the XMAP215 concentration eliminates the catastrophe-promoting effect, while increasing it would not substantially enhance the growth rate further. For MCAK, we chose 20 nM, as it effectively promotes catastrophe; increasing the concentration beyond this point leads to no microtubule growth, at least in the MCAK-only condition. Without microtubule growth, it would be difficult to quantify the parameters of microtubule dynamics, hindering a clear comparison of the combined versus individual effects. Therefore, we think that the concentrations used in this study are appropriate and representative. In the revised manuscript, we make this point clearer (see page 11, lines 329-342).

      Finally, the writing could be improved for overall clarity.

      We thank the reviewer for pointing this out. In the revised manuscript, we conducted a thorough revision and review of the text.

      Reviewer #3 (Public Review):

      The authors revisit an old question of how MCAK goes to microtubule ends, partially answered by many groups over the years. The authors seem to have omitted the literature on MCAK in the past 10-15 years. The novelty is limited due to what has previously been done on the question. Previous work showed MCAK targets to microtubule plus-ends in cells through association with EB proteins and Kif18b (work from Wordeman, Medema, Walczak, Welburn, Akhmanova) but none of their work is cited.

      We thank the reviewer for the suggestion. Some of the referenced work has already been cited in our manuscript, such as studies on the interaction between MCAK and EB1. However, other relevant literature had not been properly cited. In the revised manuscript, we have added further discussion on this topic in the context of existing findings. Please refer to pages 3-4, lines 68-85, and page 13, lines 425-441.

      It is not obvious in the paper that these in vitro studies only reveal microtubule end targeting, rather than plus end targeting. MCAK diffuses on the lattice to both ends and its conformation and association with the lattice and ends has also been addressed by other groups-not cited here. I want to particularly highlight the work from Friel's lab where they identified a CDK phosphomimetic mutant close to helix4 which reduces the end preference of MCAK. This residue is very close to the one mutated in this study and is highly relevant because it is a site that is phosphorylated in vivo. This study and the mutant produced here suggest a charge-based recognition of the end of microtubules.

      Here the authors analyze this MCAK recognition of the lattice and microtubule ends, with different nucleotide states of MCAK and in the presence of different nucleotide states for the microtubule lattice. The main conclusion is that MCAK affinity for microtubules varies in the presence of different nucleotides (ATP and analogs) which was partially known already. How different nucleotide states of the microtubule lattice influence MCAK binding is novel. This information will be interesting to researchers working on the mechanism of motors and microtubules. However, there are some issues with some experiments. In the paper, the authors say they measure MCAK residency of growing end microtubules, but in the kymographs, the microtubules don't appear dynamic - in addition, in Figure 1A, MCAK is at microtubule ends and does not cause depolymerization. I would have expected to see depolymerization of the microtubule after MCAK targeting. The MCAK mutants are not well characterized. Do they still have ATPase activity? Are they folded? Can the authors also highlight T537 and discuss this?

      Finally, a few experiments are done with MCAK and XMAP215, after the authors say they have demonstrated the binding sites overlap. The data supporting this statement were not obvious and the conclusions that the effect of the two molecules are additive would argue against competing binding sites. Overall, while there are some interesting quantitative measurements of MCAK on microtubules - in particular in relation to the nucleotide state of the microtubule lattice - the insights into end-recognition are modest and do not address or discuss how it might happen in cells. Often the number of events is not recorded. Histograms with large SEM bars are presented, so it is hard to get a good idea of data distribution and robustness. Figures lack annotations. This compromises therefore their quantifications and conclusions. The discussion was hard to follow and needs streamlining, as well as putting their work in the context of what is known from other groups who produced work on this in the past few years.

      We thank the reviewer for the comments. Regarding the physiological relevance of the end-binding of MCAK itself, please refer to our response to the point No.1 of reviewer 1. Moreover, as we feel that other suggestions are more thoroughly expressed in the following comments for authors, we will provide the responses in the corresponding sections, as shown below.

      Reviewer #3 (Recommendations For The Authors):

      Why, on dynamic microtubules, is MCAK at microtubule plus ends and does not cause a catastrophe?

      At these concentrations (10 nM MCAK with 16 μM tubulin in Fig. 1; 1 nM MCAK with 12 μM tubulin in Fig. 2), MCAK has little effect on microtubule dynamics in our experiments. Using TIRFM, we were able to observe individual MCAK binding events. Based on these observations, we think that under the current experimental conditions, a single binding event of MCAK is insufficient to induce microtubule catastrophe; rather, catastrophe likely requires cumulative changes resulting from multiple binding events.

      Do the MCAK mutants still have ATPase activity?

      The ATPase activities of MCAK<sup>K525A</sup> and MCAK<sup>V298S</sup> are both reduced to about 1/3 of the wild-type (Fig. s6).

      The intensities of GFP are not all the same on the microtubule lattice (eg 1A). See blue and white arrowheads. The authors could be looking at multiple molecules of GFP-MCAK instead of single dimers. How do they account for this possibility?

      In the revised manuscript, we provide the gel filtration result of the purified MCAK, and the position of the peak corresponds to ~220 kDa, demonstrating that the purified MCAK in solution is dimeric (please see Fig.s1 and page 5, lines 101-103). We measured the fluorescence intensity of each binding event. A probability distribution of these intensities was then constructed and fitted with a Gaussian function. A binding event was considered to correspond to a single molecule if its intensity fell within μ±2σ of the distribution. The details of the single-molecule screening process are provided in the revised manuscript (see page 17, lines 574-583).

      In addition, we also measured the fluorescence intensity of both MCAK<sup>sN+M</sup> and MCAK. MCAK<sup>sN+M</sup> is a monomeric mutant that contains the neck domain and motor domain (Wang et al., 2012). The average intensity of MCAK<sup>sN+M</sup> is 196 A.U., about 65 % of that of MCAK (300 A.U.), suggesting that MCAK is a dimer (see Fig. s1). Moreover, we think that some of the dim signals may result from stochastic background noise, while others likely represent transient bindings of MCAK. The exposure time in our experiments was approximately 0.05 seconds; if the binding duration were shorter than this, the signal would be lower. It is important to note that in this study, we specifically selected binding events lasting at least 2 consecutive frames, meaning transient binding events were not included. This point has been clarified in the Methods section (see page 17, lines 568-569 and lines 574-583).
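
      The screening criteria described above (intensity within μ±2σ, and at least 2 consecutive frames) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function name and the `(intensity, n_frames)` event format are hypothetical, the sample mean and standard deviation stand in for the parameters of the fitted Gaussian, and the duration filter is assumed to be applied before the intensity distribution is built.

```python
# Hypothetical single-molecule screening sketch.
# Assumptions: events are (intensity, n_frames) tuples; sample mean/stdev
# approximate the fitted Gaussian; short events are removed first.
import statistics


def select_single_molecule_events(events, n_sigma=2.0, min_frames=2):
    """Keep events that last long enough and whose intensity is within mu +/- n_sigma*sigma."""
    # Step 1: discard transient events shorter than min_frames.
    long_enough = [(i, f) for i, f in events if f >= min_frames]
    if len(long_enough) < 2:
        return long_enough  # too few events to estimate a spread

    # Step 2: estimate the centre and width of the intensity distribution.
    intensities = [i for i, _ in long_enough]
    mu = statistics.mean(intensities)
    sigma = statistics.stdev(intensities)

    # Step 3: keep events inside the accepted intensity window.
    low, high = mu - n_sigma * sigma, mu + n_sigma * sigma
    return [(i, f) for i, f in long_enough if low <= i <= high]


if __name__ == "__main__":
    events = [(280, 3), (290, 2), (295, 4), (300, 2), (300, 5), (305, 2),
              (310, 3), (320, 2), (300, 2), (300, 6), (1500, 3), (300, 1)]
    kept = select_single_molecule_events(events)
    print(len(kept), "of", len(events), "events kept")
```

      A window of μ±2σ retains about 95% of events from a single Gaussian population while rejecting bright outliers such as aggregates; a robust estimator (for example, median and MAD) would be less sensitive to those outliers than the sample moments used here.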

      Could the authors provide a kymograph of an MT growing, in the presence of MCAK+AMPPNP? Can MCAK track the cap?

      Under single-molecule conditions, we observed a single MCAK molecule briefly binding to the end of the microtubule. However, we did not record if MCAK at high concentrations could track microtubule ends under AMPPNP conditions.

      In the experiments in Figure 6, the authors should also show the localization of MCAK and XMAP215 at microtubule plus ends in their kymographs to show the two molecules overlap.

      Regarding the relative localization of XMAP215 and MCAK, we conducted additional experiments to record their colocalization simultaneously at the same microtubule end. Our results show that MCAK predominantly binds behind XMAP215, with 14.5% of MCAK binding within the XMAP215 binding region. Please see Fig. 2.D-E and page 7, lines 184-197 in the revised manuscript. However, we argue that the effects of XMAP215 and MCAK are additive, and their binding sites do not necessarily need to overlap for these effects to occur.

      The authors do not report what statistical tests are done in their graphs, and one concern is over error propagation of their data. Instead of bar graphs, showing the data points would be helpful.

      We have now shown all data points in the revised manuscript.

      MCAK+AMPPNP accumulates at microtubule ends. Appropriate quotes from previous work should be provided.

      We have made the revisions accordingly. Please see page 9, lines 273-276.

      Controls are missing. An SEC profile for all purified proteins should be presented. Also, the authors need to explain if they report the dimeric or monomeric concentration of MCAK, XMAP215, etc...

      We have provided the gel filtration results for all purified proteins in the revised manuscript (Fig. s1). Moreover, we now make it clear that the concentrations of MCAK and EB1 are monomeric concentrations. Please see the legend for Fig. 1, line 893 in the revised manuscript.

      Figure 1: the microtubules don't look dynamic at all. This is also why the authors can record MCAK at microtubule ends, because their structure is not changing.

      The microtubules are dynamic, but they may appear non-dynamic due to the relatively slow growth rate and the high frame rate at which we are recording. We propose that individual binding events of MCAK induce structural changes at the nanoscopic or molecular scale, which are not detectable using TIRFM.

      I recommend the authors measure the Kon and Koff for single GFP-MCAK mutant molecules and provide the information alongside their normalized and averaged binding intensities of GFP-MCAK in Fig 5. Showing data points instead of bar graphs would be better.

      (1) We measured k<sub>on</sub> and dwell time for the mutants at growing microtubule ends. However, we did not perform single-molecule tracking of MCAK’s binding on stabilized microtubules. This is mainly because the superimposed signal on the stable microtubule already indicates the changes in the mutants' binding affinity for different microtubule structures; moreover, the binding of the mutants is highly transient, making accurate single-molecule tracking and calculations difficult.

      (2) In the revised figure, we have included the data points in all plots.

      When discussing how Kinesin-13 interacts with the lattice, the authors should quote the papers that report the organization of full-length Kinesin-13 on tubulin heterodimers: Trofimova et al, 2018; McHugh et al 2019; Benoit et al, 2018. It would reinforce their model and account for the full-length protein, rather than just the motor domain.

      We thank the reviewer for the suggestion. In our manuscript, we have cited papers on full-length Kinesin-13 to discuss the interaction between MCAK and the curved structure at microtubule ends. Additionally, we have utilized the MCAK-tubulin crystal structure (PDB ID: 5MIO) in Fig. 6, as it depicts human MCAK, which is consistent with the protein used in our study. This structure illustrates the interaction sites between MCAK and the tubulin dimer, guiding our mutation studies on specific residues. Thus, we prefer to use this structure (PDB ID: 5MIO) in Fig. 6.

      Figure 5A. What type of model is this? A PDB code is mentioned. Is this from an X-ray structure? If so, mention it.

      We have now included the structural information in the figure legend (see page 37, line 1045).

      Figure 5B. It is not possible to distinguish the different microtubule lattices (GTPyS, GDP, and GMPCPP). The experiment needs to be better labelled.

      We thank the reviewer for this comment. We have now rearranged the figure for better clarity (see Fig. 6).

      Figure 5D: what are the statistical tests? I don't understand "The statistical comparisons were made versus the corresponding value of 848 GFP-MCAK".

      We have made this point clearer in the revised manuscript (see page 38, lines 1078-1080).

      What is the "EB cap"? This needs explaining.

      We now provide this explanation; please see page 4, lines 87-89 in the revised manuscript.

      Work from Friel and co-workers showed MCAK T537E did not have depolymerizing activity and a reduced affinity for microtubule ends. The work of the authors should be discussed with respect to this previously published work.

      We thank the reviewer for this suggestion. In the revised manuscript, we have added discussions on this (see page 10, lines 303-307).

      The concentration of protein used in the assays is not always described.

      We have checked throughout the manuscript and made revisions accordingly.

      "Having revealed the novel binding sites of MCAK in dynamic microtubule ends " should be on "we wondered how MCAK may work ..with EB1". This is not addressed so should be removed. Instead, they can quote the work from Akhmanova's lab. Realistically this section should be rephrased as there are other plus-end targeting molecules that compete with MCAK, not just XMAP215 and EB1.

      We have rephrased this section as suggested by this reviewer to be more specific. Please see page 11, lines 329-342.

      What is AMPCPP?

      It should be “AMPPNP”.

      Typos in Figure 5.

      Corrected

    1. Reviewer #2 (Public Review):

      Summary:

      This paper describes a new approach to detecting directed causal interactions between two genes without directly perturbing either gene. To check whether gene X influences gene Z, a reporter gene (Y) is engineered into the cell in such a way that (1) Y is under the same transcriptional control as X, and (2) Y does not influence Z. Then, under the null hypothesis that X does not affect Z, the authors derive an equation that describes the relationship between the covariance of X and Z and the covariance of Y and Z. Violation of this relationship can then be used to detect causality.

      The authors benchmark their approach experimentally in several synthetic circuits. In 4 positive control circuits, X is a TetR-YFP fusion protein that represses Z, which is an RFP reporter. The proposed approach detected the repression interaction in 2 of the 4 positive control circuits. The authors constructed 16 negative control circuit designs in which X was again TetR-YFP, but where Z was either a constitutively expressed reporter, or simply the cellular growth rate. The proposed method detected a causal effect in two of the 16 negative controls, which the authors argue is perhaps not a false positive, but due to an unexpected causal effect. Overall, the data support the potential value of the proposed approach.

      Strengths:

      The idea of a "no-causality control" in the context of detecting directed gene interactions is a valuable conceptual advance that could potentially see play in a variety of settings where perturbation-based causality detection experiments are made difficult by practical considerations.

      By proving their mathematical result in the context of a continuous-time Markov chain, the authors use a more realistic model of the cell than, for instance, a set of deterministic ordinary differential equations.

      The authors have improved the clarity and completeness of their proof compared to a previous version of the manuscript.

      Limitations:

      The authors themselves clearly outline the primary limitations of the study: The experimental benchmark is a proof of principle, and limited to synthetic circuits involving a handful of genes expressed on plasmids in E. coli. As acknowledged in the Discussion, negative controls were chosen based on the absence of known interactions, rather than perturbation experiments. Further work is needed to establish that this technique applies to other organisms and to biological networks involving a wider variety of genes and cellular functions. It seems to me that this paper's objective is not to delineate the technique's practical domain of validity, but rather to motivate this future work, and I think it succeeds in that.

      Might your new "Proposed additional tests" subsection be better housed under Discussion rather than Results?

      I may have missed this, but it doesn't look like you ran simulation benchmarks of your bootstrap-based test for checking whether the normalized covariances are equal. It would be useful to see in simulations how the true and false positive rates of that test vary with the usual suspects like sample size and noise strengths.

      It looks like you estimated the uncertainty for eta_xz and eta_yz separately. Can you get the joint distribution? If you can do that, my intuition is you might be able to improve the power of the test (and maybe detect positive control #3?). For instance, if you can get your bootstraps for eta_xz and eta_yz together, could you just use a paired t-test to check for equality of means?
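      The paired-bootstrap idea in the comment above could be sketched roughly as follows. This is an illustration only, not the authors' test: it assumes the normalized covariance is defined as eta_ab = Cov(a, b) / (mean(a) * mean(b)), and the function names and synthetic data are hypothetical. The key point is that each bootstrap replicate resamples the same cells for both estimates, so the distribution of the difference eta_xz - eta_yz is obtained directly.

```python
import random


def mean(values):
    return sum(values) / len(values)


def normalized_cov(a, b):
    """Sample covariance of a and b divided by the product of their means.

    Assumed definition of the normalized covariance (eta); not taken from
    the manuscript under review.
    """
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)
    return cov / (ma * mb)


def paired_bootstrap_diff(x, y, z, n_boot=1000, seed=0):
    """Bootstrap distribution of eta_xz - eta_yz.

    Each replicate draws one set of cell indices and reuses it for both
    terms, so the two estimates stay paired.
    """
    rng = random.Random(seed)
    n = len(x)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        zs = [z[i] for i in idx]
        diffs.append(normalized_cov(xs, zs) - normalized_cov(ys, zs))
    diffs.sort()
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[int(0.975 * n_boot) - 1]
    return diffs, (lo, hi)


# Hypothetical synthetic data: x and y share an upstream drive, z is independent.
_rng = random.Random(1)
shared = [_rng.uniform(50, 150) for _ in range(300)]
x = [s + _rng.gauss(0, 5) for s in shared]
y = [s + _rng.gauss(0, 5) for s in shared]
z = [_rng.uniform(80, 120) for _ in range(300)]

diffs, ci = paired_bootstrap_diff(x, y, z)
print("95% interval for eta_xz - eta_yz:", ci)
```

      Under the no-causality null, one would expect this interval to cover zero; whether the pairing actually improves power would still need the simulation benchmarks requested above.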

      The proof is a lot better, and it's great that you nailed down the requirement on the decay of beta, but the proof is still confusing in some places:

      On pg 29, it says "That is, dividing the right equation in Eq. 5.8 with alpha, we write the ..." but the next equation doesn't obviously have anything to do with Eq. 5.8, and instead (I think) it comes from Eq 5.5. This could be clarified.

      Later on page 29, you write "We now evoke the requirement that the averages xt and yt are stationary", but then you just repeat Eq. 5.11 and set it to zero. Clearly you needed the limit condition to set Eq. 5.11 to zero, but it's not clear what you're using stationarity for. I mean, if you needed stationarity for 5.11 presumably you would have referenced it at that step.

      It could be helpful for readers if you could spell out the practical implications of the theorem's assumptions (other than the no-causality requirement) by discussing examples of setups where it would or wouldn't hold.

    1. Reviewer #2 (Public review):

      This is a revised version of a paper I reviewed previously.

      Again, the purpose of the paper is to suggest that common metrics, such as friction or any given physical property of the surface, are probably inadequate to predict the perception of the surface or its discriminability. Instead, the authors propose a very interesting and original idea that, instead, frictional instabilities are related to fine touch perception (title).

      Overall, the authors have put much effort into improving the manuscript, enhancing clarity, and avoiding overstatements. And I feel the narrative is indeed much improved and less ambiguous.

      However, the authors have systematically avoided addressing the main comment of all reviewers: the link made between the mock finger passive experiment and the active human psychophysics is incorrect and should not be done, because its interpretation could be flawed.

      - First, this link is very weak (the correlation of 6 datapoints is barely significant).
      - Second, the real and mock fingers have very different properties (think about moisture, compliance, roughness,...).
      - Third, the comparison is made between a passive and well-controlled experiment and an active exploration. Yet, the comparison metrics (number of events) are clearly dependent on exploration procedures.

      In your response to my comments:

      "We have made changes throughout the manuscript to acknowledge that our findings are correlative, clarifying this throughout, and incorporating into the discussion how our work may enable biomechanical measurements and tactile decision making models"

      The authors admit that the analysis is flawed, yet they did not remove it. If they cannot demonstrate that the mock finger and the human finger behave the same way during the perceptual experiment, then they should remove Fig2 that combines apples and oranges. OR, they should look at the active exploration data and compute the same metrics on that data.

      "This "weird choice" is the central innovation of this paper. This choice was necessary because we demonstrated that the common usage of friction coefficient is fundamentally flawed: we see that friction coefficient suggests that surface which are more different would feel more similar - indeed the most distinctive surfaces would be two surfaces that are identical, which is clearly spurious. "

      They did not "demonstrate" such a flaw. Again, the difference in friction is between the mock finger trials. At the very least, the authors should verify that it is true of the active human experiment.

      "To fully implement this, a decision-making model is necessary because, as a counter example, a participant could have generated 10 swipes of SFW and 1 swipe of a Sp, but the Sp may have been the most important event for making a tactile decision. This type of scenario is not compatible with the analysis suggested - and similar counterpoints can be made for other types of seemingly straightforward analysis."

      The suggested analyses are straightforward and would be much more valuable than the data from the mock finger, even with the potential variability stated above.

      "We recognize that, with all factors being equal, this sample size is on the smaller end"

      Yet, the authors did not collect additional data to confirm their findings.

    2. Author response:

      The following is the authors’ response to the original reviews

      eLife Assessment

      This useful study integrates experimental methods from materials science with psychophysical methods to investigate how frictional stabilities influence tactile surface discrimination. The authors argue that force fluctuations arising from transitions between frictional sliding conditions facilitate the discrimination of surfaces with similar friction coefficients. However, the reliance on friction data obtained from an artificial finger, together with the ambiguous correlative analyses relating these measurements to human psychophysics, renders the findings incomplete.

      Our main goal with this paper was to show that the most common metric, i.e., the average friction coefficient, widely used in tactile perception and device design, is fundamentally unsound, and to offer a secondary parameter that is compatible with the fact that human motion is unconstrained, leading to dynamic interfacial mechanics.

      We understand the Reviewers wanted us to demonstrate, through biomechanical measurements, that humans use instabilities. This is seemingly reasonable, but in individual responses, we explain the significant challenges and fundamental unknowns in those experiments. We believe this paper sets forth an important step to approach this problem. At the same time, we have made several changes in the discussion, conclusion, and title to clarify that our study is correlative between mechanical characterization and human testing.

      In short, there are still several fundamental unknowns that prevented us from basing the study around biomechanical measurements: (1) A decision-making model would need to be created, but it is unknown if tactile decision making follows other models. (2) It is further unknown what constitutes “tactile evidence”, though at our manuscript’s conclusion, we propose that friction instabilities are better suited to be tactile evidence than the averaging of friction coefficients from a narrow range of human exploration. (3) In the design of samples, from a friction mechanics and materials perspective, it is not, at this point, possible to pre-program surfaces a priori to deliver friction instabilities; these must instead be determined experimentally, especially when attempting to achieve this in controlled surfaces that do not create other overriding tactile cues, like macroscopic bumps or large differences in surface roughness. (4) Given that the basis for tactile percepts, like which object feels “rougher” or “smoother”, is not sufficiently established, it is necessary to use a 3-alternative forced choice task, which avoids asking about objects along a preset perceptual dimension (a challenge recognized by Reviewer 3). However, this would bring in issues of memory in the decision-making model. (5) The prior points are compounded by the fact that, we believe, tactile exploration must be performed in an unconstrained manner, i.e., without an apparatus generating motion onto a stationary finger. Work by Liu et al. (IEEE ToH, 2024) showed that recreating friction obtained during free exploration onto a stationary finger was uninterpretable by the participants, hinting at the importance of efference copies.[1] We believe that resolving many of the above-mentioned issues would itself constitute a significant advance in knowledge and would require discussion and dissemination with the community.

      Our changes to the manuscript

      Page 1 & SI Page 1, Title

      “Alternatives to Friction Coefficient: Fine Touch Perception Correlates with Frictional Instabilities”

      Reviewer 1 (Public review):

      Summary:

      In this paper, Derkaloustian et al. look at the important topic of what affects fine touch perception. The observations that there may be some level of correlation with instabilities are intriguing. They attempted to characterize different materials by counting the frequency (occurrence #, not of vibration) of instabilities at various speeds and forces of a PDMS slab pulled lengthwise over the material. They then had humans make the same vertical motion to discriminate between these samples. They correlated the % correct in discrimination with differences in frequency of steady sliding over the design space as well as other traditional parameters such as friction coefficient and roughness. The authors pose an interesting hypothesis and make an interesting observation about the occurrences of instability regimes in different materials while in contact with PDMS, which is interesting for the community to see in the publication. It should be noted that the finger is complex, however, and there are many factors that may be quite oversimplified with the use of the PDMS finger, and the consideration and discounting of other parameters are not fully discussed in the main text or SI. Most importantly, however, the conclusions as stated do not align with the primary summary of the data in Figure 2.

      Strengths:

      The strength of this paper is in its intriguing hypothesis and important observation that instabilities may contribute to what humans are detecting as differences in these apparently similar samples.

      We thank Reviewer 1 for their time on the manuscript, recognizing the approach we took, and offering constructive feedback. We believe that our conclusions, in fact, are supported by the primary summary of the data in Fig. 2 but we believe that our use of R<sup>2</sup> could have led to misinterpretation. The trend with friction coefficient and percent correct was indeed statistically significant but was spurious because the slope was negative. In the revision, we add clarifying comments throughout, change from R<sup>2</sup> to r as to highlight the negative trend, and adjust the figures to better focus on friction coefficient.

      Finally, we added a new section to discuss the tradeoffs between using a real human finger versus a mock finger, and which situations may warrant the use of one or the other. In short, for our goal of characterizing surfaces to be used in tactile experiments, we believe a mock finger is more sustainable and practical than using real humans because human fingers are unique per participant, humans move their fingers at constantly changing pressures and velocities, and friction generated during free human exploration cannot be satisfactorily replicated by moving a sample onto a stationary finger. However, we do not disagree that for other types of experiments, characterizing a human participant directly may be more advantageous.

      Weaknesses:

      Comment 1

      The most important weakness is that the findings do not support the statements of findings made in the abstract. Of specific note in this regard is the primary correlation in Figure 2B between SS (steady sliding) and percent correct discrimination. While the statistical test shows significance (and is interesting!), the R-squared value is 0.38, while the "Friction Coefficient vs. Percent Correct" plot has an R-squared of 0.6 and a p-value of < 0.01 (including Figure 2B). This suggests that the results do not support the claim in the abstract: "We found that participant accuracy in tactile discrimination was most strongly correlated with formations of steady sliding, and response times were negatively correlated with stiction spikes. Conversely, traditional metrics like surface roughness or average friction coefficient did not predict tactile discriminability."

      We disagree that the trend with friction coefficient suggests the results do not support the claim because the correlation was found to be negative. However, we could have made the comparison more apparent and expanded on this point, given its novelty.

      While the R<sup>2</sup> value corresponding to the “Friction Coefficient vs. Percent Correct” plot is notably higher, our results show that the slope is negative, which would be statistically spurious. This is because a negative correlation between percent correct (accuracy in discriminating surfaces) and difference in friction coefficient means that the more similar two surfaces are (by friction coefficient), the easier it would be for people to tell them apart. That is, it incorrectly concludes that two identical surfaces would be much easier to tell apart than two surfaces with greatly different friction coefficients.

      This is counterintuitive to nearly all existing results, but we believe our samples were well-positioned to uncover this trend because they minimize variability by controlling multiple physical parameters, and because the friction coefficient (typically calculated in the field as an average friction coefficient) ignores all the dynamic changes in forces present in elastic systems undergoing mesoscale friction, i.e., human touch, as seen in Fig. 1 in a mock finger and Fig. 3 in a real finger. By demonstrating this statistically spurious trend, we believe this strongly supports our premise that an alternative to friction coefficient is needed in the design of tactile psychophysics and haptic interfaces.

      We believe that this could have been misinterpreted, so we took several steps to improve clarity, given the importance of this finding: we separated the panel on friction coefficient to its own panel, we changed from R<sup>2</sup> to r throughout, and we added clarifying text. We also added a small section focusing on this spurious trend.
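      The reason for reporting r rather than R<sup>2</sup> can be illustrated with a minimal sketch. This is a plain Pearson correlation on made-up numbers, not the GLMM fit or the data from the manuscript, and the variable names are hypothetical. Two datasets with opposite trends give the same R<sup>2</sup>, so only the signed r reveals that one of the trends is negative.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Made-up values for illustration: a predictor difference vs. an accuracy-like score.
predictor_diff = [1.0, 2.0, 3.0, 4.0]
score_rising = [2.0, 4.0, 5.0, 8.0]    # score increases with the difference
score_falling = [8.0, 5.0, 4.0, 2.0]   # same values, opposite trend

r_up = pearson_r(predictor_diff, score_rising)
r_down = pearson_r(predictor_diff, score_falling)

# Squaring discards the sign, so both trends report the same R^2.
print(f"rising:  r = {r_up:+.3f}, R^2 = {r_up ** 2:.3f}")
print(f"falling: r = {r_down:+.3f}, R^2 = {r_down ** 2:.3f}")
```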

      Our changes to the manuscript

      Page 1, Abstract

      “In fact, the typical method of averaging friction coefficients led to a spurious correlation which erroneously suggests that distinct objects should feel identical and identical objects should feel distinct.”

      Page 7

      “As Fig. 1 was constructed from friction measurements, we can also calculate an average friction coefficient, µ, by averaging the friction coefficient obtained at each of the 16 combinations of masses and velocities (Table 1). This calculation is a standard approach in tactile studies for summarizing friction measurements, or in some cases, surfaces are never characterized at multiple masses and velocities. However, summarizing friction data in this manner has been considered as conceptually questionable by others from a mechanics perspective.[3] Fig. 1 shows that the type of instabilities and friction forces encountered on a single surface can vary widely depending on the conditions. As a result, large variations in the friction coefficient are expected, depending on the mass and velocity — even though measurements originate from the same surface. This variability in friction coefficient can be seen with the large interquartile range of friction coefficients, which shows that the variation in friction coefficient across a single surface is similar, or even larger, than the differences in average friction coefficient across two different surfaces. The observation that friction coefficients vary so widely on a single surface calls into question the approach of analyzing how humans may perceive two different objects based on their average friction coefficients.”
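      The summary described in the passage above (one average friction coefficient per surface, plus the interquartile range across conditions) could be computed along these lines. The 16 values are invented for illustration and are not the measurements in Table 1; the variable names are likewise hypothetical.

```python
import statistics

# Hypothetical friction coefficients for ONE surface, one value per
# mass/velocity condition (16 conditions, as in the quoted text).
mu_by_condition = [
    0.8, 1.1, 1.6, 2.3,
    0.9, 1.3, 1.9, 2.6,
    1.0, 1.4, 2.1, 2.8,
    1.2, 1.7, 2.4, 3.0,
]

# The single number usually reported for the surface.
mu_avg = statistics.mean(mu_by_condition)

# Quartile cut points; the interquartile range is Q3 - Q1.
q1, _median, q3 = statistics.quantiles(mu_by_condition, n=4)
iqr = q3 - q1

print(f"average mu = {mu_avg:.2f}, IQR across conditions = {iqr:.2f}")
```

      Comparing the IQR of a single surface with the difference in averages between two surfaces is one way to see the concern raised in the passage: the within-surface spread can rival or exceed the between-surface difference.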

      Page 9, Fig. 2 Caption

      “D) GLMM of accuracy vs. difference in average friction coefficient, showing a negative correlation. E) GLMMs of accuracy vs. other commonly used material properties or parameters: ΔAverage roughness R<sub>a</sub>, ΔHurst exponent H, and ΔWater contact angle hysteresis (º) (N = 10 participants, n = 600 total trials).”

      Page 9

      “Considering all instabilities individually, we found that only steady sliding was a positive, statistically significant predictor (r = 0.62, p < 0.05, shown in Fig. 2B).”

      Page 10

      “To compare the value of looking at frictional instabilities, we also performed GLMM fits on common approaches in the field, like a friction coefficient or material property typically used in tactile discrimination, shown in Fig. 2D-E. Interestingly, in Fig. 2D, we observed a spurious, negative correlation between friction coefficient (typically and often problematically simplified as an average across all tested conditions) and accuracy (r = -0.64, p < 0.01); that is, the more different the surfaces are by friction coefficient, the less people can tell them apart. This spurious correlation would be the opposite of intuition, and further calls into question the common practice of using friction coefficients in touch-related studies. Interestingly, this spurious correlation was also found by Gueorguiev et al.[21] The alternative, two-term model which includes adhesive contact area for friction coefficient[32] was even less predictive (see Fig. S6A of SI). We believe such a correlation could not have been uncovered previously as our samples are minimal in their physical variations. Yet, the dynamic changes in force even within a single sample are not considered, despite being a key feature of mesoscale friction during human touch.

      We investigate different material properties in Fig. 2E. Differences in average roughness R<sub>a</sub> (or other parameters, like root mean square roughness R<sub>rms</sub>; Fig. S6A of SI) did not show a statistically significant correlation to accuracy. Though roughness is a popular parameter, correlating any roughness parameter to human performance here could be moot: the limit of detecting roughness differences has previously been defined as 13 nm on structured surfaces[36] and much higher for randomly rough surfaces,[49] all of which are magnitudes larger than the roughness differences between our surfaces. The differences in contact angle hysteresis – as an approximation of the adhesion contributions[50] – do not present any statistically significant effects on performance.”

      Page 11-12

      “Despite the correlative nature of this study, we still obtained high correlations compared to existing biomechanical studies[4,19,21], which we speculate is because instabilities are an important predictive phenomenon for models of human touch. We believe that biomechanical studies, including more sophisticated techniques, like spatially resolved force maps from digital image correlation,[5,42] may yield stronger correlations and results if they analyze data based on instabilities.”

      Added References

      (2) Khamis, H. et al. Friction sensing mechanisms for perception and motor control: passive touch without sliding may not provide perceivable frictional information. J. Neurophysiol. 125, 809–823 (2021).

      (6) Olczak, D., Sukumar, V. & Pruszynski, J. A. Edge orientation perception during active touch. J. Neurophysiol. 120, 2423–2429 (2018).

      Comment 2, Part 1

      Along the same lines, other parameters that were considered such as the "Percent Correct vs. Difference in Sp" and "Percent Correct vs. Difference in SFW" were not plotted for consideration in the SI. It would be helpful to compare these results with the other three metrics in order to fully understand the relationships.

      We have added these plots to the SI. We note that we had checked these relationships and discussed them briefly, but did not include the plot. The plots show that the type of instability was not as helpful as its presence or absence.

      Our changes to the manuscript

      Page 9

      “Furthermore, a model accounting for slow frictional waves alone specifically shows a significant, negative effect on performance (p < 0.01, Fig. S5 of SI), suggesting that in these samples and task, the type of instability was not as important.”

      “Fig. S5. GLMM fits of participant accuracy vs. the differences in instability incidence for individual instability types. Left: accuracy vs. differences in formation of slow frictional waves (SFW) between pairs. P1 and P5 have the same x-axis value and are shifted for clarity. Right: accuracy vs. differences in formation of stiction spikes (Sp).”

      SI Page 4

      “and no correlation between accuracy and stiction spikes (Fig. S5).”

      Comment 2, Part 2

      Other parameters such as stiction magnitude and differences in friction coefficient over the test space could also be important and interesting.

      We agree these are interesting and have thought about them. We are aware that others, like Gueorguiev et al., have studied stiction magnitudes, and though there was a correlation, the physical differences in surface roughness (glass versus PMMA) investigated made it unclear if these could be generalized further.[3] We are unsure how to proceed here with a satisfactory analysis of stiction magnitude, given that stiction spikes are not always generated. In fact, Fig. 1 shows that for many velocities and pressures, stiction spikes are not formed. In ongoing work, however, we are always cognizant that if stiction spikes are a dominant factor, then a secondary analysis on their magnitude would be important. We offer some speculation on why stiction spikes may be overrepresented in the literature:

      (1) They are prone to being created if the finger was loaded for a long time onto a surface prior to movement, thus creating adhesion by contact aging which is unlike active human exploration. We avoid this by discarding the first pull in our measurements, which is a standard practice in mechanical characterization if contact aging needs to be avoided.

      (2) The ranges of velocities and pressures explored by others were small.

      (3) In an effort to generate strong tactile stimuli, highly adhesive or rough surfaces are used.

      (4) Stiction spikes are visually distinctive on a plot, but we are unaware of any mechanistic reason that mechanoreceptors would be particularly sensitive to this low frequency event over other signals.

      We interpret “difference in friction coefficient over the test space” to be, for a single surface, like C4, to find the highest average friction for a condition of single velocity and mass and subtract that from the lowest average friction for a condition of single velocity and mass. We calculated the difference in friction coefficient in the typical manner of the field, by averaging all data collected at all velocities and masses and assigning a single value for all of a surface, like C4. We had performed this, and have the data, but we are wary of overinterpreting secondary and tertiary metrics because they do not have any fundamental basis in traditional tribology, and this value, if used by humans, would suggest that they rapidly explore a large parameter space to find a “maximum” and “minimum” friction. Furthermore, the range in friction across the test space, after averaging, can be smaller than the range of friction experienced at different masses and velocities on a single surface. We have tabulated and newly included these values (the interquartile range of friction coefficients of different masses and velocities per surface) in Table 1.

      Fig. 2D shows a GLMM fit between percent correct responses across our pairs and the differences in friction coefficient for each pair, where we see a spurious negative correlation. As we had the data of all average friction coefficients for each condition for a given material, we also looked at the difference in maximum and minimum friction coefficients. For our tested pairs, these differences also lined up on a statistically significant, negative GLMM fit (r = -0.86, p < 0.005). However, the values for a given surface can vary drastically, with an interquartile range of 1.20 to 2.09 on a single surface. We fit participant accuracy to the differences in these IQRs across pairs. This also led to a negative GLMM fit (r = -0.65, p < 0.05). However, we are hesitant to add this plot to the manuscript for the reasons stated previously.

      Comment 3, Part 1

      Beyond this fundamental concern, there is a weakness in the representativeness of the PDMS finger, the vertical motion, and the speed of sliding to real human exploration.

      Overall, this is a continuous debate that we think offers two solutions, and we are not advocating for an “either-or” case. There is always a tradeoff between using a synthetic model of a finger versus a real human finger, and there is a place for both models. That is, while our mock finger will be “better” the more similar it is to a human finger, it is not our goal to fully replace a human finger. Rather our goal is to provide a consistent method of characterizing surfaces that is sufficiently similar to human touch as to be a useful and predictive tool.

      The usefulness of the mock finger is in isolating the features of each surface that are independent of human variability, i.e., instabilities that form without changing loading conditions between sliding motions or even within one sliding motion. Of course, with this method, we still require confirmation that these features form during human exploration, which we show in Fig. 3. We believe that this method of characterizing surfaces at the mesoscale will ultimately lead to more successful human studies on tactile perception. Currently, and as shown in the paper, characterizing surfaces through traditional techniques, such as a commercial tribometer (friction coefficient, using a steel or hard metal ball), roughness (via atomic force microscopy or some other metrology), or surface energy, is less predictive or not predictive at all. Thus, we believe this mock finger is better than the current state-of-the-art methods for characterizing surfaces (we are also aware of a commercial mock finger company, but we were unable to purchase or obtain an evaluation model).

      One of the main – and severe – limitations of using a human finger is that all fingers are different, meaning any study focusing on a particular user may not apply to others or be recreated easily by other researchers. We do not think it is feasible to set a standard for replication around a real human finger as that participant may no longer be available, or willing to travel the world as a “standard”. Furthermore, the method in which a person changes their pressures and velocities is different. We note that this is a challenge unique to touch perception – how an object is touched changes the friction generated, and thus the tactile stimulus generated, whereas a standardized stimulus is more straightforward for light or sound.

      However, we do emphasize that we have strongly considered the balance between feasibility and ecological validity in the design of a mock finger. We have a mock finger, with the three components of stiffness of a human finger (more below). Furthermore, we have also successfully used this mock finger in correlations with human psychophysics in previous work, where findings from our mechanical experiments were more predictive of human performance[4–7] than other available methods.

      Our changes to the manuscript

      Added (Page 2-3)

      “Mock finger as a characterization tool

      We use a mechanical setup with a PDMS (poly(dimethylsiloxane)) mock finger to derive tactile predictors as opposed to direct biomechanical measurements on human participants. While there is a tradeoff in selecting a synthetic finger over a real human finger to model human touch, human fingers themselves are also highly variable[23] both in their physical shape and their use during human motion. Our goal is to design a consistent method of characterization of samples that can be easily accessed by other researchers and does not rely on a standard established around a single human participant. We believe that sufficient replication of surface, bulk properties, and contact geometry results in characterization that isolates consistent features of surfaces that are not derived from human-to-human variability. We have used this approach to successfully correlate human results with mock finger characterization previously.[8,9,24]

      The major component of a human finger, by volume, is soft tissue (~56%),[25] resulting in an effective modulus close to 100 kPa.[26,27] In order to achieve this same softness, we crosslink PDMS in a 1×1×5 cm mold at a 30:1 elastomer:crosslinker ratio. In addition, two more features in the human finger impart significant mechanical differences. Human fingers have a bone at the fingertip, the distal phalanx,[26–28, 8–10] which we mimic with an acrylic “bone” within our PDMS network. The stratum corneum, the stiffer, glassier outer layer of skin,[29] is replicated with the surface of the mock finger glassified, or further crosslinked, after 8 hours of UV-Ozone treatment.[30] This treatment also modifies the surface properties of the native PDMS to align with those of a human finger more closely: it minimizes the viscoelastic tack at the surface, resulting in a comparable non-sticky surface. Stabilizing one day after treatment, the mock finger surface obtains a moderate hydrophilicity (~60º), as is typically observed for a real finger.[11,31]

      The initial contact area formed before a friction trace is collected is a rectangle of 1×1 cm. While this shape is not entirely representative of a human finger with curves and ridges, human fingers flatten out enough to reduce the effects of curvature with even very light pressures.[31–33] This implies that for most realistic finger pressures, the contact area is largely load-independent, which is more accurately replicated with a rectangular mock finger.

      Lastly, we consider the role of fingerprint ridges. A key finding of our previous work is that while fingerprints enhanced frictional dynamics at certain conditions, key features were still maintained with a flat finger.[11] Furthermore, for some loading conditions, the more amplified signals could also result in more similar friction traces for different surfaces. We have observed good agreement between these friction traces and human experiments.[8,9,22,34]”

      Page 3-4, Materials and Methods

      “Mock Finger Preparation

      Friction forces across all six surfaces were measured using a custom apparatus with a polydimethylsiloxane (PDMS, Dow Sylgard 184) mock finger that mimics a human finger’s mechanical properties and contact mechanics while exploring a surface relatively closely.[8,9] PDMS and crosslinker were combined in a 30:1 ratio to achieve a stiffness of 100 kPa comparable to a real finger, then degassed in a vacuum desiccator for 30 minutes. We are aware that the manufacturer recommended crosslinking ratio for Sylgard 184 is 10:1 due to potential uncrosslinked liquid residues,[35] but further crosslinking concentrated at the surface prevents this. The prepared PDMS was then poured into a 1×1×5 cm mold also containing an acrylic 3D-printed “bone” to attach applied masses on top of the “fingertip” area contacting a surface during friction testing. After crosslinking in the mold at 60ºC for 1 hour, the finger was treated with UV-Ozone for 8 hours out of the mold to minimize viscoelastic tack.

      Mechanical Testing

      A custom device using our PDMS mock finger was used to collect macroscopic friction force traces replicating human exploration.[8,9] After placing a sample surface on a stage, the finger was lowered at a slight angle such that an initial 1×1 cm rectangle of “fingertip” contact area could be established. We considered a broad range of applied masses (M = 0, 25, 75, and 100 g) added onto the deadweight of the finger (6 g) observed during a tactile discrimination task. The other side of the sensor was connected to a motorized stage (V-508 PIMag Precision Linear Stage, Physikinstrumente) to control both displacement (4 mm across all conditions) and sliding velocity (v = 5, 10, 25, and 45 mm s<sup>-1</sup>). Forces were measured at all 16 combinations of mass and velocity via a 250 g Futek force sensor (k = 13.9 kN m<sup>-1</sup>) threaded to the bone, and recorded at an average sampling rate of 550 Hz with a Keithley 7510 DMM digitized multimeter. Force traces were collected in sets of 4 slides, discarding the first due to contact aging. Because some mass-velocity combinations were near the boundaries of instability phase transitions, not all force traces at these given conditions exhibited similar profiles. Thus, three sets were collected on fresh spots for each condition to observe enough occurrences of multiple instabilities, at a total of nine traces per combination for each surface.”
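
      The test matrix described in the Mechanical Testing excerpt can be checked with a short sketch. This is only an illustration of the arithmetic stated above (4 masses × 4 velocities, sets of 4 slides with the first discarded, 3 sets per condition, six surfaces); the variable names are hypothetical and are not taken from the authors' analysis code.

```python
from itertools import product

# Values quoted in the Materials and Methods excerpt above.
added_masses_g = [0, 25, 75, 100]   # masses added on top of the finger
velocities_mm_s = [5, 10, 25, 45]   # sliding velocities
FINGER_DEADWEIGHT_G = 6             # deadweight of the mock finger
SLIDES_PER_SET = 4                  # slides collected per set
DISCARDED_PER_SET = 1               # first slide dropped (contact aging)
SETS_PER_CONDITION = 3              # sets collected on fresh spots
N_SURFACES = 6                      # surfaces tested

# Every mass-velocity combination, with the total mass on the contact.
conditions = [
    {"total_mass_g": m + FINGER_DEADWEIGHT_G, "velocity_mm_s": v}
    for m, v in product(added_masses_g, velocities_mm_s)
]

# Traces kept per condition and across the whole study.
traces_per_condition = SETS_PER_CONDITION * (SLIDES_PER_SET - DISCARDED_PER_SET)
total_traces = len(conditions) * traces_per_condition * N_SURFACES

print(len(conditions), traces_per_condition, total_traces)
```

      The first two printed values correspond to the 16 combinations and nine traces per combination quoted in the text; the third is simply their product with the six surfaces and is not a figure reported by the authors.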

      Added References

      (23) Infante, V. H. P. et al. The role of skin hydration, skin deformability, and age in tactile friction and perception of materials. Sci. Rep. 15, 9935 (2025).

      (24) Nolin, A., Lo, C.-Y., Kayser, L. V. & Dhong, C. B. Transparent and Electrically Switchable Thin Film Tactile Actuators Based on Molecular Orientation. Preprint at https://doi.org/10.48550/arXiv.2411.07968 (2024).

      (25) Murai, M., Lau, H.-K., Pereira, B. P. & Pho, R. W. H. A cadaver study on volume and surface area of the fingertip. J. Hand Surg. 22, 935–941 (1997).

      (26) Abdouni, A. et al. Biophysical properties of the human finger for touch comprehension: influences of ageing and gender. R. Soc. Open Sci. (2017) doi:10.1098/rsos.170321.

      (27) Cornuault, P.-H., Carpentier, L., Bueno, M.-A., Cote, J.-M. & Monteil, G. Influence of physico-chemical, mechanical and morphological fingerpad properties on the frictional distinction of sticky/slippery surfaces. J. R. Soc. Interface (2015) doi:10.1098/rsif.2015.0495.

      (28) Qian, K. et al. Mechanical properties vary for different regions of the finger extensor apparatus. J. Biomech. 47, 3094–3099 (2014).

      (29) Yuan, Y. & Verma, R. Measuring microelastic properties of stratum corneum. Colloids Surf. B Biointerfaces 48, 6–12 (2006).

      (30) Fu, Y.-J. et al. Effect of UV-Ozone Treatment on Poly(dimethylsiloxane) Membranes: Surface Characterization and Gas Separation Performance. Langmuir 26, 4392–4399 (2010).

      Comment 3, Part 2

      The real finger has multiple layers with different moduli. In fact, the stratum corneum cells, which are the outer layer at the interface and determine the friction, have a much higher modulus than PDMS.

      We have approximated the softness of the finger with 100 kPa crosslinked PDMS, which is close to what has been reported for the bulk of a human fingertip.[9,10] However, as mentioned in the Materials and Methods, there are two additional features of the mock finger that impart greater strength. The PDMS surrounds a rigid, acrylic bone comparable to the distal phalanx, which provides an additional layer of higher modulus.[8] Additionally, the 8-hour UV-Ozone treatment decreases the viscoelastic tack of the pristine PDMS by glassifying, or further crosslinking the surface of the finger,[12] therefore imparting greater stiffness at the surface similar to the contributions of the stratum corneum, along with a similar surface energy.[13] This technique is widely used in wearables,[14] soft robotics,[15] and microfluidics[16] to induce both these material changes. Additionally, the finger is used at least a day after UV-Ozone treatment is completed to generate a stable surface that is moderately hydrophilic, similar to the outermost layer of human skin.[17]

      Comment 3, Part 3

      In addition, the slanted position of the finger can cause non-uniform pressures across the finger. Both can contribute to making the PDMS finger have much more stick-slip than a real finger.

      To ensure that there is minimal contribution from the slanted position of the finger, an initial contact area of 1×1 cm is established before sliding and recording friction measurements. As the PDMS finger is a soft object, the portion in contact with a surface flattens and the contact area remains largely unchanged during sliding. Any additional stick-slip after this alignment step is caused by contact aging at the interface, but the first trace we collect is always discarded to only consider stick-slip events caused by surface chemistry. We recognize that it is difficult to completely control the pressure distribution due to the planar interface, but this is also expected when humans freely explore a surface.

      Comment 3, Part 4

      In fact, if you look at the regime maps, there is very little space that has steady sliding. This does not represent well human exploration of surfaces. We do not tend to use a force and velocity that will cause extensive stick-slip (frequent regions of 100% stick-slip) and, in fact, the speeds used in the study are on the slow side, which also contributes to more stick-slip. At higher speeds and lower forces, all of the materials had steady sliding regions.

      We are not aware of published studies that extensively show that humans avoid stick-slip regimes. In fact, we are familiar with literature where stiction spike formation is suppressed: a recent paper by AliAbbasi, Basdogan et al. investigates electroadhesion and friction with NaCl solution-infused interfaces, resulting in significantly steadier forces.[18] We also directly showed evidence of instability formation that we observed during human exploration in Fig. 3B-C. These dynamic events are common, despite the lack of control of normal forces and sliding velocities. We also note that Reviewer 1, Comment 2, Part 2 was suggesting that we further explore possible trends from parameterizing the stiction spike.

      We note that many studies have not reached the velocities and masses required for stiction spikes, even though these masses and velocities would be routinely seen in free exploration; this is usually due to constraints of their equipment.[19] Sliding events during human free exploration of surfaces can exceed 100 mm/s for rapid touches. However, for the surfaces investigated here, we observe that large regions of stick-slip can emerge at velocities as low as 5 mm/s depending on the applied load. The incidence of steady sliding appears more dependent on the applied mass, with almost no steady sliding observed at or above 75 g. Indeed, the force categorization along our transition zones is the main point of the paper.

      Comment 3, Part 5

      Further, on these very smooth surfaces, the friction and stiction are more complex and cannot dismiss considerations such as finger material property change with sweat pore occlusion and sweat capillary forces. Also, the vertical motion of both the PDMS finger and the instructed human subjects is not the motion that humans typically use to discriminate between surfaces.

      We did not describe the task sufficiently. Humans were only given the instruction to slide their finger along a single axis from top to bottom of a sample, not vertical as in azimuthal to gravity. We have updated our wording in the manuscript to reflect this.

      Page 4

      “Participants could touch for as long as they wanted, but were asked to only use their dominant index fingers along a single axis to better mimic the conditions for instability formation during mechanical testing with the mock finger.”

      Page 11

      “The participant was then asked to explore each sample simultaneously, and ran over each surface in strokes along a single axis until the participant could decide which of the two had “more friction”.”

      Comment 3, Part 6

      Finally, fingerprints may not affect the shape and size of the contact area, but they certainly do affect the dynamic response and detection of vibrations.

      We are aware of the nuance. Our previous work on the role of fingerprints on friction experienced by a PDMS mock finger showed enhanced signals with the incorporation of ridges on the finger and used a rate-and-state model of a heterogeneous, elastic body to find corresponding trends (though there is no existing model of friction that can accurately model experiments on mesoscale friction).[11] The key conclusion was that a flat finger still preserved key dynamic features, and the presence of stronger or more vibrations could result in more similar forces for different surfaces depending on the sliding conditions.

      This is also in the context that we are seeking to provide a reasonable and experimentally accessible method to characterize surfaces, which will always improve as we get closer to replicating a true human finger. But our goal here was to replicate the finger sufficiently for use in human studies. We believe the more appropriate metric of success is whether the mock finger is more successful than traditional characterization experiments, like friction coefficient, roughness, surface energy, etc.

      Comment 4

      This all leads to the critical question, why are friction, normal force, and velocity not measured during the measured human exploration and in a systematic study using the real human finger? The authors posed an extremely interesting hypothesis that humans may alter their speed to feel the instability transition regions. This is something that could be measured with a real finger but is not likely to be correlated accurately enough to match regime boundaries with such a simplified artificial finger.

      We are excited that our manuscript offers a tractable manner to test the hypothesis that tactile decision-making models use friction instabilities as evidence. However, we lay out the challenges and barriers, and how the scope of this paper will lead us in that direction. We also clarify that our goals are to provide a method to characterize samples to better design tactile interfaces in haptics or in psychophysical experiments and raise awareness that the common methods of sample characterization in touch by an average friction coefficient or roughness are fundamentally unsound. Throughout the paper, we have made changes to reflect that our study, at this point, is only correlative.

      As discussed in the summary, and with additional detail here, to further support our findings through observation on humans would require answering:

      (1) Which one, or which combination, of the multiple swipes that people make is responsible for a tactile decision? (There is a need for a decision-making model)

      (2) Establish what is, or may be, tactile evidence.

      (3) Establish whether tactile decision-making models are similar to or different from existing decision-making models.

      (4) Design a task that does not require the use of subjective tactile descriptors, like “which one feels rougher”, which we have seen causes confusion in participants, which will likely require accounting for memory effects.

      We elaborate these points below:

      To successfully perform this experiment, we note that freely exploring humans make multiple strokes on a surface. Therefore, we would need to construct a decision-making model. It has not yet been demonstrated whether tactile decision making follows visual decision making, but perhaps to start, we can assume it does. Then, in the design of our decision-making paradigm, we immediately run into the problem: What is tactile evidence?

      From Fig. 3C, we already can see that identifying evidence is challenging. Prior to this manuscript, people may have chosen the average force, or the highest force. Or we may choose the average friction force. Then, after deciding on the evidence, we need to find a method to manipulate the evidence, i.e., create samples or a machine that causes high friction, etc. We show that during the course of human touch, due to the dynamic nature of friction, the average can change a large amount and sample design becomes a central barrier to experiments. Others may suggest immobilizing the finger and applying a known force, but given how much friction changes with human exploration, there is no known method to make a machine recreate temporally and spatially varying friction forces during sliding onto a stationary finger. Finally, perhaps most importantly, in addition to mechanical challenges, a study by Liu, Colgate et al. showed that even if they recorded the friction (2D) of a finger exploring a surface and then replicated the same friction forces onto a finger, the participant could not determine which surface the replayed friction force was supposed to represent.[1] This supports that the efference copy is important, that the forces in response to expected motion are important to determine friction. Finally, there is no known method to design instabilities a priori. They must be found through experiments, especially since if we were to introduce, say, a bump or a trough, then we would bring in confounding variables to how participants tell surfaces apart.

      Furthermore, even if we had some consistent method to create tactile “evidence”, the paradigm also deserves some consideration. In our experience, the 3-AFC task we perform is important because the vocabulary for touch has not been established. That is, in 3-AFC, by asking to determine which one sample is unlike the others, we do not have to ask the participant questions like “which one is rougher” or “which one has less friction”. In contrast, 2-AFC, which is better for decision-making models because it does not include memory, requires the asking of a perceptual question like: “which one is rougher?”. In our ongoing work, taking two silane coatings, we found that participants could easily identify which surface is unlike the others above chance in a 3-AFC, but participants, even within their own trials, could not consistently identify one silane as perceptually “rougher” by 2-AFC. To us, this calls into question the validity of tactile descriptors, but is beyond the scope of this manuscript.

      This is not our only goal, but in the context of human exploration, in this manuscript, we believed it was important to identify a mechanical parameter that was consistent with how humans explore surfaces, but was also a parameter that could characterize some consistent property of a surface, irrespective of whether a human was touching it. We thought that designing human decision-making models and paradigms around the friction coefficient would not be successful.

      Given the scope of these challenges, we do not think it would be possible to establish these conceptual sequences in a single manuscript. However, we think that our manuscript brings an important step forward to approach this problem.

      Reviewer 2 (Public review):

      Summary:

      In this paper, the authors want to test the hypothesis that frictional instabilities rather than friction are the main drivers for discriminating flat surfaces of different sub-nanometric roughness profiles.

      They first produced flat surfaces with 6 different coatings giving them unique and various properties in terms of roughness (picometer scale), contact angles (from hydrophilic to hydrophobic), friction coefficient (as measured against a mock finger), and Hurst exponent.

      Then, they used those surfaces in two different experiments. In the first experiment, they used a mock finger (PDMS of 100 kPa molded into a fingertip shape) and slid it over the surfaces at different normal forces and speeds. They categorized the sliding behavior as steady sliding, sticking spikes, and slow frictional waves by visual inspection, and show that the surfaces have different behaviors depending on normal force and speed. In a second experiment, participants (10) were asked to discriminate pairs of those surfaces. It is found that each of those pairs could be reliably discriminated by most participants.

      Finally, the participant's discrimination performance is correlated with differences in the physical attributes observed against the mock finger. The authors found a positive correlation between participants' performances and differences in the count of steady sliding against the mock finger and a negative correlation between participants' reaction time and differences in the count of stiction spikes against the mock finger. They interpret those correlations as evidence that participants use those differences to discriminate the surfaces.

      Strengths:

      The created surfaces are very interesting as they are flat at the nanometer scale, yet have different physical attributes and can be reliably discriminated.

      We thank Reviewer 2 for their notes on our manuscript. The responses below address the reviewer’s comments and recommendations for revised work.

      Weaknesses:

      Comment 1

      In my opinion, the data presented in the paper do not support the conclusions. The conclusions are based on a correlation between results obtained on the mock finger and results obtained with human participants but there is no evidence that the human participants' fingertips will behave similarly to the mock finger during the experiment. Figure 3 gives a hint that the 3 sliding behaviors can be observed in a real finger, but does not prove that the human finger will behave as the mock finger, i.e., there is no evidence that the phase maps in Figure 1C are similar for human fingers and across different people that can have very different stiffness and moisture levels.

      We have made changes throughout the manuscript to acknowledge that our findings are correlative, clarifying this throughout, and incorporating into the discussion how our work may enable biomechanical measurements and tactile decision making models.

      The mechanical characterization conducted with the mock finger seeks to extract significant features of friction traces of a set of surfaces to use as predictors of tactile discriminability. The goal is to find a consistent method to characterize surfaces for use in tactile experiments that can be replicated by others and used prior to any human experiments. However, in the overall response and in a response to a similar comment by Reviewer 1 (recreated below), we also explain why we believe experiments on humans to establish this fact are not yet reasonable.

      First, we discuss the mock finger. The PDMS finger is treated to have comparable surface and bulk properties to a human finger. We have approximated the softness of the finger with 100 kPa crosslinked PDMS, which is close to what has been reported for the bulk of a human fingertip.[9,10] However, as mentioned in the Materials and Methods, there are two additional features of the mock finger that impart greater strength. The PDMS surrounds a rigid, acrylic bone comparable to the distal phalanx, which provides an additional layer of higher modulus.[8] Additionally, the 8-hour UV-Ozone treatment decreases the viscoelastic tack of the pristine PDMS by glassifying, or further crosslinking the surface of the finger,[12] therefore imparting greater stiffness at the surface similar to the contributions of the stratum corneum, along with a similar surface energy.[13] Additionally, the finger is used at least a day after UV-Ozone treatment is completed in order for the surface to return to moderate hydrophilicity, similar to the outermost layer of human skin.[17]

      We also discuss the shape of the contact formed. To ensure that there is minimal contribution from the slanted position of the finger, an initial contact area of 1×1 cm is established before sliding and recording friction measurements. As the PDMS finger is a soft object, the portion in contact with a surface flattens and the contact area remains largely unchanged during sliding. Any additional stick-slip after this alignment step is caused by contact aging at the interface, but the first trace we collect is always discarded to only consider stick-slip events caused by surface chemistry. We recognize that it is difficult to completely control the pressure distribution due to the planar interface, but this is also expected when humans freely explore a surface.

      Finally, we consider flat vs. fingerprinted fingers. Our previous work on the role of fingerprints on friction experienced by a PDMS mock finger showed enhanced signals with the incorporation of ridges on the finger and used a rate-and-state model of a heterogeneous, elastic body to find corresponding trends.[11] The key conclusion was that a flat finger still preserved key dynamic features, and the presence of stronger or more vibrations could result in more similar forces for different surfaces depending on the sliding conditions. We note that we have subsequently used this flat mock finger in correlations with human psychophysics in previous work, where findings from our mechanical experiments were predictive of human performance.[4–7] We have added these details to the manuscript.

      With this adequately similar mock finger, we collected friction traces at controlled conditions of normal force and velocity in order to extract the signals unique to each material that are not caused by the influence of human variability. For example, we observe the smallest regions of steady sliding on our phase maps (Fig. 1C) for short-chain alkylsilanes C4 and C5, while the increased intermolecular forces of other silanes increase the incidence of steady sliding. We have also previously shown that comparisons of similarly collected mechanical data are predictive of human performance, using the cross-correlations between signals of two different materials.[4–7] While different participants produce different raw signals, we see that broad categories of stick-slip, i.e. instabilities, can be extracted (Fig. 3B-C) and used as a cue in a tactile discrimination task. As mentioned above, we have provided an additional section about the usefulness of our mock finger, as well as its structure, in the main manuscript.

      Second, we lay out the challenges and barriers to demonstrating this in humans in the manner requested by the reviewer, and how the scope of this paper will lead us in that direction. We also clarify that our goals are to provide a method to characterize samples to better design tactile interfaces in haptics or in psychophysical experiments and raise awareness that the common methods of sample characterization in touch by an average friction coefficient or roughness are fundamentally unsound.

      As discussed in the summary, and with additional detail here, to further support our findings through observation on humans would require answering:

      (1) Which one, or which combination, of the multiple swipes that people make is responsible for a tactile decision?

      (2) Establish what is, or may be, tactile evidence.

      (3) Establish whether tactile decision-making models are similar to or different from existing decision-making models.

      (4) Test the hypothesis, in these models, that friction instabilities are evidence, and not some other unknown metric.

      (5) Design a task that does not require the use of subjective tactile descriptors, like “which one feels rougher”, which we have seen causes confusion in participants, which will likely require accounting for memory effects.

      We elaborate these points below:

      To successfully perform this experiment, we note that freely exploring humans make multiple strokes on a surface. Therefore, we would need to construct a decision-making model. It has not yet been demonstrated whether tactile decision making follows visual decision making, but perhaps to start, we can assume it does. Then, in the design of our decision-making paradigm, we immediately run into the problem: What is tactile evidence?

      From Fig. 3C, we already can see that identifying evidence is challenging. Prior to this manuscript, people may have chosen the average force, or the highest force. Or we may choose the average friction force. Then, after deciding on the evidence, we need to find a method to manipulate the evidence, i.e., create samples or a machine that causes high friction, etc. We show that during the course of human touch, due to the dynamic nature of friction, the average can change a large amount and sample design becomes a central barrier to experiments. Others may suggest immobilizing the finger and applying a known force, but given how much friction changes with human exploration, there is no known method to make a machine recreate temporally and spatially varying friction forces during sliding onto a stationary finger. Finally, perhaps most importantly, in addition to mechanical challenges, a study by Liu, Colgate, et al. showed that even if they recorded the friction (2D) of a finger exploring a surface and then replicated the same friction forces onto a finger, the participant could not determine which surface the replayed friction force was supposed to represent.[1] This supports that the efference copy is important, that the forces in response to expected motion are important to determine friction. Finally, there is no known method to design instabilities a priori. They must be found through experiments, especially since if we were to introduce, say a bump or a trough, then we bring in confounding variables to how participants tell surfaces apart.

      Furthermore, even if we had some consistent method to create tactile “evidence”, the paradigm also deserves some consideration. In our experience, the 3-AFC task we perform is important because the vocabulary for touch has not been established. That is, in 3-AFC, by asking to determine which one sample is unlike the others, we do not have to ask the participant questions like “which one is rougher” or “which one has less friction”. In contrast, 2-AFC, which is better for decision-making models because it does not include memory, requires the asking of a perceptual question like: “which one is rougher?”. In our ongoing work, taking two silane coatings, we found that participants could easily identify which surface is unlike the others above chance in a 3-AFC, but participants, even within their own trials, could not consistently identify one silane as perceptually “rougher” by 2-AFC. To us, this calls into question the validity of tactile descriptors, but is beyond the scope of the current manuscript.

      This is not our only goal, but in the context of human exploration, in this manuscript, we believed it was important to identify a mechanical parameter that was consistent with how humans explore surfaces, but was also a parameter that could characterize some consistent property of a surface, irrespective of whether a human was touching it. We thought that designing human decision-making models and paradigms around the friction coefficient would not be successful.

      Given the scope of these challenges, we do not think it would be possible to establish this conceptual sequence in a single manuscript.

      See Reviewer 1, Comment 3, Part 3 for changes to the manuscript.

      Comment 2

      I believe that the authors collected the contact forces during the psychophysics experiments, so this shortcoming could be solved if the authors use the actual data, and show that the participant responses can be better predicted by the occurrence of frictional instabilities than by the usual metrics on a trial by trial basis, or at least on a subject by subject basis. I.e. Poor performers should show fewer signs of differences in the sliding behaviors than good performers.

      To fully implement this, a decision-making model is necessary because, as a counter example, a participant could have generated 10 swipes of SFW and 1 swipe of a Sp, but the Sp may have been the most important event for making a tactile decision. This type of scenario is not compatible with the analysis suggested — and similar counterpoints can be made for other types of seemingly straightforward analysis.

      While we are interested and actively working on this, the study here is critical to establish types of evidence for a future decision-making model. We know humans change their friction constantly during real exploration, so it is unclear which of these constantly changing values we should input into the decision making model, and the future challenges we anticipate are explained in Weaknesses, Comment 1.

      Comment 3

      The sample size (10) is very small.

      We recognize that, with all factors being equal, this sample size is on the smaller end. However, we emphasize that the degree of control over the samples is far above typical, with minimal variations in sample properties such as surface roughness, and every sample for every trial was pristine. Furthermore, sample preparation (> 300 individual wafers were used) became a limiting factor. Although not typically appropriate, and thus not included in the manuscript, a post-hoc power analysis for the 100 trials of the pair that was closest to chance, P4 (53% accuracy, with chance at 33%), showed a power of 98.2%, suggesting that the study was appropriately powered.
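      The kind of post-hoc power calculation described above can be sketched as follows. The response does not state which test or significance level was used, so this sketch assumes a one-sided exact binomial test at alpha = 0.05; the function names are hypothetical, and the result is not expected to reproduce the reported 98.2% exactly.

```python
from math import comb


def upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def post_hoc_power(n: int, p_chance: float, p_observed: float, alpha: float = 0.05):
    """Return (critical count, power) for a one-sided exact binomial test.

    Assumption of this sketch: the test and alpha are not given in the response.
    """
    # Smallest number of correct trials whose tail probability under chance is <= alpha.
    k_crit = next(k for k in range(n + 2) if upper_tail(k, n, p_chance) <= alpha)
    # Power: probability of reaching that count if true accuracy equals the observed one.
    return k_crit, upper_tail(k_crit, n, p_observed)


# Values quoted in the response: 100 trials, chance = 1/3 (3-AFC), observed accuracy = 53%.
k_crit, power = post_hoc_power(n=100, p_chance=1 / 3, p_observed=0.53)
print(f"critical count = {k_crit}, power = {power:.3f}")
```

      A two-sided test or a normal approximation would shift the critical count slightly and therefore change the computed power.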

      Reviewer 2 (Recommendations for the authors):

      Comment 1

      Differences in SS and Sp (Table 2) are NOT physical or mechanical differences but are obtained by counting differences in the number of occurrences of each sliding behavior. It is rather a weird choice.

      We disagree that differences in SS and Sp are not physical or mechanical, as these are well-established phenomena in the soft matter and tribology literature.[20–22] These are known as “mechanical instabilities” and are generated by two physical phenomena: the elasticity of the finger (which is constant in our mechanical testing) and the friction forces present (which change per sample type). The motivation behind using these different shapes is that the instabilities, in some conditions, can be invariant to external factors like velocity. This would be quite advantageous for human exploration because, unlike friction coefficient, which changes with nearly any factor, including velocity and mass, the instabilities being invariant to velocity would mean that we are accurately characterizing a unique identifier of the surface even though velocity may be variable.

      This “weird choice” is the central innovation of this paper. This choice was necessary because we demonstrated that the common usage of friction coefficient is fundamentally flawed: we see that friction coefficient suggests that surfaces which are more different would feel more similar – indeed, the most distinctive surfaces would be two surfaces that are identical, which is clearly spurious. Furthermore, Table 1 now includes the range of friction generated on a surface; the range of friction coefficients of a single surface is large – of the order of the differences in friction between two surfaces. This is expected in soft sliding systems and emphasizes our issue with the use of average friction coefficient in psychophysical design. One potential explanation for why we were able to see this effect is that our surfaces have similar (< 0.6 nm variability) roughness, removing potential confounding factors from large-scale roughness, and this type of low-roughness control has not been widely used in tactile studies to the best of our knowledge.

      Comment 2

      Figures 2B-C: why are the x-data different than Table 2?

      The x-data in Fig. 2B-C are the absolute differences in the number of occurrences measured for a given instability type or material property out of 144 pulls. Modeling the human participant results in our GLMMs required the independent variables to be in this form rather than percentages. We initially chose to list percent differences in Table 2 to highlight the ranges of differences instead of an absolute value, but have added both for clarity.

      Our changes to the manuscript

      Page 7

      “To determine if humans can detect these three different instabilities, we selected six pairs of surfaces to create a broad range of potential instabilities present across all three types. These are summarized in Table 2, where the first column for each instability is the difference in occurrence of that instability formed between each pair, and the second is the percent difference.”

      “Thus, when comparing C4 versus C4-APTMS, they have a difference in steady sliding of 20 out of a maximum 144 pulls, for a |ΔSS| of 13.9%. The absolute value is taken to compare total differences present, as the psychophysical task does not distinguish between sample order.”
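      The relationship between the absolute difference in occurrences and the percent difference can be illustrated with a short sketch. The function name and the two example counts are hypothetical; only the difference of 20 pulls and the maximum of 144 pulls come from the text.

```python
MAX_PULLS = 144  # maximum number of pulls per surface, as stated in the response


def occurrence_difference(count_a: int, count_b: int, total: int = MAX_PULLS):
    """Return (absolute difference in occurrences, percent difference)."""
    # The absolute value is used because the task does not distinguish sample order.
    abs_diff = abs(count_a - count_b)
    return abs_diff, 100.0 * abs_diff / total


# Hypothetical counts, chosen only so that they differ by 20 pulls.
abs_diff, pct = occurrence_difference(50, 30)
print(abs_diff, round(pct, 1))  # 20 13.9
```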

      Comment 3

      "We constructed a set of coated surfaces with physical differences which were imperceptible by touch but created different types of instabilities based on how quickly a finger is slid and how hard a human finger is pressed during sliding." Yet, in your experiment, participants could discriminate them, so this is incoherent.

      To clarify the point, macroscopic objects can differ in physical shape and in chemical composition. What we meant was that the physical differences, i.e., roughness, were below a limit (Skedung et al.) that participants, without a coating, would not be able to tell these apart.[23] Therefore, the reason people could tell our surfaces apart was due to the chemical composition of the surface, and not any differences in roughness or physical effects like film stiffness (due to the molecular-scale thinness of the surface coatings, they are mechanically negligible). However, we concede that at the molecular scale, the traditional macroscopic distinction between physical and chemical is blurred.

      We have made minor revisions to the wording in the abstract. We clarify that the surface coatings had physical differences in roughness that were smaller than 0.6 nm, which, based purely on roughness, would not be expected to be distinguishable to participants. Therefore, the reason participants can tell these surfaces apart is due to differences in friction generated by chemical composition, and we were able to minimize contributions from physical differences in the samples in our study.

      Our changes to the manuscript

      Page 1, Abstract

      “Here, we constructed a set of coated surfaces with minimal physical differences that by themselves, are not perceptible to people, but instead, due to modification in surface chemistry, the surfaces created different types of instabilities based on how quickly a finger is slid and how hard a human finger is pressed during sliding.”

      “In one experiment, we used a mechanical mock finger to quantify and classify differences in instability formation from different coated surfaces. In a second experiment, participants perform a discrimination task using the same coated surfaces. Using the data from these two experiments, we found that human discrimination response times were faster with surfaces where the mock finger produced more stiction spikes and discrimination accuracy was higher where the mock finger produced more steady sliding. Conversely, traditional metrics like surface roughness or average friction coefficient did not relate to tactile discriminability. In fact, the typical method of averaging friction coefficients led to a spurious correlation which erroneously suggests that distinct objects should feel identical and identical objects should feel distinct—similar to findings by others. Friction instabilities may offer a more predictive and tractable framework of fine touch perception than friction coefficients, which would accelerate the design of tactile interfaces.”

      Reviewer 3 (Public review):

      Strengths

      The paper describes a new perspective on friction perception, with the hypothesis that humans are sensitive to the instabilities of the surface rather than the coefficient of friction. The paper is very well written and with a comprehensive literature survey.

      One of the central tools used by the author to characterize the frictional behavior is the frictional instabilities maps. With these maps, it becomes clear that two different surfaces can have both similar and different behavior depending on the normal force and the speed of exploration. It puts forward that friction is a complicated phenomenon, especially for soft materials.

      The psychophysics study is centered around an odd-one-out protocol, which has the advantage of avoiding any external reference to what would mean friction or texture for example. The comparisons are made only based on the texture being similar or not.

      The results show a significant relationship between the distance between frictional maps and the success rate in discriminating two kinds of surface.

      We thank Reviewer 3 for their notes and interesting discussion points on our manuscript. Below, we address the reviewer’s feedback and comments on related works.

      Weaknesses:

      Comment 1

      The main weakness of the paper comes from the fact that the frictional maps and the extensive psychophysics study are not made at the same time, nor with the same finger. The frictional maps are produced with an artificial finger made out of PDMS which is a poor substitute for the complex tribological properties of skin.

      A similar comment was made by Reviewers 1 and 2. We agree in part and have made changes throughout to clarify that our study is correlative, but that it presents an important step forward toward these biomechanical measurements and corresponding decision-making models.

      We are not claiming that our PDMS fingers are superior to real fingers, but rather, we cannot establish standards in the field by using real human fingers that vary between subjects and researchers. We believe the mock finger we designed is a reasonable mimic of the human finger by matching surface energy, heterogeneous mechanical structure, and the ability to test multiple physiologically relevant pressures and sliding velocities.

      We achieve a heterogeneous mechanical structure with the 3 primary components of stiffness of a human finger. The effective modulus of ~100 kPa, from soft tissue,[9,10] is obtained with a 30:1 ratio of PDMS to crosslinker. The PDMS also surrounds a rigid, acrylic bone comparable to the distal phalanx, which provides an additional layer of higher modulus.[8] Additionally, the 8-hour UV-Ozone treatment decreases the viscoelastic tack of the pristine PDMS by glassifying, or further crosslinking, the surface of the finger,[12] therefore imparting greater stiffness at the surface similar to the contributions of the stratum corneum, along with a similar surface energy.[13] The finger is used at least a day after UV-Ozone treatment is completed in order for the surface to return to moderate hydrophilicity, similar to the outermost layer of human skin.[17]

      We also discuss the shape of the contact formed. To ensure that there is minimal contribution from the slanted position of the finger, an initial contact area of 1×1 cm is established before sliding and recording friction measurements. As the PDMS finger is a soft object, the portion in contact with a surface flattens and the contact area remains largely unchanged during sliding. We recognize that it is difficult to completely control the pressure distribution due to the planar interface, but this variation is also expected when humans freely explore a surface.

      Finally, we consider flat vs. fingerprinted fingers. Our previous work on the role of fingerprints on friction experienced by a PDMS mock finger showed enhanced signals with the incorporation of ridges on the finger and used a rate-and-state model of a heterogeneous, elastic body to find corresponding trends.[11] The key conclusion was that a flat finger still preserved key dynamic features, and the presence of stronger or more vibrations could result in more similar forces for different surfaces depending on the sliding conditions.

      We note that we have used the controlled mechanical data collected with this flat mock finger in correlations with human psychophysics in previous work, where findings from our mechanical experiments were predictive of human performance.[4–7] Ultimately, we see from our prior work and here that, despite the drawbacks of our mock finger, it outperforms other standard characterization techniques in providing information about the mesoscale that correlates to tactile perception. We have added these details to the manuscript.

      We also note that an intermediate option, replicating real fingers, even in a mold, may inadvertently limit trends from characterization to a specific finger. One of the main – and severe – limitations of using a human finger is that all fingers are different, meaning any study focusing on a particular user may not apply to others or be recreated easily by other researchers. We cannot set a standard for replication around a real human finger as that participant may no longer be available, or willing to travel the world as a “standard”. Furthermore, the way in which a single person changes their pressures and velocities as they touch a surface is highly variable. As noted in the Summary Response, a study by Colgate et al. (IEEE ToH 2024) demonstrated that efference copies may be important, and thus constraining a human finger and replaying the forces recorded during free exploration will not lead to the participant identifying a surface with any consistency. Thus, it is important to allow humans to freely explore surfaces, but doing so creates nearly limitless variability in friction forces.

      This is also against the backdrop that we are seeking to provide a method to characterize surfaces. Indeed, the more features of a human finger we replicate in the mock finger, the more likely it is that the mechanical data will correlate to human performance. However, we have used this technique several times to achieve stronger correlations to human data than other available techniques. We believe the metric of success should be in comparison to the available characterization techniques, rather than a 1:1 reconstruction of forces of an arbitrary human finger. Indeed, a 1:1 reconstruction of forces of an arbitrary human finger would be limited to the finger of a single individual, perhaps even to that individual on a given day.

      See Reviewer 1 weaknesses, comment 2 part 2 for changes to the manuscript

      Comment 2

      The evidence would have been much stronger if the measurement of the interaction was done during the psychophysical experiment. In addition, because of the protocol, the correlation is based on aggregates rather than on individual interactions.

      We agree that this would have helped further establish our argument, but in the overall statement and in other reviewer responses, we describe the significant challenges to establishing this.

      To fully implement this, a decision-making model is necessary because, as a counter example, a participant could have generated 10 swipes of SFW and 1 swipe of a Sp, but the Sp may have been the most important event for making a tactile decision. We also clarify that our goals are to provide a method to characterize samples to better design tactile interfaces in haptics or in psychophysical experiments.

      As discussed in the summary, and expanded on here, in our view, to develop a decision-making model, the challenges are as follows:

      (1) Which one, or which combination, of the multiple swipes that people make is responsible for a tactile decision?

      (2) Establish what is, or may be, tactile evidence.

      (3) Establish whether tactile decision-making models are similar to or different from existing decision-making models.

      (4) Test the hypothesis, in these models, that friction instabilities are evidence, and not some other unknown metric.

      (5) Design a task that does not require the use of subjective tactile descriptors, like “which one feels rougher”, which we see cause confusion in participants; such a task will likely require accounting for memory effects.

      (6) Design samples that vary in the amount of evidence generated, but this evidence cannot be controlled directly. Rather, the samples indirectly vary evidence by how likely it is for a human to generate different types of friction instabilities during standard exploration.

      We elaborate these points below:

      To successfully perform this experiment, we note that freely exploring humans make multiple strokes on a surface. Therefore, we would need to construct a decision-making model. It has not yet been demonstrated whether tactile decision making follows visual decision making, but perhaps to start, we can assume it does. Then, in the design of our decision-making paradigm, we immediately run into the problem: What is tactile evidence?

      From Fig. 3C, we already can see that identifying evidence is challenging. Prior to this manuscript, people may have chosen the average force, or the highest force. Or we may choose the average friction force. Then, after deciding on the evidence, we need to find a method to manipulate the evidence, i.e., create samples or a machine that causes high friction, etc. We show that during the course of human touch, due to the dynamic nature of friction, the average can change a large amount and sample design becomes a central barrier to experiments. Others may suggest immobilizing the finger and applying a known force, but given how much friction changes with human exploration, there is no known method to make a machine recreate temporally and spatially varying friction forces during sliding onto a stationary finger. Moreover, and perhaps most importantly, in addition to mechanical challenges, a study by Liu, Colgate et al. showed that even if they recorded the friction (2D) of a finger exploring a surface and then replicated the same friction forces onto a finger, the participant could not determine which surface the replayed friction force was supposed to represent.[1] This supports that the efference copy is important, that the forces in response to expected motion are important to determine friction. Finally, there is no known method to design instabilities a priori. They must be found through experiments, especially since if we were to introduce, say, a bump or a trough, then we would bring in confounding variables to how participants tell surfaces apart.

      Furthermore, even if we had some consistent method to create tactile “evidence”, the paradigm also deserves some consideration. In our experience, the 3-AFC task we perform is important because the vocabulary for touch has not been established. That is, in 3-AFC, by asking to determine which one sample is unlike the others, we do not have to ask the participant questions like “which one is rougher” or “which one has less friction”. In contrast, 2-AFC, which is better for decision-making models because it does not include memory, requires the asking of a perceptual question like: “which one is rougher?”. In our ongoing work, taking two silane coatings, we found that participants could easily identify which surface is unlike the others above chance in a 3-AFC, but participants, even within their own trials, could not consistently identify one silane as perceptually “rougher” by 2-AFC. To us, this calls into question the validity of tactile descriptors, but is beyond the scope of the current manuscript.

      This is not our only goal, but in the context of human exploration, we believed it was important in this manuscript to identify a mechanical parameter that was consistent with how humans explore surfaces, but that could also characterize some consistent property of a surface – irrespective of whether a human was touching it. We thought that designing human decision-making models and paradigms around the friction coefficient would not be successful.

      Given the scope of these challenges, we do not think it would be possible to establish this conceptual sequence in a single manuscript.

      Comment 3

      The authors compensate with a third experiment where they used a 2AFC protocol and an online force measurement. But the results of this third study fail to establish the relation convincingly.

      With this experiment, our central goal was to demonstrate that the instabilities we have identified with the PDMS finger also occur with a human finger. Several instances of SS, Sp, and SFW were recorded with this setup as a participant touched surfaces in real time.

      Comment 4

      No map of the real finger interaction is shown, bringing doubt to the validity of the frictional map for something as variable as human fingers.

      Real fingers change constantly during exploration, and friction is state-dependent, meaning that the friction will depend on how the person was moving the moment prior. Therefore, a map is only valid for a single human movement – even if participants all were instructed to take a single swipe and start from zero motion, humans are unable to maintain constant velocities and pressures. Clearly, this is not sustainable for any analysis, and these drawbacks apply to any measured parameter, whether instabilities suggested here, or friction coefficients used throughout. We believe the difficulty of this approach emphasizes why a standard map of characterization of a surface by a mock finger, even with its drawbacks, is a viable path forward.

      Reviewer 3 (Recommendations for the authors):

      Comment 1

      It would be interesting to comment on a potential connection between the frictional instability maps and Schallamach waves.

      Schallamach waves are a subset of slow frictional waves (SFW). Schallamach waves are very specifically defined in the field. They occur when pockets of air form between a soft sliding object and a rigid surface and then propagate rear-to-front (retrograde waves) relative to the sliding motion, forming buckles due to adhesive pinning. Wrinkles then form at the detached portion of the soft material, until the interface reattaches and the process repeats.[24] There is typically a high burden of proof to establish a Schallamach wave over a more general slow frictional wave. We note that it would be exceedingly difficult to design samples that can reliably create subsets of SFW, but we are aware that this may be an interesting question at a future point in our work.

      Comment 2

      The force sensors look very compliant, and given the dynamic nature of the signal, it is important to characterize the frequency response of the system to make sure that the fluctuations are not amplified.

      Thank you for noticing. We mistyped the sensor spring constant as 13.9 N m<sup>-1</sup> instead of 13.9 kN m<sup>-1</sup>. Below, we show how the instabilities are derived from the mechanics at the interface due to the compliance of the finger. The “springs” of the force sensor and PDMS finger are connected in series. Since k<sub>sensor</sub> = 13.9 kN m<sup>-1</sup> is much stiffer than the finger, the spring constant of the system overall reflects the compliance of the finger, and highlights that the oscillations arise solely from stick-slip. A sample calculation is shown below.

      Author response image 1.

      Fitting a line to the initial slope of the force trace for C6 gives the equation y = 25.679x – 0.2149. The slope here represents force data over time data, and is divided by the velocity (25 mm/s) to determine the spring constant of the system, k<sub>total</sub> = 1027.16 N/m. This value is lower than k<sub>sensor</sub> = 13.9 kN/m, indicating that the “springs” representing the force sensor and PDMS finger are connected in series:

      1/k<sub>total</sub> = 1/k<sub>sensor</sub> + 1/k<sub>finger</sub>

      The finger is the compliant component of the system, with k<sub>finger</sub> = 1.11 kN/m, and of course, real human fingers are also compliant so this matches our goals with the design of the mock finger.
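      The sample calculation can be checked numerically. This is a minimal sketch that uses only the values quoted in the response. It assumes the standard series-spring relation 1/k_total = 1/k_sensor + 1/k_finger, which is the combination that yields a total stiffness below the sensor stiffness; the variable names are illustrative.

```python
# Values quoted in the response.
slope_n_per_s = 25.679    # fitted initial slope of the C6 force trace (force per unit time)
velocity_m_per_s = 25e-3  # sliding velocity, 25 mm/s
k_sensor = 13.9e3         # sensor spring constant, N/m

# Spring constant of the whole system: force per unit displacement.
k_total = slope_n_per_s / velocity_m_per_s

# Assumed series-spring relation: 1/k_total = 1/k_sensor + 1/k_finger.
k_finger = 1.0 / (1.0 / k_total - 1.0 / k_sensor)

print(f"k_total  = {k_total:.2f} N/m")
print(f"k_finger = {k_finger / 1e3:.2f} kN/m")
```

      Because the sensor is more than ten times stiffer than the finger, the total stiffness stays close to that of the finger.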

      Our changes to the manuscript

      (Page 4) (k = 13.9 kN m<sup>-1</sup>)

      Comment 3

      The authors should discuss the stochastic nature of friction: Wiertlewski, Hudin, Hayward, IEEE WHC 2011; Greenspon, McLellan, Lieber, Bensmaia, JRSI 2020.

      We believe that, given the references, this comment on “stochastic” refers to the macroscopically observable fluctuations (i.e., the mechanical “noise” which is not due to instrument noise) in friction arising from the discordant network of stick-slip phenomena occurring throughout the contact zone, and not to the stochastic nature of nanoscale friction that arises from thermal fluctuations or from statistical distributions in bond breaking associated with soft contact.

      We first note that our small-scale fluctuations do not arise from a periodic surface texture that dominates in the frequency regime. However, even on our comparatively smooth surfaces, we do expect fluctuations due to nanoscale variation in contact, generation of stick-slip across microscale length scales that occurs either concurrently or discordantly across the contact zone, and the nonlinear dependence of friction on nearly any variation in state and composition.[11]

      Perhaps most relevant to the manuscript is that a major advantage of analysis by friction instabilities is that it sidesteps these ever-present microscale fluctuations, leading to more clearly defined classifiers or categories during analysis. Wiertlewski et al. showed that repeated measurements in their systems ultimately gave rise to consistent frequencies[25] (we think their system was in a steady sliding regime and the patterning gave rise to underlying macroscopic waves). These consistent frequencies, at least in soft systems and absent obvious macroscopic patterned features, would be expected to arise from the instability categories and we see them throughout.

      Comment 4

      It is stated that "we observed a spurious, negative correlation between friction coefficient and accuracy".

      What makes you qualify that correlation as spurious?

      We mean this as in the statistical definition of “spurious”.

      This correlation would indicate that by the metric of friction coefficient, more different surfaces are perceived more similarly. Thus, two very different surfaces, like Teflon and sandpaper, by friction coefficient would be expected to feel very similar. Two nearly identical surfaces would be expected to feel very different – but of course, humans cannot consistently distinguish two identical surfaces. This finding is counterintuitive and refutes that friction coefficient is a reliable classifier of surfaces by touch. We do not think it is productive to determine a mechanism for a spurious correlation, but perhaps one reason we were able to observe this is because our study, to the best of our knowledge, is unique for having samples that are controlled in their physical differences in roughness and surface features.

      See response to Reviewer 1 weaknesses, comment 1 for changes to the manuscript

      Comment 5

      The authors should comment on the influence of friction on perceptual invariance. Despite inducing radically different frictional behavior for various conditions, these surfaces are stably perceived. Maybe this is a sign that humans extract a different metric?

      We agree – we are excited that frictional instabilities may offer a more stable perceptual cue because they are not prone to fluctuations (as discussed in Comment 3) and instability formation, in many conditions, is invariant to applied pressures and velocities – thus forming large zones where a human may reasonably encounter a given instability.

      Raw friction is highly prone to variation during human exploration (in alignment with Recommendations for the authors, Comment 3), but ongoing work seeks to explain tactile constancy, or the ability to identify objects despite these large changes in force. Very recently published work by Fehlberg et al. identified the role of modulating finger speed and normal force in amplifying the differences in friction coefficient between materials in order to identify them,[26] and we postulate that their work may be streamlined by, and consistent with, the idea of friction instabilities, though we have not had a chance to discuss this in depth with the authors yet.

      We think that the instability maps show a viable path forward to how surfaces are stably perceived, and instabilities themselves show a potential mechanism: mathematically, instabilities for given conditions can be invariant to velocity or mass, creating zones where a certain instability is encountered. This reduces the immense variability of friction to a smaller, more stable classification of surfaces (e.g., a 30% SS surface or a 60% SS surface). A given surface will typically produce the same instability at a specific condition (we found some boundaries of experimental parameters are very condition sensitive, but many conditions are not), whereas a single friction trace which is highly prone to variation is not a stable metric.

      Added Reference

      (53) M. Fehlberg, E. Monfort, S. Saikumar, K. Drewing and R. Bennewitz, IEEE Trans. Haptics, 2024, 17, 957–963.

      References

      (1) Liu, Z., Kim, J.-T., Rogers, J. A., Klatzky, R. L. & Colgate, J. E. Realism of Tactile Texture Playback: A Combination of Stretch and Vibration. IEEE Trans. Haptics 17, 441–450 (2024).

      (2) Waters, I., Alazmani, A. & Culmer, P. Engineering Incipient Slip Into Surgical Graspers to Enhance Grasp Performance. IEEE Transactions on Medical Robotics and Bionics 2, 541–544 (2020).

      (3) Gueorguiev, D., Bochereau, S., Mouraux, A., Hayward, V. & Thonnard, J.-L. Touch uses frictional cues to discriminate flat materials. Sci Rep 6, 25553 (2016).

      (4) Carpenter, C. W. et al. Human ability to discriminate surface chemistry by touch. Mater. Horiz. 5, 70–77 (2018).

      (5) Nolin, A. et al. Predicting human touch sensitivity to single atom substitutions in surface monolayers for molecular control in tactile interfaces. Soft Matter 17, 5050–5060 (2021).

      (6) Nolin, A. et al. Controlling fine touch sensations with polymer tacticity and crystallinity. Soft Matter 18, 3928–3940 (2022).

      (7) Swain, Z. et al. Self-Assembled Thin Films as Alternative Surface Textures in Assistive Aids with Users Who are Blind. J. Mater. Chem. B (2024) doi:10.1039/D4TB01646G.

      (8) Qian, K. et al. Mechanical properties vary for different regions of the finger extensor apparatus. J Biomech 47, 3094–3099 (2014).

      (9) Abdouni, A. et al. Biophysical properties of the human finger for touch comprehension: influences of ageing and gender. Royal Society Open Science (2017) doi:10.1098/rsos.170321.

      (10) Cornuault, P.-H., Carpentier, L., Bueno, M.-A., Cote, J.-M. & Monteil, G. Influence of physicochemical, mechanical and morphological fingerpad properties on the frictional distinction of sticky/slippery surfaces. Journal of The Royal Society Interface (2015) doi:10.1098/rsif.2015.0495.

      (11) Dhong, C. et al. Role of fingerprint-inspired relief structures in elastomeric slabs for detecting frictional differences arising from surface monolayers. Soft Matter 14, 7483–7491 (2018).

      (12) Fu, Y.-J. et al. Effect of UV-Ozone Treatment on Poly(dimethylsiloxane) Membranes: Surface Characterization and Gas Separation Performance. Langmuir 26, 4392–4399 (2010).

      (13) Yuan, Y. & Verma, R. Measuring microelastic properties of stratum corneum. Colloids Surf B Biointerfaces 48, 6–12 (2006).

      (14) Yu, G. et al. A wearable pressure sensor based on ultra-violet/ozone microstructured carbon nanotube/polydimethylsiloxane arrays for electronic skins. Nanotechnology 29, 115502 (2018).

      (15) Zheng, L. et al. Dual-Stimulus Smart Actuator and Robot Hand Based on a Vapor-Responsive PDMS Film and Triboelectric Nanogenerator. ACS Appl. Mater. Interfaces 11, 42504–42511 (2019).

      (16) Ma, K., Rivera, J., Hirasaki, G. J. & Biswal, S. L. Wettability control and patterning of PDMS using UV–ozone and water immersion. Journal of Colloid and Interface Science 363, 371–378 (2011).

      (17) Mavon, A. et al. Sebum and stratum corneum lipids increase human skin surface free energy as determined from contact angle measurements: A study on two anatomical sites. Colloids and Surfaces B: Biointerfaces 8, 147–155 (1997).

      (18) AliAbbasi, E. et al. Effect of Finger Moisture on Tactile Perception of Electroadhesion. IEEE Trans. Haptics 17, 841–849 (2024).

      (19) Corniani, G. et al. Sub-surface deformation of individual fingerprint ridges during tactile interactions. eLife 13, (2024).

      (20) Israelachvili, J. N. Intermolecular and Surface Forces. (Academic Press, 2011).

      (21) Das, S. et al. Stick–slip friction of gecko-mimetic flaps on smooth and rough surfaces. J R Soc Interface 12, 20141346 (2015).

      (22) Persson, B. N. J., Albohr, O., Creton, C. & Peveri, V. Contact area between a viscoelastic solid and a hard, randomly rough, substrate. The Journal of Chemical Physics 120, 8779–8793 (2004).

      (23) Skedung, L. et al. Feeling Small: Exploring the Tactile Perception Limits. Sci Rep 3, 2617 (2013).

      (24) Viswanathan, K., Sundaram, N. K. & Chandrasekar, S. Stick-slip at soft adhesive interfaces mediated by slow frictional waves. Soft Matter 12, 5265–5275 (2016).

      (25) Wiertlewski, M., Hudin, C. & Hayward, V. On the 1/f noise and non-integer harmonic decay of the interaction of a finger sliding on flat and sinusoidal surfaces. in 2011 IEEE World Haptics Conference 25–30 (2011). doi:10.1109/WHC.2011.5945456.

      (26) Fehlberg, M., Monfort, E., Saikumar, S., Drewing, K. & Bennewitz, R. Perceptual Constancy in the Speed Dependence of Friction During Active Tactile Exploration. IEEE Transactions on Haptics 17, 957–963 (2024).

    1. The title of the article makes a simple, striking claim about the state of the scientific literature, with a numerical estimate of the proportion of “fake” articles. Yet, in contrast to this title, in the text of the article Heathers is highly critical of his own work.

      James’ peer review of Heathers’ article

      James Heathers often mentions the limitations of his research thus “peer-reviewing” his own article to the extent that he admits that this work is “incomplete”, “unsystematic” and “far flung”.

      “This work is too incomplete to support responsible meta-analysis, and research that could more accurately define this figure does not exist yet. ~1 in 7 papers being fake represents an existential threat to the scientific enterprise.”

      “While this is highly unsystematic, it produced a substantially higher figure. Correspondents reliably estimated 1-5% of all papers contain fabricated data, and 2-10% contain falsified results.”

      “These values are too disparate to meta-analyze responsibly, and support only the briefest form of numerical summary: n=12 papers return n=16 individual estimates; these have a median of 13.95%, and 9 out of 16 of these estimates are between 13.4% and 16.9%. Given this, a rough approximation is that for any given corpus of papers, 1 in 7 (i.e. 14.3%) contain errors consistent with faking in at least one identifiable element.”

      “The accumulation of papers collected here is, frankly, haphazard. It does not represent a mature body of literature. The papers use different methods of analyzing figures, data, or other features of scientific publications. They do not distinguish well between papers that have small problematic elements which are fake, or fake in their entirety. They analyze both small and large corpora of papers, which are in different areas of study and in journals of different scientific quality – and this greatly changes base rates;…”

      “As a consequence, it would be prudent to immediately reproduce the result presented here as a formal systematic review. It is possible further figures are available after an exhaustive search, and also that pre registered analytical assumptions would modify the estimations presented.”

      Heathers has also in an interview published in Retraction Watch (Chawla 2024) acknowledged pitfalls in this article such as:

      “Heathers said he decided to conduct his study as a meta-analysis because his figures are “far flung.””

      “They are a little bit from everywhere; it’s wildly nonsystematic as a piece of work,” he said.”

      “Heathers acknowledged those limitations but argued that he had to conduct the analysis with the data that exist. “If we waited for the resources necessary to be able to do really big systematic treatments of a problem like this within a specific area, I think we’d be waiting far too long,” he said. “This is crucially underfunded.””

      Built in opposition to Fanelli 2009, but it’s illogical

      Heathers states in the abstract that his article is “in opposition” to Fanelli’s 2009 PLOS ONE article (Fanelli 2009), yet that opposition is illogical and artificially constructed, since there is no contradiction between 2% of scientists self-reporting having taken part in fabrication or falsification and an eventual much higher proportion of “fake scientific outputs”. Like most of what is wrong with Heathers’ article, this is in fact acknowledged by the author, who notes that the 2% figure “leaves us with no estimate of how much scientific output is fake” (bias in self-reporting, possibility of prolific authors, etc).

      Fanelli 2009 is not cited in the way JH says it is cited

      Whilst the opposition discussed above is illogical, it could be that the 2% figure is mis-cited by others as representing an estimate of fake scientific outputs thus probably underestimating the extent of fraud. Heathers suggests that this may indeed be the case, but also contradicts himself about how (Fanelli 2009), or the 2% figure coming from that publication, is typically used.

      In one sentence, he writes that “the figure is overwhelmingly the salient cited fact in its 1513 citations” and that “this generally appears as some variant of ‘about 2% of scientists admitted to have fabricated, falsified or modified data or results at least once’” (Frank et al. 2023),

      whilst in another sentence he writes about “the typical phraseology used to express it – e.g. ‘the most serious types of misconduct, fabrication and falsification (i.e., data fraud), are relatively rare’” (George 2016).

      Those two sentences cited by Heathers are fundamentally different: the first one accurately reports that the 2% figure relates to individuals self-reporting, whilst the second one appears to relate to the prevalence of misconduct in the literature itself. How Fanelli 2009 is cited in the literature is an empirical question that can be studied by looking at citation contexts beyond the two examples given by Heathers. Given that a central justification for Heathers’ piece appears to be the misuse of this 2% figure, we sought to test whether this was the case.

      A first surprise was that whilst the sentence attributed to (George 2016) can indeed be found in that publication (in the abstract), first it is not in a sentence citing (Fanelli 2009) nor the 2% figure, and, second, it is quoted selectively omitting a part of the sentence that nuances it considerably: “The evidence on prevalence is unreliable and fraught with definitional problems and with study design issues. Nevertheless, the evidence taken as a whole seems to suggest that cases of the most serious types of misconduct, fabrication and falsification (i.e., data fraud), are relatively rare but that other types of questionable research practices are quite common.” (Fanelli 2009) is discussed extensively by (George 2016), and some of the caveats, e.g. on self-reporting, are highlighted.

      To go beyond those two examples, we constructed a comprehensive corpus of citation contexts, defined as the textual environment surrounding a paper's citation, including several words or sentences before and after the citation (see Methods section below). 737 citation contexts could be analysed. Out of those, the vast majority (533, or 72%) did not cite the 2% figure. Instead, they often referred to this article as a general reference together with other articles to make a broad point, or focused on other numbers, in particular those related to questionable research practices (Bordignon, Said, and Levy 2024). The 28% (204) of citation contexts that did mention the 2% figure did so accurately in the majority of cases: 83% (170) of those mentioned that it was self-reporting by scientists, whilst 17% (34) of those, or 5% of the total citation contexts analysed, were either ambiguous or misleading in that they suggested or claimed that the 2% figure related to scientific outputs.
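      The shares quoted above follow directly from the reported counts. A minimal sketch that recomputes them is given below; the variable names and the `pct` helper are illustrative, and only the counts themselves come from the text.

```python
# Counts as reported in the paragraph above.
total_contexts = 737   # citation contexts retained for analysis
without_2pct = 533     # contexts that do not cite the 2% figure
with_2pct = 204        # contexts that do cite the 2% figure
accurate = 170         # 2% figure attributed to self-reporting scientists
misleading = 34        # ambiguous, or 2% figure applied to scientific outputs


def pct(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


shares = {
    "no 2% figure, of all contexts": pct(without_2pct, total_contexts),
    "2% figure, of all contexts": pct(with_2pct, total_contexts),
    "accurate, of 2% contexts": pct(accurate, with_2pct),
    "misleading, of 2% contexts": pct(misleading, with_2pct),
    "misleading, of all contexts": pct(misleading, total_contexts),
}

for label, value in shares.items():
    print(f"{label}: {value}%")
```

      The two levels of classification are exhaustive in this accounting: 533 + 204 = 737 and 170 + 34 = 204.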

      Although the analysis above does not include all citation contexts, it is possible to conclude unambiguously that the 2% figure is not overwhelmingly the salient cited fact in relation to Fanelli 2009, and that when it is cited, it is usually cited accurately, i.e. as representing self-reporting by scientists. Whilst an exhaustive analysis is beyond the scope of this peer review, it is not uncommon to find in this corpus citation contexts that have an alarming tone about the seriousness of the problem of FFPs, e.g. “…a meta-analysis (Fanelli 2009) suggest that the few cases that do surface represent only the tip of a large iceberg.” [DOI: 10.1177/0022034510384627]

      Thus, the rationale for Heathers’ study appears to be misguided. The supposed lack of attention for the very serious problem of FFPs is not due to a minimisation of the situation fueled by a misinterpretation of Fanelli 2009. Importantly, even if that was the case, an attempt to draw attention by claiming that 1 in 7 papers are fake, a claim which according to the author himself is not grounded in solid facts, is not how the scientific literature should be used.

      Methods for the construction of the corpus of citation contexts

      We used Semantic Scholar, an academic database encompassing over 200 million scholarly documents from diverse sources including publishers, data providers, and web crawlers. Using the specific paper identifier for Fanelli's 2009 publication (d9db67acc223c9bd9b8c1d4969dc105409c6dfef), we queried the Semantic Scholar API to retrieve available citation contexts. Citation contexts were extracted from the "contexts" field within the JSON response pages (see technical specifications).

      The query looks like this: semanticscholar.org
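      For readers who want to reproduce this step, the retrieval might be sketched as follows. This is an illustration under stated assumptions, not the reviewers' actual script: it assumes the public Semantic Scholar Graph API citations endpoint and its paged JSON response with a `data` list; the helper names, the extra `title` and `externalIds` fields, and the page size are hypothetical choices.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint root of the Semantic Scholar Graph API.
API_ROOT = "https://api.semanticscholar.org/graph/v1"
# Paper identifier for Fanelli 2009, as given in the Methods text.
FANELLI_2009_ID = "d9db67acc223c9bd9b8c1d4969dc105409c6dfef"


def citations_url(paper_id, offset=0, limit=100):
    """Build the URL for one page of citations, requesting the contexts field."""
    query = urlencode(
        {"fields": "contexts,title,externalIds", "offset": offset, "limit": limit},
        safe=",",
    )
    return f"{API_ROOT}/paper/{paper_id}/citations?{query}"


def extract_contexts(page):
    """Flatten one JSON response page into (citing paper id, context) pairs."""
    rows = []
    for item in page.get("data", []):
        citing = item.get("citingPaper") or {}
        for context in item.get("contexts") or []:
            rows.append((citing.get("paperId"), context))
    return rows


def fetch_page(paper_id, offset=0, limit=100):
    """Download and decode one page (requires network access)."""
    with urlopen(citations_url(paper_id, offset, limit)) as response:
        return json.load(response)
```

      Citing documents for which the API returns an empty `contexts` list produce no rows, which is consistent with the partial coverage described in the next paragraph.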

      The broad coverage of Semantic Scholar does not imply that citation contexts are always retrieved. The Semantic Scholar API provided citation contexts for only 48% of the 1452 documents citing the paper. To get more, we identified open access papers among the remaining 52% of citing papers, retrieved their PDF locations and downloaded the files. We used the Unpaywall API, a database that can be queried with a DOI to obtain open access information about a document. The query looks like this.
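      The open access lookup could be sketched along the following lines. This assumes the Unpaywall v2 REST endpoint, which takes a DOI in the path and a contact email as a query parameter, and the `is_oa` and `best_oa_location.url_for_pdf` fields of its JSON record; the function names and the example email address are hypothetical.

```python
from urllib.parse import quote, urlencode

# Assumed endpoint root of the Unpaywall v2 API.
UNPAYWALL_ROOT = "https://api.unpaywall.org/v2"


def unpaywall_url(doi, email):
    """Build the lookup URL for one DOI; Unpaywall asks for a contact email."""
    return f"{UNPAYWALL_ROOT}/{quote(doi, safe='/')}?{urlencode({'email': email})}"


def pdf_location(record):
    """Return the PDF URL of the best open access copy, or None if there is none."""
    if not record.get("is_oa"):
        return None
    best = record.get("best_oa_location") or {}
    return best.get("url_for_pdf")
```

      A record that is open access but has no direct PDF link returns `None` here, so such documents would need to be handled manually.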

      We downloaded 266 PDF files and converted them to text format using an online bulk PDF-to-text converter. These files were then processed using TXM, a specialized textual analysis tool. We used its concordancer function with "Fanelli" as the pivot term and checked that the reference was the correct one (the 2009 paper in PLOS ONE). We cleaned the results manually and appended the citation contexts to the previous corpus.
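      The concordance step can be illustrated with a small keyword-in-context sketch. This is not TXM and does not reproduce its behaviour; it is a plain regular-expression stand-in, and the window sizes and the year heuristic are assumptions made for the example. Manual checking, as described above, would still be needed.

```python
import re


def concordance(text, pivot="Fanelli", width=200):
    """Return (left, match, right) character windows around each pivot occurrence."""
    hits = []
    for match in re.finditer(re.escape(pivot), text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        hits.append((left, match.group(0), right))
    return hits


def looks_like_2009_paper(right, window=40):
    """Crude filter: does the year 2009 follow the pivot within a short window?"""
    return "2009" in right[:window]


sample = (
    "Misconduct is hard to measure (Fanelli, 2009). "
    "Later work by Fanelli (2013) examined retractions."
)
hits = concordance(sample, width=30)
kept = [hit for hit in hits if looks_like_2009_paper(hit[2])]
```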

      Through this comprehensive methodology, we ultimately identified 824 citation contexts, representing 54% (784) of all documents citing Fanelli's 2009 paper. This corpus comprised 48% of contexts retrieved from Semantic Scholar and an additional 6% obtained through semi-manual extraction from open access documents. 87 of those contexts were excluded from the analysis for a range of reasons including: context too short to conclude, language neither English nor French (shared languages of the authors of this review), duplicate documents (e.g. preprints), etc, leaving us with 737 contexts. They were first classified manually in two categories, those mentioning the 2% figure and those which did not. Then, for the first category, they were further classified manually in two categories depending on whether the figure was appropriately assigned to self-reporting of researchers or rather misleadingly suggesting that the 2% applied to research outputs.

      Contributions

      Investigation: FB collected the citation contexts.
      Data curation and formal analysis: RL and MS
      Writing – review & editing: RL, MS and FB

      References

      Bordignon, Frederique, Maha Said, and Raphael Levy. 2024. “Citation Contexts of [How Many Scientists Fabricate and Falsify Research? A Systematic Review and Meta-Analysis of Survey Data, DOI: 10.1371/Journal.Pone.0005738].” Zenodo. https://doi.org/10.5281/zenodo.14417422.

      Chawla, Dalmeet Singh. 2024. “1 in 7 Scientific Papers Is Fake, Suggests Study That Author Calls ‘Wildly Nonsystematic.’” Retraction Watch (blog). September 24, 2024. https://retractionwatch.com/2024/09/24/1-in-7-scientific-papers-is-fake-suggests-study-that-author-calls-wildly-nonsystematic/.

      Fanelli, Daniele. 2009. “How Many Scientists Fabricate and Falsify Research? A Systematic Review and Meta-Analysis of Survey Data.” PLOS ONE 4 (5): e5738. https://doi.org/10.1371/journal.pone.0005738.

      Frank, Fabrice, Nans Florens, Gideon Meyerowitz-Katz, Jérôme Barriere, Éric Billy, Véronique Saada, Alexander Samuel, Jacques Robert, and Lonni Besançon. 2023. “Raising Concerns on Questionable Ethics Approvals - a Case Study of 456 Trials from the Institut Hospitalo-Universitaire Méditerranée Infection.” Research Integrity and Peer Review 8 (1): 9. https://doi.org/10.1186/s41073-023-00134-4.

      George, Stephen L. 2016. “Research Misconduct and Data Fraud in Clinical Trials: Prevalence and Causal Factors.” International Journal of Clinical Oncology 21 (1): 15–21. https://doi.org/10.1007/s10147-015-0887-3.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, Cho et al. present a comprehensive and multidimensional analysis of glutamine metabolism in the regulation of B cell differentiation and function during immune responses. They further demonstrate how glutamine metabolism interacts with glucose uptake and utilization to modulate key intracellular processes. The manuscript is clearly written, and the experimental approaches are informative and well-executed. The authors provide a detailed mechanistic understanding through the use of both in vivo and in vitro models. The conclusions are well supported by the data, and the findings are novel and impactful. I have only a few, mostly minor, concerns related to data presentation and the rationale for certain experimental choices.

      Detailed Comments:

      (1) In Figure 1b, it is unclear whether total B cells or follicular B cells were used in the assay. Additionally, the in vitro class-switch recombination and plasma cell differentiation experiments were conducted without BCR stimulation, which makes the system appear overly artificial and limits physiological relevance. Although the effects of glutamine concentration on the measured parameters are evident, the results cannot be confidently interpreted as true plasma cell generation or IgG1 class switching under these conditions. The authors should moderate these claims or provide stronger justification for the chosen differentiation strategy. Incorporating a parallel assay with anti-BCR stimulation would improve the rigor and interpretability of these findings. 

      We will edit the manuscript to be more explicit that total splenic B cells were used in this set-up figure and the rest of the paper. In addition, we will try to perform new experiments to improve this "set-up figure" (and add old and new data for Supplemental Figure presentation). Specifically, we will increase the range of conditions tested - e.g., styles of stimulating proliferation and differentiation - to foster an increased sense of generality. We plan to compare mitogenic stimulation with anti-CD40 to  anti-IgM and to anti-IgM + anti-CD40, all with BAFF, IL-4, and IL-5, bearing in mind excellent work from Aiba et al, Immunity 2006; 24: 259-268, and similar papers. We also will try to present some representative flow cytometric profiles (presumably in new Supplemental Figure panels).

      To be transparent and add to a more open public discussion (using the virtues of this forum), the senior author and colleagues would caution about whether any in vitro conditions exist that warrant complete confidence. That is the reason for proceeding to immunization experiments in vivo. That is not said to cast doubt on our own in vitro data - there are some experiments (such as those of Fig. 1a-c and associated Supplemental Fig. 1) that only can be done in vitro or are better done that way (e.g., because of rapid uptake of early apoptotic B cells in vivo).

      For instance: Well-respected papers use the CD40LB and NB21.2D9 systems to activate B cells and generate plasma cells. Those appear to be BCR-independent and, unfortunately, we found that they cannot be used with amino acid deprivation or these inhibitors due to effects on the engineered stroma-like cells. In considering BCR engagement, Reth has published salient points about signaling and concentrations of the Ab, the upshot being that this means of activating mitogenesis and plasma cell differentiation (when the B cells are costimulated via CD40 or TLR (4 or 7/8)) is probably more than a bit artificial. Moreover, although Aiba et al, Immunity 2006; 24: 259-268 is a laudable exception, one rarely finds papers using BAFF in the cultures, despite the strong evidence that it is an essential part of the equation of B cell regulation in vivo and a cytokine that modulates BCR signaling.

      (2) In Figure 1c, the DMK alone condition is not presented. This hinders readers' ability to properly assess the glutaminolysis dependency of the cells for the measured readouts. Also, CD138+ in developing PCs goes hand in hand with decreased B220 expression. A representative FACS plot showing the gating strategy for the in vitro PCs should be added as a supplementary figure. Similarly, division number (going all the way to #7) may be tricky to gate and interpret. A representative FACS plot showing the separation of B cells according to their division numbers and a subsequent gating of CD138 or IgG1 in these gates would be ideal for demonstrating the authors' ability to distinguish these populations effectively.

      We agree that exact placement of divisions by FlowJo deconvolution is more fraught than might be thought for presentations in many or most papers. For the revision, we will try to add one or several representative FACS plot(s) with old and new data to provide the gating on CTV fluorescence, bearing these points in mind when extending the experiments from ~7 years ago (Fig. 1b, c). With the representative examples of the old data pasted in here, we will aver, however, that using divisions 0-6, and ≥7 was reasonable.

      Ditto for DMK with normal glutamine. However, in the spirit of eLife transparency lacking in many other journals, this comparison is more fraught than the referee comment would make things seem. The concentration tolerated by cells is highly dependent on the medium and glutamine concentration, and perhaps on rates of glutaminolysis (due to its generation of ammonia). In practice, we find that DMK becomes more toxic to B cells unless glutamine is low or glutaminolysis is restricted. Thus, the concentration of DMK that is tolerated and used in Fig. 1b, c can become toxic to the B cells when using the higher levels of glutamine in typical culture media (2 mM or more) - at which point the "normal conditions + DMK" "control" involves the surviving cells in conditions with far greater cell death and less population expansion than the "low glutamine + DMK" condition. Overall, we appreciate the suggestion to show more DMK data and will work to do so for the earlier proliferation data (shown above) and the new experiments.

      Author response image 1.

       

      (3) A brief explanation should be provided for the exclusive use of IgG1 as the readout in class-switching assays, given that naïve B cells are capable of switching to multiple isotypes. Clarifying why IgG1 was preferentially selected would aid in the interpretation of the results.

      We will edit the text to be more explicit and harmonize in light of the referee's suggestion that we focus the presentation of serologic data on IgG1 in the immunization experiments.

      [IgG1 provides the strongest signal and hence better signal/noise both in vitro and with the alum-based immunizations that are avatars for the adjuvant used in the majority of protein-based vaccines for humans.]

      (4) The immunization experiments presented in Figures 1 and 2 are well designed, and the data are comprehensively presented. However, to prevent potential misinterpretation, it should be clarified that the observed differences between NP and OVA immunizations cannot be attributed solely to the chemical nature of the antigens - hapten versus protein. A more significant distinction lies in the route of administration (intraperitoneal vs. intranasal) and the resulting anatomical compartment of the immune response (systemic vs. lung-restricted). This context should be explicitly stated to avoid overinterpretation of the comparative findings.

      We agree with the referee and will edit the text accordingly. Certainly, the difference in how the anti-ova response is elicited compared to the anti-NP response in the same mice, or with a somewhat different immunization regimen, might be another factor - or the major factor - that could contribute towards explaining why glutaminolysis was important after ovalbumin inhalations (used because emergence of anti-ova Ab / ASCs is suppressed by the NP hapten after NP-ova immunization) but not needed for the anti-NP response unless Slc2a1 or Mpc2 also was inactivated. Thank you for prompting addition of this caveat.

      Nevertheless, it seems fair to note that in Figures 1 and 2, the ASCs and Ab are being analyzed for NP and ova in the same mice, albeit with the NP-specific components not being driven by the inhalations of ovalbumin. With that in mind, when one compares the IgG1 anti-NP ASC and Ab to those for IgG1 anti-ovalbumin (ASC in bone marrow; Ab), the ovalbumin-specific response was reduced whereas the anti-NP response was not.

      (5) NP immunization is known to be an inducer of an IgG1-dominant Th2-type immune response in mice. IgG2c is not a major player unless a nanoparticle delivery system is used. However, the authors arbitrarily included IgG2c in their assays in Figures 2 and 3. This may be confusing for the readers. The authors should either justify the IgG2c-mediated analyses or remove them from the main figures. (It can be added as supplemental information with proper justification). 

      We will rearrange the Figure panels to move the IgM and IgG2c data to Supplemental Figures.

      For purposes of public discourse, we note that the data of previous Figure 3(c, g) show a very strong NP-specific IgG2c response that seems to contradict the concept that IgG2c responses necessarily are weak in this setting, and the important role of IgG2c (mouse - IgG1 in humans) in controlling or clearing various pathogens as well as in autoimmunity. So from the standpoint of providing a better sense of generality to the loss-of-function effects, we continue to think that these measurements are quite important. That said, the main text has many figure panels and as the review notes, the class switching and in vitro ASC generation were done with IL-4 / IgG1-promoting conditions. If possible, we will try to assay in vitro class switching with IFN-γ rather than IL-4 but there may not be enough resources (time before lab closure; money).

      [As a collegial aside, we speculate that a greater or lesser IgG2c anti-NP response may arise due to different preparations of NP-carrier obtained from the vendor (Biosearch) having different amounts of TLR (e.g., TLR4) ligand. In any case, the points of presenting the IgG2c (and IgM) data were to push against the limiting boundaries of convention (which risks perpetuating a narrow view of potential outcomes) and make the breadth of results more apparent to readers.]

      (6) Similarly, in affinity maturation analyses, including IgM is somewhat uncommon. I do not see any point in showing high affinity (NP2/NP20) IgMs (Figure 3d), since that data probably does not mean much.

      As noted in the reply immediately preceding this one, we appreciate this suggestion from the reviewer and will move the IgM and IgG2c to Supplemental status.

      Nonetheless, in collegial discourse we disagree a bit with the referee in light of our data as well as of work that (to our minds) leads one to question why inclusion of affinity maturation of IgM is so uncommon - as the referee accurately notes. Of course a defect in the capacity to class-switch is highly deleterious in patients but that is not the same as concluding that recall IgM or its affinity is of little consequence.

      In some of the pioneering work back in the 1980s, Bothwell showed that NP-carrier immunization generated hybridomas producing IgM Ab with extensive SHM (~11% of the 18 lineages; ~1/3 of the IgM hybridomas) [PMID: 8487778]. IgM B cells appear to move into GC, and there is at least a reasonable published basis for the view that there are GC-derived IgM (unswitched) memory B cells (MBC) that would be more likely, upon recall activation, to differentiate into ASCs. [As an example, albeit with the Jenkins lab anti-rPE response, Taylor, Pape, and Jenkins generated quantitative estimates of the numbers of Ag-specific IgM+ vs switched MBC that were GC-derived (or not) [PMID: 22370719]. While they emphasized that ~90% of IgM+ MBC appeared to be GC-independent, their data also indicated that ~1/2 of all GC-derived MBC were IgM+ rather than switched (their Fig. 8, B vs C; also 8E, which includes alum-PE).] And while we immensely respect the referee, we are perhaps less confident that IgM or high-affinity Ag-specific IgM doesn't mean that much, if only because of evidence that localized Ab compete for Ag and may thus influence selective processes [PMCID: PMC2747358; PMID: 15953185; PMID: 23420879; PMID: 27270306].

      (7) Following on my comment for the PC generation in Figure 1 (see above), in Figure 4, a strategy that relies solely on CD40L stimulation is performed. This is highly artificial for the PC generation and needs to be justified, or more physiologically relevant PC generation strategies involving anti-BCR, CD40L, and various cytokines should be shown. 

      In line with our response to point (1), we plan to test, and will try to self-fund, BCR-stimulated B cells (comparing anti-CD40 to anti-IgM and to anti-IgM + anti-CD40, all with BAFF, IL-4, and IL-5).

      (8) The effects of CB839 and UK5099 on cell viability are not shown. Including viability data under these treatment conditions would be a valuable addition to the supplementary materials, as it would help readers more accurately interpret the functional outcomes observed in the study. 

      We will add to the supplemental figures to present data that provide cues as to relative viability / survival under the experimental conditions used. [FSC X SSC as well as 7AAD or Ghost dye panels; we also hope to generate new data that include further experiments scoring annexin V staining.]

      (9) It is not clear how the RNA seq analysis in Figure 4h was generated. The experimental strategy and the setup need to be better explained.

      The revised manuscript will include more information (at minimum in the Methods, Legend), and we apologize that in this and a few other instances sufficiency of detail was sacrificed on the altar of brevity.

      [Adding a brief synopsis for any reader before the final version of record, given the many months it will take to generate new data, thoroughly revise the manuscript, etc:

      In three temporally and biologically independent experiments, cultures were harvested 3.5 days after splenic B cells were purified and cultured as in the experiments of Fig. 4a-e. Total cellular RNA prepared from the twelve samples (three replicates for each of four conditions - DMSO vehicle control, CB839, UK5099, and CB839 + UK5099) was analyzed by RNA-seq. The RNA-seq data were initially processed using the pipeline described in the Methods. For panels g & h of Fig. 4, DESeq2 was used to quantify and compare read counts in the three CB839 + UK5099 samples relative to the three independent vehicle controls and identify all genes for which variances yielded P<0.05. In Fig. 4g, all such genes for which the difference was 'statistically significant' (i.e., P<0.05) were entered into the Immgen tool and thereby mapped to the B lineage subsets shown in the figure panels (i.e., g, h). In (g), these are displayed using one format, whereas (h) uses the 'heatmap' tool in MyGeneSet.]

      Reviewer #2 (Public review): 

      Summary: 

      In this manuscript, the authors investigate the functional requirements for glutamine and glutaminolysis in antibody responses. The authors first demonstrate that the concentrations of glutamine in lymph nodes are substantially lower than in plasma, and that at these levels, glutamine is limiting for plasma cell differentiation in vitro. The authors go on to use genetic mouse models in which B cells are deficient in glutaminase 1 (Gls), the glucose transporter Slc2a1, and/or mitochondrial pyruvate carrier 2 (Mpc2) to test the importance of these pathways in vivo. 

      Interestingly, deficiency of Gls alone showed clear antibody defects when ovalbumin was used as the immunogen, but not the hapten NP. For the latter response, defects in antibody titers and affinity were observed only when both Gls and either Mpc2 or Slc2a1 were deleted. These latter findings form the basis of the synthetic auxotrophy conclusion. The authors go on to test these conclusions further using in vitro differentiations, Seahorse assays, pharmacological inhibitors, and targeted quantification of specific metabolites and amino acids. Finally, the authors document reduced STAT3 and STAT1 phosphorylation in response to IL-21 and interferon (both type 1 and 2), respectively, when both glutaminolysis and mitochondrial pyruvate metabolism are prevented. 

      Strengths:

      (1) The main strength of the manuscript is the overall breadth of experiments performed. Orthogonal experiments are performed using genetic models, pharmacological inhibitors, in vitro assays, and in vivo experiments to support the claims. Multiple antigens are used as test immunogens--this is particularly important given the differing results. 

      (2) B cell metabolism is an area of interest but understudied relative to other cell types in the immune system. 

      (3) The importance of metabolic flexibility and caution when interpreting negative results is made clear from this study.

      Weaknesses:

      (1) All of the in vivo studies were done in the context of boosters at 3 weeks and recall responses 1 week later. This makes specific results difficult to interpret. Primary responses, including germinal centers, are still ongoing at 3 weeks after the initial immunization. Thus, untangling what proportion of the defects are due to problems in the primary vs. memory response is difficult.

      (2) Along these lines, the defects shown in Figure 3h-i may not be due to the authors' interpretation that Gls and Mpc2 are required for efficient plasma cell differentiation from memory B cells. This interpretation would only be correct if the absence of Gls/Mpc2 leads to preferential recruitment of low-affinity memory B cells into secondary plasma cells. The more likely interpretation is that ongoing primary germinal centers are negatively impacted by Gls and Mpc2 deficiency, and this, in turn, leads to reduced affinities of serum antibodies

      We provisionally plan to edit the wording of the conclusion a bit to add a possibility we consider unlikely to avoid a conclusion that MBCs bearing switched BCRs are affected once reactivated. We also will perform a new experiment to investigate, but unfortunately time before lab closure has been and remains our enemy both for performance and multiple replication of the work presented in Figure 3, panels h & i, and the related Supplemental Data (Supplemental Fig. 3a-j). Unfortunately, it will not be possible to do a memory experiment with recall immunization out at 8 weeks.  Despite the grant funding running out and institutional belt-tightening, however, we'll try to perform a new head-to-head comparison of 4 wk post-immunization with and without the boost at three weeks.

      The intriguing concern (points 1 & 2) provides a springboard for consideration of generalizations and simplifications. Germinal center durability is not at all monolithic, and instead is quite variable**. The premise (cognitive bias, perhaps?) in the interpretation is that in our previous work we find few if any GC B cells - NP-APC-binding or otherwise - above the background (non-immunized controls) three weeks after immunization with NP-ovalbumin in alum. Recognizing that it is not NP-carrier in alum as immunizations, we note for the readers and referee that Fig. 1 of the Taylor, Pape, & Jenkins paper considered above [PMID: 22370719] reported 10-fold more Ag-specific MBCs than GC B cells at day 29 post-immunization (the point at which the boost / recall challenge was performed in our Figure 3h, i).

      Viewed from that perspective, the surmise of the comment is that a major contribution to the differences in both all-affinity and high-affinity anti-NP IgG1 shown in Fig. 3i derives from the immunization at 4 wk stimulating GC B cells we cannot find as opposed to memory B cells. However, it is true that in the literature (especially with the experimentally different approach of transferring BCR-transgenic / knock-in versions of an NP-biased BCR) there may be meaningful pools of IgG1 and IgG2c GC B cells. Alternatively, our current reagents for immunizations may have become better at maintaining GC than those in the past - which we will try to test.

      The issue and question also relate to rates of output of plasma cells or rises in the serum concentrations of class-switched Ab. To this point, our prior experiences agree with the long-published data of the Kurosaki lab in Figure 3c of the Aiba et al paper noted above (Immunity, 2006) (and other such time courses). Readers can note that the IgG1 anti-NP response (alum adjuvant, as in our work) hits its plateau at 2 wk, and did not increase further from 2 to 3 wk. In other words, GC are on the decline and Ab production has reached its plateau by the time of the 2nd immunization (Fig. 3h).

      Assuming we understand the comment and line of reasoning correctly, we also lean towards disagreeing with the statement "This interpretation would only be correct if the absence of Gls/Mpc2 leads to preferential recruitment of low-affinity memory B cells into secondary plasma cells." Our evidence shows that both low-affinity as well as high-affinity anti-NP Ab (IgG1) went down as a result of combined gene-inactivation after the peak primary response (Fig. 3i). Recent papers show that affinity maturation is attributable to greater proliferation of plasmablasts with high-affinity BCR. Accordingly, the findings with loss of GLS and MPC function are quite consistent with the interpretation that much of the response after the second immunization draws on MBC differentiation into plasmablasts and then plasma cells, where the proliferative advantage of high-affinity cells is blunted by the impaired metabolism. The provisional plan, however, is to note the alternative, if less likely, interpretation proposed by the review.

      ** In some contexts, of course, especially certain viral infections or vaccination with lipid nanoparticles carrying modified mRNA, germinal centers are far more persistent; also, in humans even the seasonal flu vaccine **

      (3) The gating strategies for germinal centers and memory B cells in Supplemental Figure 2 are problematic, especially given that these data are used to claim only modest and/or statistically insignificant differences in these populations when Gls and Mpc2 are ablated. Neither strategy shows distinct flow cytometric populations, and it does not seem that the quantification focuses on antigen-specific cells.

      We will enhance these aspects of the presentation, using old and hopefully new data, but note for readers that many other papers in the best journals show plots in which the separation of, say, GC-Tfh from overall Tfh is based on a cut-off within what essentially is a continuous spectrum of emission as adjusted or compensated by the cytometer (spectral or conventional).

      Perhaps incorrectly, we omitted presenting data that included the results with NP-APC-staining - in part because within the GC B cell gate the frequencies of NP-binding events (GCB cells) were similar in double-knockout samples and controls. In practice, that would mean that the metabolic requirement applied about equally to NP+ and the total population. We will try to rectify this point in the revision.

      (4) Along these lines, the conclusions in Figure 6a-d may need to be tempered if the analysis was done on polyclonal, rather than antigen-specific cells. Alum induces a heavily type 2-biased response and is not known to induce much of an interferon signature. The authors' observations might be explained by the inclusion of other ongoing GCs unrelated to the immunization. 

      We will make sure the text is clear that the in vitro experiments do not represent GC B cells and that the RNA-seq data were not an Ag (SRBC)-specific subset.

      We also will try to work in a schematic along with expanding the Legends to make it more readily clear that the RNA-seq data (and hence the GSEA) involved immunizations with SRBC (not the alum / NP system which - it may be noted - in these experiments actually generated a robust IgG2c (type 1-driven) response along with the type 2-enhanced IgG1 response).

      Reviewer #3 (Public review): 

      Summary: 

      In their manuscript, the authors investigate how glutaminolysis (GLS) and mitochondrial pyruvate import (MPC2) jointly shape B cell fate and the humoral immune response. Using inducible knockout systems and metabolic inhibitors, they uncover a "synthetic auxotrophy": When GLS activity/glutaminolysis is lost together with either GLUT1-mediated glucose uptake or MPC2, B cells fail to upregulate mitochondrial respiration, IL 21/STAT3 and IFN/STAT1 signaling is impaired, and the plasma cell output and antigen-specific antibody titers drop significantly. This work thus demonstrates the promotion of plasma cell differentiation and cytokine signaling through parallel activation of two metabolic pathways. The dataset is technically comprehensive and conceptually novel, but some aspects leave the in vivo and translational significance uncertain.

      Strengths:

      (1) Conceptual novelty: the study goes beyond single-enzyme deletions to reveal conditional metabolic vulnerabilities and fate-deciding mechanisms in B cells.

      (2) Mechanistic depth: the study uncovers a novel "metabolic bottleneck" that impairs mitochondrial respiration and elevates ROS, and directly ties these changes to cytokine-receptor signaling. This is both mechanistically compelling and potentially clinically relevant.

      (3) Breadth of models and methods: inducible genetics, pharmacology, metabolomics, seahorse assay, ELISpot/ELISA, RNA-seq, two immunization models.

      (4) Potential clinical angle: the synergy of CB839 with UK5099 and/or hydroxychloroquine hints at a druggable pathway targeting autoantibody-driven diseases.

      We agree and thank the referee for the positive comments and this succinct summary of what we view as contributions of the paper.

      Weaknesses: 

      (1) Physiological relevance of "synthetic auxotrophy"

      The manuscript demonstrates that GLS loss is only crippling when glucose influx or mitochondrial pyruvate import is concurrently reduced, which the authors name "synthetic auxotrophy". I think it would help readers to clarify the terminology more and add a concise definition of "synthetic auxotrophy" versus "synthetic lethality" early in the manuscript and justify its relevance for B cells.

      We will edit the Abstract, Introduction, and Discussion to try to do better on this score. Conscious of how expansive the prose and data are even in the original submission, we appear to have taken some shortcuts that we will try to rectify. Thank you for highlighting this need to improve on a key concept!

      That said, we punctiliously & perhaps pedantically encourage readers to be completely accurate, in that under one condition of immunization GLS loss substantially reduced the anti-ovalbumin response (Fig. 1, Fig. 2a-c). And for this provisional response, we will expand a bit on the notion that synthetic auxotrophy represents effects on differentiation that appear to go beyond and not simply to be selective death, even though decreased population expansion is observed and one cannot exclude some contribution of enhanced death in vivo. Finally, we will note that this comment of the review raises interesting semantic questions about what represents "physiological relevance" but leave it at that.

      While the overall findings, especially the subset specificity and the clinical implications, are generally interesting, the "synthetic auxotrophy" condition feels a little engineered.

      One can readily say that CAR-T cells are 'a little engineered' so it is a matter of balancing this perspective of the referee against the strengths they highlight in points 1, 2, and 4. In any case, we will probably try to expand and be more explicit in the Discussion of the revised manuscript.

      In brief, even were the money not all gone, we would not believe that expanding the heft of this already rather large manuscript and set of data would be appropriate. As matters stand, a basic new insight about metabolic flexibility and its limits leads to evidence of a way to reduce generation of Ab and a novel impairment of STAT transcription factor induction by several cytokine receptors. The vulnerability that could be tested in later work on B cell-dependent autoimmunity includes the capacity to test a compound that already has been to or through FDA phase II in patients together with an FDA-approved standard-of-care agent.

      Put a different way, the point is that a basic curiosity to understand why decreasing glucose influx did not have an even more profound effect than what was observed, combined with curiosity as to why glutaminolysis was dispensable in relatively standard vaccine-like models of immunize / boost, provided a springboard to identification of new vulnerabilities. As above, we appreciate being made aware that this point merits being made more explicit in the Discussion of the edited version.

      Therefore, the findings strongly raise the question of the likelihood of such a "double hit" in vivo and whether there are conditions, disease states, or drug regimens that would realistically generate such a "bottleneck".

      Hence, the authors should document or at least discuss whether GC or inflamed niches naturally show simultaneous downregulation/lack of glutamine and/or pyruvate. The authors should also aim to provide evidence that infections (e.g., influenza), hypoxia, treatments (e.g., rapamycin), or inflammatory diseases like lupus co-limit these pathways. 

      Again, we appreciate some 'licensing' to be more expansive and explicit, and will try to balance editing in such points against undue tedium or tendentiously speculative length in the Discussion. In particular, we will note that a clear, simple implication of the work is to highlight an imperative to test CB839 in lupus patients already on hydroxychloroquine as standard-of-care, and to suggest development of UK5099 (already tested many times in mouse models of cancer) to complement glutaminase inhibition. 

      As backdrop, we note that the failure to advance imaging mass spectrometry to the capacity to quantify relative or absolute (via nano-DESI) concentrations of nutrients in localized interstitia is a critical gap in the entire field. Techniques that sample the interstitial fluid of tumor masses or in our case LN as a work-around have yielded evidence that there can be meaningful limitations of glucose and glutamine, but it needs to be acknowledged that such findings may be very model-specific and, as can be the case with cutting-edge science, are not without controversy. That said, yes, we had found that hypoxia reduced glutamine uptake but given the norms of focused, tidy packages only reported on leucine in an earlier paper [PMID27501247; PMCID5161594].

      It would hence also be beneficial to test the CB839 + UK5099/HCQ combinations in a short, proof-of-concept treatment in vivo, e.g., shortly before and after the booster immunization or in an autoimmune model. Likewise, it may also be insightful to discuss potential effects of existing treatments (especially CB839, HCQ) on human memory B cell or PC pools.

      We certainly agree that the suggestions offered in this comment are important next steps and the right approach to test if the findings reported here translate toward the treatment of autoimmune diseases that involve B cells, interferons, and pathophysiology mediated by auto-Ab. As practical points, performance and replication of such studies would take more time than the year allotted for return of a revised manuscript to eLife and in any case neither funds nor a lab remain to do these important studies. 

      Concrete evidence for our concurrence was embodied in a grant application to NIH that was essential for keeping a lab and doing any such studies. [We note, as a suggestion to others, that an essential component of such studies would be to test the effects of these compounds on B cells from patients and mice with autoimmunity]. Perhaps unfortunately for SLE patients, the review panelists did not agree about the importance of such studies. However, it can be hoped that the patent-holder of CB839 (and perhaps other companies developing glutaminase inhibitors) will see this peer-reviewed pre-print and the public dialogue, and recognize how positive results might open a valuable contribution to mitigation of diseases such as SLE.

      (2) Cell survival versus differentiation phenotype

      Claims that the phenotypes (e.g., reduced PC numbers) are "independent of death" and are not merely the result of artificial cell stress would benefit from Annexin-V/active-caspase 3 analyses of GC B cells and plasmablasts. Please also show viability curves for inhibitor-treated cells.

      This comment leads us to see that the wording on this point may have been overly terse in the interests of brevity, and thereby open to some misunderstanding. Accordingly, we will expand out the text of the Abstract and elsewhere in the manuscript, to be more clear. In addition, we will add in some data on the point, hopefully including some results of new experiments.

      To clarify in this public context, it is not that an increase in death (along with the reported decrease in cell cycling) can be or is excluded - and in fact it likely exists in vitro. The point is that beyond any such increase, and taking into account division number (since there is evidence that PC differentiation and output numbers involve a 'division-counting' mechanism), the frequencies of CD138+ cells and of ASCs among the viable cells are lower, as is the level of Prdm1-encoded mRNA even before the big increase in CD138+ cells in the population. 

      (3) Subset specificity of the metabolic phenotype

      Could the metabolic differences, mitochondrial ROS, and membrane-potential changes shown for activated pan-B cells (Figure 5) also be demonstrated ex vivo for KO mouse-derived GC B cells and plasma cells? This would also be insightful to investigate following NP-immunization (e.g., NP+ GC B cells 10 days after NP-OVA immunization).

      We agree that such data could be nice and add to the comprehensiveness of the work. We will try to scrounge the resources (time; money; human) to test this roughly as indicated. That said, we would note that the frequencies and hence numbers of NP+ GC B cells are so low that even in the flow cytometer we suspect there will not be enough "events" to rely on the results with DCFDA in the tiny sub-sub-subset. It also bears noting that reliable flow cytometric identification of the small NP-specific plasmablast/plasma cell subset amidst the overall population, little of which arose from immunization or after deletion of the floxed segments in B cells, would potentially be misleading.

      (4) Memory B cell gating strategy

      I am not fully convinced that the memory-B-cell gate in Supplementary Figure 2d is appropriate. The legend implies the population is defined simply as CD19+GL7-CD38+ (or CD19+CD38++?), with no further restriction to NP-binding cells. Such a gate could also capture naïve or recently activated B cells. From the descriptions in the figure and the figure legend, it is hard to verify that the events plotted truly represent memory B cells. Please clarify the full gating hierarchy and, ideally, restrict the MBC gate to NP+CD19+GL7-CD38+ B cells (or add additional markers such as CD80 and CD273). Generally, the manuscript would benefit from a more transparent presentation of gating strategies.

      We will further expand the supplemental data displays to include more of the gating and analytic scheme, and hope to be able to have performed new experiments and analyses (including additional markers) that could mitigate the concern noted here. In addition, we will include flow data from the non-immunized control mice that had been analyzed concurrently in the experiments illustrated in this Figure.

      Although it should be noted that the labeling indicated that the gating included the important criterion that cells be IgD- (Supplemental Fig. 2b), which excludes the vast majority of naive B cells, in principle marginal zone (MZ) B cells might fall within this gate. However, the MZ B population is unlikely to explain the differences shown in Supplemental Fig. 2b-d.

      (5) Deletion efficiency - [The] mRNA data show residual GLS/MPC2 transcripts (Supplementary Figure 8). Please quantify deletion efficiency in GC B cells and plasmablasts.

      Even were there resources to do this, the degree of reduction in target mRNA (Gls; Mpc2) renders this question superfluous.

      Are there likely to be some cells with only one, or even neither, allele converted from fl to D? Yes, but they would be a minor subset in light of the magnitude of mRNA reduction, in contrast to our published observations with Slc2a1. As to plasmablasts and plasma cells, the pre-existing populations make such an analysis misleading, while the scarcity of such cells recoverable with antigen capture techniques is so low as to make both RNA and genomic DNA analyses questionable.

    1. Author response:

      The following is the authors’ response to the original reviews

      eLife Assessment

      This valuable study revisits the effects of substitution model selection on phylogenetics by comparing reversible and non-reversible DNA substitution models. The authors provide evidence that 1) non time-reversible models sometimes perform better than general time-reversible models when inferring phylogenetic trees out of simulated viral genome sequence data sets, and that 2) non time-reversible models can fit the real data better than the reversible substitution models commonly used in phylogenetics, a finding consistent with previous work. However, the methods are incomplete in supporting the main conclusion of the manuscript, that is that non time-reversible models should be incorporated in the model selection process for these data sets.

      The non-reversible models should be incorporated in the model selection process not because they perform significantly better, but only because they do not perform worse than the reversible models and because the true biochemical processes of nucleotide substitution do support non-reversibility.

      Reviewer #1 (Public Review):

      The study by Sianga-Mete et al revisits the effects of substitution model selection on phylogenetics by comparing reversible and non-reversible DNA substitution models. This topic is not new; previous work has already shown that non-reversible, and also covarion, substitution models can fit real data better than the reversible substitution models commonly used in phylogenetics. In this regard, the results of the present study are not surprising. Specific comments are shown below.

      True

      It is well known that non-reversible models can fit the real data better than the commonly used reversible substitution models, see for example,

      https://academic.oup.com/sysbio/article/71/5/1110/6525257

      https://onlinelibrary.wiley.com/doi/10.1111/jeb.14147?af=R

      The manuscript indicates that the results (better fitting of non-reversible models compared to reversible models) are surprising but I do not think so, I think the results would be surprising if the reversible models provide a better fitting.

      I think the introduction of the manuscript should be increased with more information about non-reversible models and the diverse previous studies that already evaluated them. Also I think the manuscript should indicate that the results are not surprising, or more clearly justify why they are surprising.

      The surprise in the findings is in NREV12 performing better than NREV6 for double stranded DNA viruses as it was expected that NREV6 would perform better given the biochemical processes discussed in the introduction.

      In the introduction and/or discussion I missed a discussion about the recent works on the influence of substitution model selection on phylogenetic tree reconstruction. Some works indicated that substitution model selection is not necessary for phylogenetic tree reconstruction,

      https://academic.oup.com/mbe/article/37/7/2110/5810088

      https://www.nature.com/articles/s41467-019-08822-w

      https://academic.oup.com/mbe/article/35/9/2307/5040133

      While others indicated that substitution model selection is recommended for phylogenetic tree reconstruction,

      https://www.sciencedirect.com/science/article/pii/S0378111923001774

      https://academic.oup.com/sysbio/article/53/2/278/1690801

      https://academic.oup.com/mbe/article/33/1/255/2579471

      The results of the present study seem to support this second view. I think this study could be improved by providing a discussion about this aspect, including the specific contribution of this study to that.

      In our conclusion we have stated that:

      The lack of available data regarding the proportions of viral life cycles during which genomes exist in single and double stranded states makes it difficult to rationally predict the situations where the use of models such as GTR, NREV6 and NREV12 might be most justified: particularly in light of the poor overall performance of NREV6 and GTR relative to NREV12 with respect to describing mutational processes in viral genome sequence datasets. We therefore recommend case-by-case assessments of NREV12 vs NREV6 vs GTR model fit when deciding whether it is appropriate to consider the application of non-reversible models for phylogenetic inference and/or phylogenetic model-based analyses such as those intended to test for evidence of natural selection or the existence of molecular clocks.

      The real data were downloaded from the Los Alamos HIV database. I am wondering if there was any criterion for selecting the sequences or if all the sequences in the database for every studied virus category were analysed. Also, was any quality filter applied? How were gaps and ambiguous nucleotides handled? Note that these aspects could affect the fit of the models to the data.

      We selected a varying number of sequences from the database for each virus type studied. We performed quality filtering by re-aligning the sequences for each virus type using the software AliView.

      How the non-reversible model and the data are compared considering the non-reversible substitution process? In particular, given an input MSA, how to know if the nucleotide substitution goes from state x to state y or from state y to state x in the real data if there is not a reference (i.e., wild type) sequence? All the sequences are mutants and one may not have a reference to identify the direction of the mutation, which is required for the non-reversible model. Maybe one could consider that the most abundant state is the wild type state but that may not be the case in reality. I think this is a main problem for the practical application of non-reversible substitution models in phylogenetics.

      True

      Reviewer #1 (Recommendations for the authors):

      The reversible and non-reversible models used in this study assume that all the sites evolve under the same substitution matrix, which can be unrealistic. This aspect could be mentioned.

      Done

      The manuscript indicates that "a phylogenetic tree was inferred from an alignment of real sequences (Avian Leukosis virus) with an average sequence identity (API) of ~90%.". I was wondering under which substitution model that phylogenetic tree reconstruction was performed. Could the use of that model bias posterior results in terms of favoring results based on such a model?

      We have stated that the GTR+G model was used to reconstruct the tree. The use of the GTR+G model could indeed bias the posterior results, as we have also stated in the paper.

      I was wondering which specific R function was used to calculate the weighted Robinson-Foulds metric. I think this should be included in the manuscript.

      We stated that we used the weighted Robinson-Foulds metric (wRF), as implemented in the R phangorn package (Schliep, 2011).
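
      For readers unfamiliar with the metric, the following is a minimal Python sketch of one common definition of the weighted Robinson-Foulds distance: the sum, over every split found in either tree, of the absolute difference in branch length, with a split absent from a tree counted as length 0. It is not the phangorn implementation. The four-taxon trees, the branch lengths, and the helper names (`canon`, `weighted_rf`, `unweighted_rf`) are all hypothetical and chosen only for illustration.

```python
# Each tree is represented as a dict mapping a canonical split to its branch length.
TAXA = frozenset("ABCD")  # hypothetical taxon set


def canon(side):
    """Return a canonical form of a split: the side that excludes the reference taxon."""
    side = frozenset(side)
    ref = min(TAXA)
    return TAXA - side if ref in side else side


def weighted_rf(splits_a, splits_b):
    """Sum of absolute branch-length differences over the union of splits."""
    all_splits = set(splits_a) | set(splits_b)
    return sum(abs(splits_a.get(s, 0.0) - splits_b.get(s, 0.0)) for s in all_splits)


def unweighted_rf(splits_a, splits_b):
    """Number of splits present in exactly one of the two trees."""
    return len(set(splits_a) ^ set(splits_b))


# Tree 1: ((A,B),(C,D)); Tree 2: ((A,C),(B,D)). Branch lengths are made up.
tree1 = {canon("A"): 0.1, canon("B"): 0.2, canon("C"): 0.1, canon("D"): 0.3,
         canon("AB"): 0.05}
tree2 = {canon("A"): 0.1, canon("B"): 0.2, canon("C"): 0.2, canon("D"): 0.3,
         canon("AC"): 0.04}

print("wRF:", round(weighted_rf(tree1, tree2), 6))
print("RF:", unweighted_rf(tree1, tree2))
```

      Unlike the unweighted metric, the weighted version is sensitive to branch-length differences even on splits that both trees share, which is why branch-length estimation quality matters for wRF but not for RF.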

      Although they were a minority, several datasets fitted better with a reversible model than with a non-reversible model. I think that should be clearly indicated. In addition, in my opinion the AIC does not sufficiently penalize the number of parameters of the models and favors the non-reversible models over the reversible models, but this is only my opinion based on the definition of AIC and it is not supported. Thus, I think the comparison between phylogenetic trees reconstructed under different substitution models was a good idea (but see also my second major comment).

      Noted

      When comparing phylogenetic trees I was wondering if one should consider the effect of the estimation method and quality of the studied data? For example, should bootstrap values be estimated for all the ancestral nodes and only ancestral nodes with high support be evaluated in the comparison among trees?

      Yes, the estimation method and quality of the studied data should be considered. This does not matter for the unweighted RF metric, but it does for wRF. When building the trees using RAxML, only high-support nodes are added to the tree.

      In Figure 3, I do not see (by eye) significant differences among the models. I see in the legend that the statistical evaluation was based on a t test but I am not much convinced. Maybe it is only my view. Exactly, which pairs of datasets are evaluated with the t test? Next, I would expect that the influence of the substitution model on the phylogenetic tree reconstruction is higher at large levels of nucleotide diversity because with more substitution events there is more information to see the effects of the model. However, the t test seems to show that differences are only at low levels of nucleotide diversity (and large DNR), what could be the cause of this?

      The paired t-tests compare the wRF distances between the true tree and the trees obtained using the GTR model versus the wRF distances between the true tree and the trees obtained using the NREV12 model.

      The reason why the influence of the NREV12 model on the tree reconstructed is not significantly higher at large levels of nucleotide diversity could be because at a certain level the DNR are simply unrealistic.

      Can the user perform substitution model selection (i.e., AIC) among reversible and non-reversible substitution models with IQTREE? If yes, then doing that should be the recommendation from this study, correct?

      But, can DNR be estimated from a real dataset? DNR seems to be the key factor (Figure 3) for the phylogenetic analysis under a proper model.

      Substitution model selection among reversible and non-reversible models can be performed using both HyPhy and IQTREE, and we have recommended that model tests be done as a first step before tree building. Estimating DNR from real datasets requires a substitution rate matrix of a non-reversible model.

      The manuscript has many text errors (including typos and incorrect citations). For example, many citations in page 20 show "Error! Reference source not found.". I think authors should double check the manuscript before submitting. Also, some text is not formally written. For example, "G represents gamma-distributed rates", rates of what? The text should be clear for readers that are not familiar with the topic (i.e., G represents gamma-distributed substitution rates among sites). In general, I recommend a detailed revision of the whole text of the manuscript.

      Done

      Reviewer #2 (Public Review):

      The authors evaluate whether non time reversible models fit better data presenting strand-specific substitution biases than time reversible models. Specifically, the authors consider what they call NREV6 and NREV12 as candidate non time-reversible models. On the one hand, they show that AIC tends to select NREV12 more often than GTR on real virus data sets. On the other hand, they show using simulated data that NREV12 leads to inferred trees that are closer to the true generating tree when the data incorporates a certain degree of non time-reversibility.

      Based on these two experimental results, the authors conclude that "We show that non-reversible models such as NREV12 should be evaluated during the model selection phase of phylogenetic analyses involving viral genomic sequences". This is a valuable finding, and I agree that this is potentially good practice.

      However, I miss an experiment that links the two findings to support the conclusion: in particular, an experiment that solves the following question: does the best-fit model also lead to better tree topologies?

      Because NREV12 leads to inferred trees that are closer to the true generating tree than those inferred with GTR, this shows that the best-fit model (in this case NREV12) leads to better tree topologies.

      On simulated data, the significance of the difference between GTR and NREV12 inferences is evaluated using a paired t test. I miss a rationale or a reference to support that a paired t test is suitable to measure the significance of the differences of the wRF distance. Also, the results show that on average NREV12 performs better than GTR, but a pairwise comparison would be more informative: for how many sequence alignments does NREV12 perform better than GTR?

      We used the paired t-test because it is the most widely used test for comparing mean values between two matched samples when the differences between the pairs are normally distributed, and the wRF distances meet these conditions.

      The paired t-test incorporates the pairwise comparison, and the side-by-side boxplots show the pairwise wRF comparisons.
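      The two views discussed in this exchange (a paired t statistic on the differences, and a per-alignment count of which model is closer to the true tree) can be computed side by side. The following is a minimal sketch, not the authors' analysis: the function name `paired_summary` and the wRF values are hypothetical and chosen only for illustration.

```python
import math
from statistics import mean, stdev


def paired_summary(dist_a, dist_b):
    """Paired t statistic for (dist_a - dist_b) plus per-pair win counts.

    Lower distance to the true tree is treated as better.
    """
    if len(dist_a) != len(dist_b) or len(dist_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(dist_a, dist_b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    t_stat = mean(diffs) / (sd / math.sqrt(n)) if sd > 0 else float("nan")
    return {
        "t": t_stat,
        "df": n - 1,
        "b_better": sum(d > 0 for d in diffs),  # model B has the smaller distance
        "a_better": sum(d < 0 for d in diffs),  # model A has the smaller distance
        "ties": sum(d == 0 for d in diffs),
    }


# Hypothetical wRF distances to the true tree (illustrative values only).
gtr = [0.30, 0.25, 0.40, 0.35, 0.20]
nrev12 = [0.28, 0.27, 0.33, 0.30, 0.20]
summary = paired_summary(gtr, nrev12)
```

      Reporting the win counts next to the t statistic answers the question "for how many alignments does one model perform better?" directly, which the mean difference alone does not.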

      Reviewer #2 (Recommendations for the authors):

      The authors reference Baele et al., 2010 for describing NREV6 and NREV12. I suggest using the same name used in the referenced paper: GNR-SYM and GNR respectively. Although I do not think there is a standard name for these models, I would use a previously used one.

      We have built studies based on the names NREV6 and NREV12. We would like to keep the naming as standard for our studies.

      GTR and NREV12 models are already described in many other papers. I do not see the need to include such an extensive description. Also, a reference should be included to the discrete Gamma rate categories [1]

      We included the extensive description to help readers who are less familiar with these models, since the names we use for them differ from those used in other papers.

      We have added referencing for the discrete gamma rate as recommended. (Yang, 1994)

      To evaluate the exhaustiveness and correctness of the results, I would recommend publishing as supplementary material the simulated data sets or the scripts for generating the data set, the scripts or command lines for the analysis, and the versions of the software used (e.g., IQTREE). Also, to strongly support the main conclusion of the manuscript, I suggest adding to the simulations section results the RF-distances of the best-fit selected model under AIC, AICc, and BIC as well.

      We can go ahead and submit all the needed datasets. The simulated data RF-Distances results are available and will be submitted. We cannot however add them to the main document as this will create very long data tables.

      In some instances, it is mentioned that the selection criterion used is AIC, while in others, AIC-c is referenced. Even in the table captions, both terms are mixed. It should be made clearer which criterion is being employed, as AIC is not suitable for addressing the overparameterization of evolutionary models, given that it does not account for the sample size. A previous pre-print of this article [2] does not mention AIC-c, explicitly includes the formulas for AIC that do not take the sample size into account, and reports the same results as this manuscript, which indicates that AIC and not AIC-c was used here. This should be clarified. It is recommended to use AIC-c instead of AIC, especially if the ratio of sample size to model parameters is low [3]. Two points may be made here: some authors consider tree branch lengths as free model parameters and others do not, and this paper does not specify how the model parameters are counted. AIC tends to select more parameterized models than AIC-c, and overparameterization can lead to different tree inferences, as evidenced in Hoff et al., 2016. Therefore, it is expected that NREV12 is more frequently selected than NREV6 and GTR.
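      The difference between the two criteria raised in this comment can be made concrete with the standard formulas, AIC = 2k - 2 ln L and AICc = AIC + 2k(k + 1)/(n - k - 1). The sketch below is illustrative only: the log-likelihoods, parameter counts and sample size are hypothetical, and how k and n are counted (for example, whether branch lengths are included) is exactly the kind of choice the comment asks to be specified.

```python
def aic(log_likelihood, k):
    """Akaike information criterion for k free parameters."""
    return 2 * k - 2 * log_likelihood


def aicc(log_likelihood, k, n):
    """Small-sample corrected AIC for k free parameters and sample size n."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)


# Hypothetical values: a simpler model and a richer model on a small sample.
n_sites = 50
simple = {"lnL": -1000.0, "k": 8}
rich = {"lnL": -995.0, "k": 12}

aic_simple = aic(simple["lnL"], simple["k"])
aic_rich = aic(rich["lnL"], rich["k"])
aicc_simple = aicc(simple["lnL"], simple["k"], n_sites)
aicc_rich = aicc(rich["lnL"], rich["k"], n_sites)
```

      With these made-up numbers, AIC prefers the richer model while AICc prefers the simpler one, which illustrates why mixing the two terms in the text and table captions matters.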

      In my opinion, a pairwise comparison between GTR and NREV12 performance is of great interest here, and the whiskers plots are not useful. Scatterplots would display the results better.

      Boxplots are meant to offer a simplified view of the results, as the paired t-test performs all of the comparisons. We shall provide the scatter plots as supplementary information so that readers can get fully detailed plots, as recommended.

      Some references are missing.

      Missing references added

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      We would like to thank the reviewers for taking the time to review our manuscript and for providing valuable comments on how to improve it. We are pleased to see that both reviewers recognize the novelty and importance of our study, its conceptual advance and potential clinical significance. They also noted the novelty and value of our functional mechanistic approach using epigenetic editing. Below, we provide a point-by-point response to their questions and points raised. The changes introduced in response to their feedback are highlighted in yellow in the revised manuscript file.

      Point-by-point description of the revisions

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      Summary: This study by Prada et al. aimed to explore DNA methylation and gene expression in primary EpCAMhigh/PDPNlow cells, which (probably) consist for the largest part of AT2 cells, to understand the molecular mechanisms behind the impaired regeneration of alveolar epithelial progenitor cells in COPD. They found that higher or lower promoter methylation in COPD-associated cells was inversely correlated with changes in gene expression, with interferon signaling emerging as one of the most upregulated pathways in COPD. IRF9 was identified as the master regulator of interferon signaling in COPD. Targeted DNA demethylation of IRF9 in an A549 cell line resulted in a robust activation of its downstream target genes, including OAS1, OAS3, PSMB8, PSMB9, MX2 and IRF7, demonstrating that demethylation of IRF9 is sufficient to activate the IFN signaling pathway, validating IRF9 as a master regulator of IFN signaling in (alveolar) epithelial cells.

      Major comments:

      • To remove airways (and blood vessels) completely from the lung tissue is difficult, if not impossible. This means that the assumption that the sorted EpCAMpos/PDPNlow cells primarily consisted of AT2 cells remains valid only if a quantitative analysis is conducted on the proportion of HT2-280pos cells in all samples in cytospins to exclude any significant contamination from bronchial epithelial cells. If the authors cannot demonstrate >95% pure HT2-280-positive cells, then the key conclusions suggesting that the epigenetic regulation of the IFN pathway might be crucial in AT2 progenitor cell regeneration could also potentially apply to bronchial progenitor cells. In addition, if >95% purity cannot be demonstrated, the data should be adjusted to account for differences in cell type composition.

      Response:

      We thank the reviewer for raising this important point. Although, as pointed out by the reviewer, we cannot guarantee that our sorted cells do not contain a minor contamination from respiratory / terminal bronchial cells, we carefully selected donors, tissue regions, and sorting strategy to ensure the highest possible enrichment of AT2 cells, as we explain below. We have now expanded the methods and results section and covered this point in the manuscript discussion.

      • The lung tissue pieces we received were distal, as evidenced by the presence of pleura. We collected representative tissue pieces for histology to validate sample quality. Our protocol includes a dissection of all visible airways and vessels using a dissecting microscope, which were cryopreserved separately from distal parenchyma. Hence, the starting material for tissue dissociation was depleted from airways and vessels. The importance of vessel/airway removal for enrichment of distal alveolar cells was established by Tata's group (PMID: 35712012).
      • We selected the AT2 sorting protocol (EpCAMpos/PDPNlow) based on previous publications that used tissue from both healthy and COPD lungs to separate AT2 cells from AT1 and airway basal cells, as AT1 and basal cells are both PDPNhigh (PMID: 22033268, PMID: 23117565; PMID: 35078977). This protocol was favoured due to the lack of information about HT2-280 expression and distribution in COPD lungs.
      • The sort quality for each sample was assessed by FACS analysis (back sorting) of the sorted cells, where we observed 95-97% purity (EpCAMpos/PDPNlow, Fig. 1G, shown below). In addition, we validated the sorting protocol and high AT2 enrichment from both no COPD and COPD tissues by immunostaining the FACS-sorted cells with HT2-280, an AT2 marker widely used in the field (strategy suggested by the reviewer), and observed that close to 100% of cells were positive for this marker (Fig. 1H, shown below). However, we could not do this retrospectively for those patients from whom we did not have enough material. Sorting primary AT2 cells from small tissue pieces is challenging, and we need at least 20,000 cells to obtain high-quality methylation and RNA-seq data.
      • AT2 marker genes (ABCA3, LPCAT1, LAMP3 and the surfactant genes SFTPA2, SFTPB and SFTPC) were among the most highly expressed genes in our RNA-seq data and were not significantly changed in COPD (please see the expression data in Fig. S2A in the manuscript, reproduced below for convenience, as well as Table 6), providing further evidence that the sorted cells carry a strong AT2 transcriptional signature.

      Fig. 1G: FACS plot examples showing the analysis of sorted AT2 cells (back sorting) from control (blue) and COPD (green) donors displayed over total lung cell suspensions (grey). Fig. 1H: Representative IF staining of HT2-280 expression in sorted AT2 cells from no COPD (top) and COPD (bottom) donors. Nuclei (blue) were stained with DAPI, scale bars = 20 µm.

      Fig. S2A: Normalized read counts from RNA-seq data for AT2-specific genes in sorted AT2 cells from each donor (dots). Data points represent normalised counts from no COPD (blue), COPD I (light green) and COPD II-IV (dark green). Group median is shown as a black bar.

      • In agreement with a previous study which profiled bulk AT2 using expression arrays (PMID: 23117565), we also observed upregulation of the IFN signaling pathway in COPD AT2s. The enrichment of the IFNα/β signature was also observed in COPD in the inflammatory AT2 cluster (iAT2) in a recent scRNA-seq study (PMID: 36108172). As part of the revision, we compared the IFN gene signature identified in our bulk AT2 RNA-seq with a recent scRNA-seq study (published after the submission of our manuscript, PMID: 39147413) that profiled EpCAMpos cells from COPD and non-smoker donor lungs. We observed an upregulation of our IFN signature genes in AT2 in COPD (mostly in AT2c and rbAT2 subsets), suggesting that similar signatures were observed in COPD AT2s in this dataset as well (please see Fig. S4E-F below).

      Figure S4E: Expression values for the indicated genes of the IFN pathway from an external scRNA-seq dataset of AT2 cells from COPD patients and healthy controls (Hu et al, 2024). Y-axis shows log-normalized gene expression levels. F: Combined gene set score of the genes shown in (E) in different subsets of AT2 cells from Hu et al, 2024. The IFN signature genes were identified in our integrative analysis of TWGBS and RNA-seq in sorted AT2 cells.

      • We have also carefully examined DNA methylation profiles across all samples. The density plots of our T-WGBS DNA methylation data are very similar among the individual samples in all 3 groups, indicating that the sorted cells consist mostly of a single cell type, as there are no obvious intermediate (25-75%) methylation peaks, as observed in cell mixtures (Fig. 2A and the panel below). No reference DNA methylation profiles are available for respiratory or terminal bronchial cells; hence, we cannot compare how epigenetically different these cells would be from AT2 nor perform a deconvolution for potential minor contamination with distal airway cells.

      Figure: DNA methylation density plots of sorted EpCAMpos/PDPNneg cells from no COPD (blue, n=3), COPD I (light green, n=3) and COPD II-IV (dark green, n=5) showing a homogeneous methylation pattern and low abundance at intermediate (25%-75%) methylation values across all profiled samples, indicating that the sorted cells were mostly of a single cell type.
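      The argument about intermediate methylation can be expressed as a simple per-sample summary: the fraction of CpGs whose methylation level falls in the 25-75% range. The sketch below is only an illustration of that idea, not the authors' pipeline; the function name and the toy methylation values are hypothetical.

```python
def intermediate_fraction(methylation_levels, low=0.25, high=0.75):
    """Fraction of CpG methylation levels (0..1) lying within [low, high]."""
    if not methylation_levels:
        raise ValueError("no methylation values supplied")
    hits = sum(low <= m <= high for m in methylation_levels)
    return hits / len(methylation_levels)


# Toy per-CpG methylation levels (hypothetical, for illustration only).
mostly_pure = [0.02, 0.05, 0.90, 0.95, 0.98, 0.01, 0.88, 0.50]
mixture_like = [0.45, 0.55, 0.60, 0.95, 0.30, 0.05, 0.70, 0.50]

pure_frac = intermediate_fraction(mostly_pure)
mixed_frac = intermediate_fraction(mixture_like)
```

      A low fraction is consistent with, but does not prove, a largely homogeneous population, since cell types with similar methylomes would not produce intermediate values.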

      • We have now added a sentence to the limitations section of the discussion to cover that point specifically.

      CHANGES IN THE MANUSCRIPT:

      AT2 cells were isolated by fluorescence-activated cell sorting (FACS) from cryopreserved distal lung parenchyma, depleted of visible airways and vessels of three no COPD controls, three COPD I and five COPD II-IV patients as previously described (24, 52, 53)

      The isolated cells were positive for HT2-280, a known AT2 marker (54), as confirmed by immunofluorescence (Fig. 1H), validating the identity and high enrichment of the isolated AT2 populations.

      Known AT2-specific genes, including ABCA3, LAMP3 and surfactant genes (SFTPA2, SFTPB and SFTPC) were among the top highly expressed genes and were not significantly changed in COPD AT2s (Fig. S2A, Table 6), further confirming the AT2-characteristic transcriptional signature of our isolated cells.

      However, 5-AZA is a global demethylating agent, and the observed effects may not be direct. To validate the epigenetic regulation of central AT2 pathways further, we took advantage of locus-specific epigenetic editing technology (73). We focused on the IFN pathway because it was the most significantly enriched Gene Ontology (GO) term in our integrative analysis of TWGBS and RNA-seq data. Several IFN pathway members had associated hypomethylated DMRs within promoter-proximal regions and concomitant increased gene expression (Fig. 4C and S2C). Additionally, we confirmed the elevated expression of IFN-related genes with associated DMRs identified in our study in AT2 cells and AT2 cell subclusters from a recently published scRNA-seq cohort (74) (Fig. S4E-F).

      We observed upregulation of multiple IFN genes in AT2 in COPD, consistent with a previous expression array study (24). IFNα/β signaling was also enriched in COPD patients in the inflammatory AT2 cluster (iAT2) in a recent scRNA-seq study (84) and our IFN signature genes were also upregulated in AT2c and AT2rb subsets in COPD, identified by another scRNA-seq study recently (74).

      Finally, despite careful removal of airways from distal lung tissue using a dissecting microscope, we cannot exclude the presence of some terminal/respiratory bronchiole cells in our FACS-isolated EpCAMpos/PDPNlow population. Recent scRNA-seq studies provided an unprecedented resolution and identified several epithelial subpopulations and transitional cells residing in the terminal/respiratory bronchioles and alveoli, including respiratory airway secretory cells (93), terminal airway-enriched secretory cells (28), terminal bronchiole-specific alveolar type-0 (AT0) (70), and emphysema-specific AT2 cells (74). These cells may contribute to alveolar repair in healthy and COPD lungs; however, with our bulk DNA methylation and RNA-seq study, we are unable to resolve all these subpopulations. Future development of single-cell methylation and non-reference-based algorithms for DNA methylation deconvolution will enable deeper epigenetic phenotyping of specific AT2 and bronchiolar cell subsets.

      (Methods) Validation of IFN gene upregulation in a published scRNA-seq dataset

      scRNA-seq data from (74), generously provided by M. Königshoff, were processed using the default Seurat workflow (117). Expression of IFN-related genes was extracted and plotted as log-normalised gene expression levels in AT2 cells from control and COPD donors. Seurat's AddModuleScore() function was used to compute a gene set score for a custom IFN program using the genes listed in Fig. S4E and to analyse the IFN gene set scores in AT2 cell subclusters identified in (74). Briefly, average gene expression scores were computed for the gene set of interest, and the expression of control features (randomly selected) was subtracted as described in (118).
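      The scoring idea described above (average expression of the gene set minus the average expression of randomly selected control features) can be sketched in a few lines. This is a deliberately simplified illustration in Python, not Seurat's implementation: Seurat's AddModuleScore() additionally bins genes by average expression and draws control features from matching bins, which is omitted here. The function name, the background gene names and all expression values are hypothetical.

```python
import random
from statistics import mean


def module_score(expr, gene_set, n_control=5, seed=0):
    """Per-cell score: mean(gene set) - mean(randomly chosen control genes).

    expr maps gene name -> list of per-cell expression values.
    """
    genes_in = [g for g in gene_set if g in expr]
    if not genes_in:
        raise ValueError("none of the gene-set members are in the matrix")
    background = sorted(g for g in expr if g not in gene_set)
    rng = random.Random(seed)  # fixed seed keeps the control draw reproducible
    controls = rng.sample(background, min(n_control, len(background)))
    n_cells = len(expr[genes_in[0]])
    scores = []
    for cell in range(n_cells):
        set_mean = mean(expr[g][cell] for g in genes_in)
        ctrl_mean = mean(expr[g][cell] for g in controls) if controls else 0.0
        scores.append(set_mean - ctrl_mean)
    return scores


# Toy log-normalised expression for four cells (two control, two COPD).
toy_expr = {
    "IRF9": [1.0, 1.0, 3.0, 3.0],
    "OAS1": [1.0, 1.0, 3.0, 3.0],
    "BG1": [1.0, 1.0, 1.0, 1.0],  # hypothetical background genes
    "BG2": [2.0, 2.0, 2.0, 2.0],
    "BG3": [0.0, 0.0, 0.0, 0.0],
}
scores = module_score(toy_expr, ["IRF9", "OAS1"], n_control=3)
```

      Subtracting the control mean makes the score relative to each cell's overall expression level, so cells with globally higher counts do not automatically receive higher gene set scores.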

      Fig. S4E and F: E. Expression values for the indicated genes of the IFN pathway from an external scRNA-seq dataset of AT2 cells from COPD patients and healthy controls (74). Y-axis shows log-normalized gene expression levels. F. Combined gene set score of the genes shown in (E) in different subsets of AT2 cells from (74). The IFN signature genes were identified in our integrative analysis of TWGBS and RNA-seq in sorted AT2 cells.

      • The overrepresentation of several keratins (KRT5, KRT14, KRT16, KRT17), mucins (MUC12, MUC13, MUC16, MUC20) and the transcription factor FoxJ1 is now attributed by the authors to a possible dysregulation of AT2 identity and differentiation in COPD (lines 282 - 284), where they cite refs 28, 69, 70. The authors try to support this with IF double stains for KRT5 and HT2-280 to identify co-expression of KRT5 and HT2-280 in lung tissue (Figure S2H). However, the evidence for the co-expression of both markers could be presented more convincingly.

      Response:

      We found the potential co-expression of airway and alveolar markers in COPD lungs interesting and hence included it in the original manuscript. The initial discovery came from our bulk RNA-seq data, where we observed upregulation of several genes typically found in more proximal airways in COPD (mentioned above by the reviewer). Of note, some of them (e.g., FoxJ1) are expressed at very low levels. Following the reviewer's comments, to validate possible colocalization of AT2 and airway markers at the protein level, we performed further IF analysis. We took Z-stack images to demonstrate the co-localization of HT2-280 and Krt5 more convincingly and co-stained the same tissue regions with SCGB3A2 (a TASC/distal airway cell marker, PMID 36796082). Even though these are rare events, we were able to reproduce the existence of HT2-280/Krt5 positive, SCGB3A2 negative cells in the alveoli of COPD patients at the protein level (Fig. S2H and panels below). Although interesting, we decided to keep this finding in the supplement and did not include it in the discussion to focus the story on the epigenetic regulation of the IFN pathway, which is the main discovery of our study. We will investigate this observation in future studies.

      Figure S2H and here: Examples of HT2-280/Krt5 double-positive cells. Top, immunofluorescence staining of the alveolar region of a COPD II donor showing AT2 cells (HT2-280 positive, red) that are SCGB3A2 negative (green, left) but KRT5 positive (green, right). In conclusion, double-positive HT2-280/KRT5 cells are rare but present in the alveoli of COPD patients. Magnification: 20x. Scale bar: 50 µm. Bottom, Z-stack images highlighting HT2-280 (red) and KRT5 (green) double-positive cells at 63x magnification. Scale bar: 5 µm.

      CHANGES IN THE MANUSCRIPT:

      In addition, we observed an upregulation of several keratins (KRT5, KRT14, KRT16, KRT17) and mucins (MUC12, MUC13, MUC16, MUC20), suggesting a potential dysregulation of alveolar epithelial cell differentiation programs in COPD (Table 6, Fig. S2F). Immunofluorescence staining confirmed the presence of KRT5-positive cells in the distal lung in COPD and identified cells positive for both KRT5 and HT2-280 (Fig. S2H). Collectively, these results indicate a dysregulation of stemness and identity in the alveolar epithelial cells in COPD.

      Fig. S2H legend: The zoomed-in panel (right corner, bottom) demonstrates the presence of rare HT2-280/KRT5 double-positive cells in the alveoli of COPD patients. Slides were counterstained with DAPI, scale bars = 50 µm, 20 µm or 5 µm, as displayed in images.

      • Double staining for KRT5 and HT2-280 did highlight the proximity of both cell types in lung tissue, underscoring the challenge of removing airways (including the smaller and terminal bronchi) from the tissue. In addition, HT2-280/KRT5 co-expression is not consistent with recent studies from refs 28, 69, 70, where other markers for distal airway cell transition, such as SCGB3A2 and BPIFB1, have been demonstrated, which were not investigated in this study.

      Response:

      We provided a general overview of the different signatures observed in our data, but we could not validate every deregulated pathway or gene. We include the relevant tables detailing all differentially expressed genes and differentially methylated regions to enable and encourage the community to follow up on the data in subsequent studies.

      As demonstrated above, we detect the co-occurrence of HT2-280/KRT5 staining on the protein level in the same cells in the alveoli of COPD patients. We would like to emphasize that alveolar epithelial cell identity in COPD lungs has not been investigated in detail on the protein or RNA level, and HT2-280/KRT5 co-expression/co-localization has not been directly tested in the studies mentioned by the reviewer since, among other reasons, the gene encoding HT2-280 has not been identified. Notably, a recent study (published after the submission of our manuscript) focusing on enriched epithelial cells from the distal lungs of COPD patients (PMID 35078977) identified an emphysema-specific AT2 subtype co-expressing the AT2 marker SFTPC and distal airway cell transition marker SCGB3A2, indicating that disease-specific AT2 populations with possible co-occurrence of AT2 and airway markers exist. In our dataset, SCGB3A2 was not deregulated (log2 fold change=0.22, adj p-value=0.47), as shown in Table 6, and the HT2-280/Krt5 positive cells were negative for SCGB3A2 in our IF staining (see above).

      BPIFB1 is one of the antimicrobial peptides genes with an associated DMR and is significantly upregulated in COPD cells in our study (log2 fold change=1.17, adj p-value=0.0016), as shown in the supplementary figure Fig S4C and here below for convenience.

      Figure S4C Fold-change in gene expression of BPIFB1 in AT2 cells in COPD (RNA-seq) and A549 cells treated with 0.5µM AZA (RT-qPCR) compared to control samples. Left, RNA-seq data from AT2 cells (no COPD, blue, n=3; COPD II-IV, green, n=5). Right, A549 treated with AZA (orange, n=3) compared to control DMSO-treated cells (grey, n=3). The group median is shown as a black bar.

      • The small (and not evenly divided) sample size of both COPD and non-COPD specimens may lead to a higher risk for false positive results as adjustments for multiple testing typically rely on the number of comparisons, and small sample sizes may not provide enough data points to adequately control for this.

      Response:

      We acknowledge the problem of testing for multiple traits with relatively small numbers of samples. The availability of donor tissue, especially from non-COPD and COPD-I donors, was limited, and we applied very strict donor matching and quality control criteria for sample inclusion to avoid additional variability and confounding factors. The importance of strict quality control in selecting appropriate control samples was highlighted in our previous study (PMID: 33630765), where we demonstrated that approximately 50% of distal lung tissue from cancer patients with normal spirometry has pathological changes. Hence, we believe that the quality of the tissue was paramount to the reliability of the data. Strict quality control and sample matching for multiple parameters, including age, BMI, smoking status and smoking history (critical for DNA methylation studies), and cancer type (for background tissue), is a key strength of our approach, but it inevitably limited our sample size.

      First, all samples were cryopreserved and then processed in parallel in groups of 1 non-COPD and 2-3 COPD samples. This process included tissue dissociation, FACS sorting, back sorting (always), and immunofluorescence staining (when enough material was available). Cell pellets were stored at -80 °C until the entire cohort was ready for sequencing. This was done to limit the potential variation introduced by processing and sorting. RNA and DNA isolations were performed in parallel for all the sorted cell pellets, which were then sequenced as a single batch.

      During data analysis, we applied stringent cutoffs for DMR detection to reduce the risk of false positives due to multiple comparisons and a small sample size. Specifically, we filtered for regions with at least 10% methylation difference and containing at least 3 CpGs. Additionally, we applied a non-parametric Wilcoxon test using average DMR methylation levels to remove potentially false-positive regions, as the t-statistic is not well suited for non-normally distributed values, as expected at very low/high (close to 0% / 100%) methylation levels. A significance level of 0.1 has been used. Therefore, we are confident that the rigorous analysis and strict criteria applied in this study allowed us to detect trustworthy DMRs that we could further functionally validate using epigenetic editing. All the details of the DMR analysis are provided in the methods section. To address this point and limitation, we have added the following paragraphs in the discussion section of the manuscript:

      CHANGE IN THE MANUSCRIPT:

      The strengths of our study include the use of purified human alveolar type 2 epithelial progenitor cells from a well-matched and carefully validated cohort of human samples, including mild and severe COPD patients, providing high relevance to human COPD.

      However, we acknowledge several limitations of our study that warrant further investigation. First, the sample size was small. The use of strict quality criteria for donor selection limited the available samples, particularly for the ex-smoker control group. This resulted in an unequal distribution of COPD and control samples. This impacts the power of statistical analysis, particularly in the WGBS analysis, where millions of regions genome-wide are tested. Nevertheless, the clear negative correlation between promoter methylation and corresponding gene expression highlights the robustness of the DMR selection. Additionally, we were able to experimentally validate interferon-associated DMRs using epigenetic editing, highlighting the power of integrated epigenetic profiling in identifying disease-relevant regulators.
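      The DMR filtering criteria described in the response above (at least 10% methylation difference, at least 3 CpGs, and a non-parametric Wilcoxon test on average DMR methylation at a significance level of 0.1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the difference is taken between group means, that "significant" means p < 0.1, and it uses an exact two-sided rank-sum test suitable only for small groups. All function names and methylation values are hypothetical.

```python
from itertools import combinations


def exact_rank_sum_p(x, y):
    """Two-sided exact Wilcoxon rank-sum p-value for two small samples."""
    pooled = sorted(x + y)
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # midrank shared by tied values
        i = j
    all_ranks = [ranks[v] for v in x + y]
    n_x, n = len(x), len(x) + len(y)
    expected = n_x * (n + 1) / 2  # expected rank sum of x under the null
    observed = abs(sum(all_ranks[:n_x]) - expected)
    # Enumerate every way of assigning n_x of the ranks to the first group.
    devs = [abs(sum(c) - expected) for c in combinations(all_ranks, n_x)]
    return sum(d >= observed - 1e-9 for d in devs) / len(devs)


def keep_dmr(ctrl, copd, n_cpgs, min_diff=0.10, min_cpgs=3, alpha=0.1):
    """Apply the three filters to one candidate region."""
    diff = abs(sum(copd) / len(copd) - sum(ctrl) / len(ctrl))
    return (
        n_cpgs >= min_cpgs
        and diff >= min_diff
        and exact_rank_sum_p(ctrl, copd) < alpha
    )


# Hypothetical average methylation of one region in 3 controls and 5 COPD donors.
ctrl = [0.80, 0.82, 0.78]
copd = [0.60, 0.55, 0.65, 0.58, 0.62]
p_value = exact_rank_sum_p(ctrl, copd)
kept = keep_dmr(ctrl, copd, n_cpgs=4)
```

      With 3 versus 5 samples, the smallest attainable two-sided exact p-value is 2/56 (about 0.036), which shows how the group sizes bound the significance any single region can reach.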

      Minor suggestions for improvement

      Introduction

      • In general, refer to the actual experimental studies rather than review papers where appropriate.

      Response:

      We have now carefully checked all the references and amended them to refer to experimental studies when required.

      • Clearly specify whether a study was conducted in mice or humans, as this distinction is crucial for understanding the relevance of the findings to COPD.

      Response:

      All our experiments were performed with human lung cells and tissues. No mouse samples were used. As suggested, we have now clearly stated that our study was performed using human tissue samples and cells in different parts of the manuscript, including the discussion, where we now explicitly highlight the strengths and limitations of our study.

      CHANGES IN THE MANUSCRIPT:

      ...we generated whole-genome DNA methylation and transcriptome maps of sorted human primary alveolar type 2 cells (AT2) at different disease stages.

      However, the regulatory circuits that drive aberrant gene expression programs in human AT2 cells in COPD are poorly understood

      Therefore, we set out to profile DNA methylation of human AT2 cells at single CpG-resolution across COPD stages.

      ...suggesting that aberrant epigenetic changes may drive COPD phenotypes in human AT2.

      To identify genome-wide DNA methylation changes associated with COPD in purified human AT2 cells...

      The similarity of the methylation and gene expression profiles in the PCAs suggested that epigenetic and transcriptomic changes in human AT2 cells during COPD might be interrelated ...

      In this work, we demonstrate that genome-wide DNA methylation changes occurring in human AT2 cells may drive COPD pathology by dysregulating key pathways that control inflammation, viral immunity and AT2 regeneration.

      Using high-resolution epigenetic profiling, we uncovered widespread alterations of the DNA methylation landscape in human AT2 cells in COPD that were associated with global gene expression changes.

      Currently, it is unclear how cigarette smoking leads to changes in DNA methylation patterns in human AT2

      The strengths of our study include the use of purified human alveolar epithelial progenitor cells from a well-matched and carefully validated cohort of human samples, including mild and severe COPD patients, providing high relevance to human COPD.

      Methods

      • Line 473: is "3 ex-smoker controls" meant here instead of "smoker controls"?

      Response:

      All donors (no COPD and COPD) used in our study are ex-smokers. Matching the samples with regard to smoking status and history is critical for epigenetic studies, as cigarette smoke profoundly affects DNA methylation genome-wide (PMID: 38199042, PMID: 27651444). This has now been clarified in the revised manuscript.

      CHANGE IN THE MANUSCRIPT:

      Of note, we included only ex-smokers in our profiling to avoid acute smoking-induced inflammation as a confounding factor (50).

      Importantly, we matched the smoking status and smoking history of all donors, which is key in epigenetic studies, as cigarette smoking profoundly impacts the DNA methylation landscape of tissues (96).

      In total, 3 ex-smoker controls (no COPD), 3 mild COPD ex-smoker donors (GOLD I, COPD I) and 5 moderate-to-severe COPD ex-smoker donors (GOLD II-IV, COPD II-IV) were profiled (Fig. 1A-C, Table 1).

      Discussion

      • A list of limitations should be added to the discussion. One is the use of the alveolar cell line A549, which produces mucus, a characteristic more commonly associated with bronchial epithelial cells (ref 43, l. 530).

      Response:

      The profiling was performed using purified primary human alveolar epithelial progenitor cells. For technical reasons, A549 cells were only used for validation of the results using epigenetic editing. The A549 phenotype depends on the growth medium used, in our case, Ham's F-12 medium, which is recommended for long-term A549 culture and promotes multilamellar body formation and differentiation toward an AT2-like phenotype (PMID: 27792742). We are developing epigenetic editing technology for use in primary lung cells; however, the approach currently relies on the high efficiency of transient transfections, which cannot yet be achieved with primary adult AT2 cells. We were positively surprised by how well the methylation data obtained from patient AT2s translated into mechanistic insights when using A549 cells, despite A549 being a cancer cell line. This suggests that the fundamental mechanisms of epigenetic regulation of IRF9 and the IFN signaling pathway are conserved between A549 and primary AT2 cells.

      • Another limitation to consider is that cells were isolated primarily from individuals with lung cancer, except for patients with COPD stage IV, particularly as COPD stage II and IV samples were grouped together. Please also discuss the small and unevenly divided sample size.

      Response:

      We thank the reviewer for bringing up this important point, which we carefully considered when designing our study. To match our samples across the cohort, all the no-COPD, COPD I, and two of the COPD II-IV samples were obtained from cancer resections. In addition to other characteristics, like age, BMI and smoking status, we also matched the donors by cancer type (all profiled donors had squamous cell carcinoma). We collected lung tissue as far away from the carcinoma as possible and sent representative pieces for histological analysis by an experienced lung pathologist to confirm the absence of visible tumours. In addition, to ensure that our data represents COPD-relevant signatures, we intentionally included samples from three COPD donors undergoing lung resections (without a cancer background) in the profiling.

      Following the reviewer's suggestion, to investigate the potential impact of non-cancer samples on driving the observed differences, we carefully checked the PCAs for both DNA methylation and RNA-seq. We could not identify a clear separation of no-cancer COPD samples from the cancer COPD samples (or other cancer samples) in any examined PCs, indicating no confounding effect of cancer background in the samples. We observed that one sample contributing to PC2 is a non-cancer sample, but this was a rather sample-specific effect, as the other two non-cancer samples clustered together with the other severe COPD samples with a cancer background. Notably, in our DNA methylation data, we do not observe typical features of cancer methylomes, like global loss of DNA methylation or aberrant methylation of CpG islands (e.g., in tumour suppressor genes) (see Fig 2A), further suggesting that we do not "pick up" confounding cancer signatures in our data.

      Following the comments from both reviewers, to clarify that point, we added the information about cancer and non-cancer samples to the PCA figures for DNA methylation (new Fig. 2B) and RNA-seq (new Fig. 3A) data in the revised manuscript, as shown below

      CHANGE IN THE MANUSCRIPT:

      COPD samples from donors with a cancer background clustered together with the COPD samples from lung resections, confirming that we detected COPD-relevant signatures (Fig. 2B).

      Fig. 2B: *Principal component analysis (PCA) of methylation levels at CpG sites with > 4-fold coverage in all samples. COPD I and COPD II-IV samples are represented as light and dark green triangles, respectively, and no COPD samples as blue circles. COPD samples without a cancer background are displayed with a black contour. The percentage indicates the proportion of variance explained by each component.*

      Unsupervised principal component analysis (PCA) on the top 500 variable genes revealed a clear influence of the COPD phenotype in separating no COPD and COPD II-IV samples, as previously observed with the DNA methylation analysis, irrespective of the cancer background of COPD samples (Fig.3A, Fig. S2B).

      *Principal component analysis (PCA) of the 500 most variable genes in RNA-seq analysis. PC1 and PC2 are shown in Fig. 3A, PC1 and PC4 in Fig. S2B. COPD I and COPD II-IV samples are represented as light and dark green triangles, respectively, and no COPD samples as blue circles. COPD samples without a cancer background are displayed with a black contour. The percentage indicates the proportion of variance explained by each component.*

      __Response: __

      We thank the reviewer for suggestions on how to improve the discussion of our manuscript. We have now added a strength/limitation section to our discussion and included the points suggested by both reviewers.

      CHANGE IN THE MANUSCRIPT:

      The strengths of our study include the use of purified human alveolar epithelial progenitor cells from a well-matched and carefully validated cohort of human samples, including mild and severe COPD patients, providing high relevance to human COPD. Importantly, we matched the smoking status and smoking history of all donors, which is key in epigenetic studies, as cigarette smoking profoundly impacts the DNA methylation landscape of tissues (96). With the first genome-wide high-resolution methylation profiles of isolated cells across COPD stages, we offer novel insights into the epigenetic regulation of gene expression in epithelial progenitor cells in COPD, expanding our understanding of how alterations in regulatory regions and specific genes could contribute to disease development. We identified IRF9 as a key IFN transcription factor regulated by DNA methylation. Notably, by targeting IRF9 through epigenetic modifications, we modulated the activity of the IFN pathway, which plays a crucial role in the immune response and lung tissue regeneration. Epigenetic editing techniques could offer a novel therapeutic strategy for COPD by downregulating IFN pathway activation and promoting the regeneration of epithelial progenitor cells in the lungs. Further preclinical and clinical studies are needed to validate the efficacy and safety of epigenetic editing approaches in COPD treatment (33).

      *However, we acknowledge several limitations to our study that warrant further investigation. First is the small sample size and replication difficulty due to the lack of available data, common challenges for studies working with sparse human material and hard-to-purify cell populations. The use of strict quality criteria in donor selection limited the available samples, especially for the ex-smoker control group, leading to an unequal distribution of COPD and control samples. Overall, this impacts the power of statistical analysis, especially in the WGBS analysis, where millions of regions genome-wide are tested. Nevertheless, the clear negative correlation of promoter methylation to the corresponding gene expression highlights the robustness of the DMR selection. Furthermore, we could experimentally validate interferon-associated DMRs using epigenetic editing, highlighting the power of integrated epigenetic profiling for the discovery of disease-relevant regulators. *

      Overall, we detected a higher number of correlated DMR-DEG associations using our simple promoter-proximal linkage compared to the GeneHancer approach. Assigning enhancers to their target genes with high confidence is a complex and challenging task. Enhancers are often located far from the genes they regulate and can interact with their target genes through three-dimensional chromatin loops. Furthermore, enhancers can operate in a highly context-dependent manner, with the same enhancer regulating different genes depending on the cell type, developmental stage, or environmental signals. Determining which enhancer is active under specific conditions remains a hurdle in the field, especially since the AT2-specific chromatin profiles of enhancer marks are not yet available.

      In addition, while WGBS provides unprecedented resolution and high coverage of the DNA methylation sites across the genome, it does not allow distinguishing 5-methylcytosine from 5-hydroxymethylcytosine. Therefore, we cannot exclude that some methylated sites we detected are 5-hydroxymethylated. However, as 5-hydroxymethylcytosine is present at very low levels in the lung tissue (97), its effect is likely marginal.

      Finally, despite careful removal of airways from distal lung tissue using a dissecting microscope, we cannot exclude the presence of some terminal/respiratory bronchiole cells in our FACS-isolated EpCAMpos/PDPNlow population. Recent scRNA-seq studies provided an unprecedented resolution and identified several epithelial subpopulations and transitional cells residing in the terminal/respiratory bronchioles and alveoli, including respiratory airway secretory cells (93), terminal airway-enriched secretory cells (28), terminal bronchiole-specific alveolar type-0 (AT0) (70), and emphysema-specific AT2 cells (74). These cells may contribute to alveolar repair in healthy and COPD lungs; however, with our bulk DNA methylation and RNA-seq study, we are unable to resolve all these subpopulations. Future development of single-cell methylation and non-reference-based algorithms for DNA methylation deconvolution will enable deeper epigenetic phenotyping of specific AT2 and bronchiolar cell subsets.

      __References__

      • Check references. For instance, there is no reference in the text to ref 43.

      • Align format of references

      __Response: __

      We thank the reviewer for spotting this inconsistency. We have carefully checked and aligned the format of all references. The (old) reference 43 is now mentioned in the discussion part.

      __Reviewer #1 (Significance (Required)): __

      The strength of this study lies in its focus on the molecular mechanisms underlying the impaired regeneration of epithelial progenitor cells in COPD. The discovery of IRF9, which regulates IFN signaling and is prominently upregulated in COPD, together with the convincing validation of the epigenetic control of the IFN pathway by targeted DNA demethylation of the IRF9 gene, adds significant value to the COPD research field.

      Main limitations of the study are the relatively small sample size of both COPD and non-COPD specimens and the claim that the sorted EpCAMpos/PDPNlow cells primarily consisted of AT2 cells.

      __- Describe the nature and significance of the advance (e.g. conceptual, technical, clinical) for the field. __

      The nature and significance of the advance in epigenetic editing of IRF9 in COPD can be described as both conceptual and potentially clinical:

      Conceptual Advance: The epigenetic editing of IRF9 enhances our understanding of the molecular mechanisms underlying COPD pathogenesis. By targeting IRF9 through epigenetic modifications, researchers were able to modulate the activity of the IFN pathway, which plays a crucial role in the immune response and lung tissue regeneration. This approach offers insights into the epigenetic regulation of gene expression in epithelial progenitor cells in COPD and expands our understanding of how alterations in specific gene methylation could contribute to disease progression.

      Clinical Significance: The potential clinical significance of epigenetic editing of IRF9 lies in its implications for COPD therapy. If successful, epigenetic editing techniques could offer a novel therapeutic strategy for COPD by downregulating IFN pathway activation and promoting regeneration of epithelial progenitor cells in the lungs. Obviously, further preclinical and clinical studies are needed to validate the efficacy and safety of epigenetic editing approaches in COPD treatment.

      __Response: __We thank the reviewer for recognising the importance of our study, its conceptual advance and potential clinical significance. We are pleased to see that the reviewer highlights the promise of epigenetic editing in both furthering our basic understanding of molecular mechanisms of chronic diseases and its future potential as a therapeutic strategy.

      __- Place the work in the context of the existing literature (provide references, where appropriate).__

      Few experimental papers have been published on epigenetic editing in lung diseases, with limited research available beyond the study referenced in citation 43: Song J, Cano-Rodriquez D, Winkle M, Gjaltema RA, Goubert D, Jurkowski TP, Heijink IH, Rots MG, Hylkema MN. Targeted epigenetic editing of SPDEF reduces mucus production in lung epithelial cells. Am J Physiol Lung Cell Mol Physiol. 2017 Mar 1;312(3):L334-L347. doi: 10.1152/ajplung.00059.2016. Epub 2016 Dec 23. PMID: 28011616.

      Response:

      We thank the reviewer for recognising the uniqueness and novelty of our study and the lack of research on the functional understanding of DNA methylation in the context of lung and lung diseases.

      - State what audience might be interested in and influenced by the reported findings.

      This study is of broad interest to researchers investigating the pathogenesis and treatment of COPD.

      __- Define your field of expertise with a few keywords to help the authors contextualize your point of view. __

      Expertise in: Lung pathology, Immunology, COPD, Epigenetics

      - Indicate if there are any parts of the paper that you do not have sufficient expertise to evaluate.

      Less expertise in: Epigenetic Editing

      __Reviewer #2 (Evidence, reproducibility and clarity (Required)): __

      __Summary: __

      This study aims to understand the molecular mechanisms underlying dysfunction in AT2 cells in COPD by profiling bulk genome-wide DNA methylation using tagmentation-based whole-genome bisulfite sequencing (T-WGBS) and RNA sequencing in selectively sorted primary AT2 cells. The study stands out in its sequencing breadth and use of an incredibly difficult cell population, and has the potential to add substantially to our mechanistic understanding of epigenetic contributions to COPD. A further highlight is the concluding aspect of the study, where the authors undertook targeted modification of specific CpG methylation, providing direct, site-specific evidence for transcriptional regulation by CpG methylation.

      Response:

      We thank the reviewer for recognizing the conceptual and methodological advance of our study and for noting the value of our functional mechanistic approach.

      __Major comments: __

      The authors clearly show that there is DNA methylation alteration in AT2 cells from COPD individuals that links functionally to gene expression at some level. However, I think the statement "to identify genome-wide changes associated with COPD development and progression..." and similar references to understanding disease development are not accurate, given that the primary DNA methylation comparison is between control and moderate to severe COPD, with no temporal detail or evidence that the changes drive progression rather than result from COPD development. The paragraph starting on line 186, where this is addressed to some extent, is quite vague and doesn't really provide confidence that DNAm dysregulation occurs at an early stage in this context. This can be addressed by changing the focus/style of the text.

      __Response: __

      Thank you for raising this point. We agree with the reviewer that our cross-sectional study describes the association of methylation changes with either COPD I or more established disease (COPD II-IV) and that the observed changes may be either the driver or a result of COPD development. This has been clarified in the revised manuscript, and we removed the statements about disease initiation and progression. This is an important point; hence, we added an extra line to the discussion to make that clear.

      __CHANGE IN THE MANUSCRIPT:__

      Therefore, we set out to profile DNA methylation of human AT2 cells at single CpG-resolution across COPD stages to identify epigenetic changes associated with disease and combine this with RNA-seq expression profiles.

      To identify epigenetic changes associated with COPD, we collected lung tissue from patients with different stages of COPD,

      ....to identify methylation changes associated with mild disease, we included TWGBS data from AT2 isolated from COPD I patients (n=3) in the analysis.

      Currently, we do not know whether the identified DNA methylation changes are the cause or the consequence of the disease process and not much is known about the correlation of DNA methylation with disease severity.

      *However, our study is cross-sectional, our cohort included only 3 COPD I donors, and we did not have any follow-up data on the patients, so future large-scale profiling of mild disease (or even pre-COPD cohorts) in an extended patient cohort will be crucial for a better understanding of early disease and its progression trajectories. *

      __Results comments and suggestions: __

      For the integrated analysis, there is a focus on DMRs in promoters with very little analysis on other regions. The paragraph starting on line 317 describes some analysis on enhancers but is very brief, doesn't include information on how many/which DMRs were included, making it hard to interpret the impact of the 147 DMRs and 93 genes identified - is this nearly all DMRs and genes analysed or very few? A comparison to the promoter analysis would be of interest. Especially as the targeted region followed up with lovely functional assessment in the last sections is a gene body DMR, not a promoter DMR.

      __Response: __

      We thank the reviewer for pointing out the importance of changes in enhancers. We agree that extending the enhancer analysis is very interesting. However, assigning enhancers to their target genes with high confidence is a complex and challenging task. Enhancers are often located far from the gene they regulate, sometimes spanning hundreds of kilobases. They can interact with their target genes through three-dimensional chromatin loops, potentially bypassing nearby genes to activate more distant ones, making it difficult to confidently link specific enhancers to their target genes. Furthermore, enhancers can operate in a highly context-dependent manner. The same enhancer can regulate different genes depending on the cell type, developmental stage, or environmental signals. Another challenge is that enhancers often work in clusters or "enhancer landscapes," where multiple enhancers contribute to the regulation of a single gene. Disentangling the contribution of individual enhancers within such clusters and determining which enhancer is active under specific conditions remains an ongoing hurdle in the field, especially since the AT2-specific chromatin profiles of enhancer marks are not yet available.

      One approach we tried to account for more distal regulatory regions was to assign DMRs to the nearest gene with a maximum distance of up to 100 kb using GREAT (Genomic Regions Enrichment of Annotations Tool) and simultaneously perform gene enrichment analysis of the associated genes. The old Figure S1C (now S1D) shows the top 10 enriched terms of either hyper- or hypomethylated DMRs, and Table 4 shows the full list of enriched terms. However, in this analysis, we did not integrate the results of the RNA-seq analysis. To demonstrate that methylation can be correlated with gene expression in this analysis, we then took a closer look at the WNT/β-catenin pathway, which contains 147 DMRs associated with 93 genes from the respective pathway (old Figure S3D, now S3G). Here, we showed that distal DMRs up to 100 kb away from the TSS show a high correlation with gene expression. We are including the two figures below for convenience:

      *Left panels, functional annotation of genes located next to hypermethylated (top) and hypomethylated (bottom) DMRs using GREAT. Hits were sorted according to the binomial adjusted p-value and the top 10 hits are shown. The adjusted p-value is indicated by the color code and the number of DMR-associated genes is indicated by the node size. Right panel, scatter plot showing distal DMR-DEG pairs associated with Wnt-signaling. Pairs were extracted from GREAT analysis (hypermethylated, DMR-DEG distance [...]*

      Following the reviewer's suggestion, we have now extended the enhancer analysis using the GeneHancer database, the most comprehensive, integrated resource of enhancer/promoter-gene associations. We used GeneHancer version 5.14, which annotates 392,372 regulatory genomic elements (GeneHancer elements) on the hg19 reference genome. Of the 25,028 DMRs, 18,289 DMRs (73% of all DMRs) coincided with at least one GeneHancer element, resulting in 19,661 DMR-GeneHancer associations. Next, we extracted the GeneHancer elements associated with protein-coding or long non-coding RNA genes, which left us with 2,144 DMR-GeneHancer associations. Next, we used only high-scoring gene-GeneHancer associations ("Elite"), leaving 1,485 DMR-GeneHancer associations. Of those, we selected the GeneHancer elements linked to genes differentially expressed in our RNA-seq analysis, resulting in a final table of 376 DMR-GeneHancer associations (Table 9 DMR_DEG_GeneHancer, Tab 2). Similar to the promoter-proximal analysis, we analysed the correlation of expression and methylation changes of the DMR-GeneHancer associations, demonstrating a high number of negatively and positively correlated events (Fig.S3D). Finally, we performed the gene enrichment analysis for positively and negatively correlating genes. We detected significant GO term enrichments only for negatively correlating genes (Fig.S3E and Table 10_Enrichment_results, Tab2).

      CHANGE IN THE MANUSCRIPT

      To harness the full resolution of our whole-genome DNA methylation data, we extended the analysis beyond promoter-proximal regions and assessed how epigenetic changes in distal regulatory regions (enhancers) may relate to transcriptional differences in COPD. As the assignment of enhancer elements to the corresponding genes is challenging, we tried two different approaches. First, we used the GeneHancer database (72) to link DMRs to regulatory genomic elements (GeneHancer elements). Of the 25,028 DMRs, 18,289 DMRs (73%) coincided with at least one GeneHancer element. Of those, 2,144 DMR-GeneHancer associations were linked either to protein-coding or lncRNA genes. Next, we filtered for high-scoring gene-GeneHancer associations ("Elite"), leaving 1,485 DMR-GeneHancer Elite associations. Of those, we selected the GeneHancer elements linked to genes differentially expressed in our RNA-seq analysis, resulting in 376 DMR-GeneHancer associations (Table 9). Similar to the promoter-proximal analysis, we assessed the correlation of expression and methylation changes of the DMR-GeneHancer associations, demonstrating a high proportion of negatively and positively correlated events (Fig. S3E). Finally, we performed gene enrichment analysis for positively and negatively correlated genes. We detected significant GO term enrichments for negatively correlating genes only (Fig. S3F and Table 10), with the most pronounced term "regulation of tumor necrosis factor". In an alternative approach, we linked proximal and distal (within 100 kb from TSS) DMRs to the nearest gene using GREAT (57) (Fig S1C, Table 4) and calculated Spearman correlation between DMRs and associated DEGs. 147 DMRs were associated with high correlation rates with 93 genes from the WNT/β-catenin pathway (Fig. S3G), suggesting that DNA methylation may also drive the expression of genes of the WNT/β-catenin family.

      Figure S3E and F: E. Spearman correlation between gene expression and DMR methylation of DMRs assigned to gene regulatory elements using the GeneHancer database. F. GO-Term over-representation analysis of DEGs negatively correlated to DMRs in gene regulatory elements. The adjusted p-value is indicated by the color code and the percentage number of associated DEGs is indicated by the node size.

      (Methods) For enhancer analysis, the GeneHancer database version 5.14, which annotates 392,372 regulatory genomic elements (GeneHancer elements) on the hg19 reference genome, was used (72). Of the 25,028 DMRs, 18,289 DMRs coincided with at least one GeneHancer element, resulting in 19,661 DMR-GeneHancer associations. Next, the GeneHancer elements were filtered for association with protein-coding or long non-coding RNA genes and for high-scoring gene-GeneHancer associations ("Elite"), leaving 1,485 DMR-GeneHancer associations. Of those, the GeneHancer elements linked to differentially expressed genes in COPD were selected, resulting in a final table of 376 DMR-GeneHancer associations. Similar to the promoter-proximal analysis, the Spearman correlation of expression and methylation changes of the DMR-GeneHancer associations was assessed. GO gene enrichment analysis for positively and negatively correlating genes was done using Metascape (111).
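
      The filtering steps reported in this response can be read as a funnel. The following is a minimal sketch that only tallies the share retained at each step; the counts are taken from the response, while the step labels and the function name `retained_fractions` are our own shorthand and not part of the authors' pipeline.

```python
# Counts reported in the response; the step labels are our own shorthand.
funnel = [
    ("DMR-GeneHancer associations", 19_661),
    ("linked to protein-coding or lncRNA genes", 2_144),
    ("Elite gene-GeneHancer associations", 1_485),
    ("linked to differentially expressed genes", 376),
]


def retained_fractions(steps):
    """Return (label, count, fraction of the previous step) for each step."""
    rows = []
    previous = None
    for label, count in steps:
        fraction = 1.0 if previous is None else count / previous
        rows.append((label, count, fraction))
        previous = count
    return rows


# Share of all DMRs overlapping at least one GeneHancer element.
dmr_overlap = 18_289 / 25_028

if __name__ == "__main__":
    print(f"DMRs overlapping a GeneHancer element: {dmr_overlap:.1%}")
    for label, count, fraction in retained_fractions(funnel):
        print(f"{count:>6}  {fraction:6.1%}  {label}")
```

      Reading the fractions side by side makes it easier to see which filter removes most associations, which is relevant when comparing the yield with the promoter-proximal analysis.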

      A comparison to the promoter analysis would be of interest.

      Response:

      We detected more highly correlated (|correlation coefficient| > 0.5) DMR-DEG associations using our simple promoter proximal linkage (n=643) in comparison with the GeneHancer approach comprising annotated enhancer elements (n=327/2,144). Gene enrichment results pointed to the interferon pathway, which we could confirm using epigenetic editing. This pathway was not present in the GeneHancer analysis, indicating that regulation of the IFN pathway may be controlled by proximal elements.
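
      For readers less familiar with the criterion, the threshold on the absolute Spearman coefficient can be sketched as follows. This is an illustrative implementation under our own assumptions: the per-donor values are hypothetical, and the function names are not taken from the authors' code. The sketch does not handle constant inputs, for which the coefficient is undefined.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def is_highly_correlated(rho, threshold=0.5):
    """Apply the |correlation coefficient| > 0.5 criterion."""
    return abs(rho) > threshold


# Hypothetical per-donor values for one DMR-DEG pair (not from the study).
methylation = [0.82, 0.75, 0.64, 0.51, 0.40, 0.33]
expression = [2.1, 2.6, 2.4, 3.8, 4.5, 5.0]
rho = spearman(methylation, expression)
print(f"rho = {rho:.3f}, highly correlated: {is_highly_correlated(rho)}")
```

      Because the criterion uses the absolute value, both negatively and positively correlated pairs pass the same threshold; the sign is then used to split them.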

      CHANGE IN THE MANUSCRIPT:

      Overall, we detected a higher number of correlated DMR-DEG associations using our simple promoter-proximal linkage compared to the GeneHancer approach. Assigning enhancers to their target genes with high confidence is a complex and challenging task. Enhancers are often located far from the genes they regulate and can interact with their target genes through three-dimensional chromatin loops. Furthermore, enhancers can operate in a highly context-dependent manner, with the same enhancer regulating different genes depending on the cell type, developmental stage, or environmental signals. Determining which enhancer is active under specific conditions remains a hurdle in the field, especially since the AT2-specific chromatin profiles of enhancer marks are not yet available.

      Especially as the targeted region followed up with lovely functional assessment in the last sections is a gene body DMR, not a promoter DMR.

      Response:

      We thank the reviewer for bringing up that point. To clarify, we defined the promoter regions for the analysis as regions located ±6 kb (upstream and downstream) from the transcriptional start site (TSS). Since the term "promoter" often refers to the region upstream of the transcriptional start site, its use may have been misleading. For clarity, we changed the text correspondingly to __promoter proximal methylation__ and explained in the methods how the regions for analysis were defined.
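
      The window definition can be sketched as a simple interval test. The following assumes that a DMR counts as promoter-proximal when it overlaps the window of 6 kb on either side of the TSS; the response does not state whether overlap or, for example, the DMR midpoint was used, so the overlap rule, the function name and the coordinates are our assumptions.

```python
WINDOW_BP = 6_000  # 6 kb upstream and downstream of the TSS, as stated in the response


def is_promoter_proximal(dmr_start, dmr_end, tss, window=WINDOW_BP):
    """True if [dmr_start, dmr_end] overlaps [tss - window, tss + window].

    Assumption: any overlap counts. The window is symmetric, so the gene's
    strand does not change the result.
    """
    return dmr_start <= tss + window and dmr_end >= tss - window


# Hypothetical coordinates on one chromosome (not from the study).
tss = 100_000
print(is_promoter_proximal(104_500, 105_200, tss))  # gene-body DMR about 4.5 kb from the TSS
print(is_promoter_proximal(107_000, 107_500, tss))  # DMR about 7 kb from the TSS
```

      Under this definition a gene-body DMR a few kilobases downstream of the TSS is still classified as promoter-proximal, which is the situation the reviewer raised.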

      __CHANGE IN THE MANUSCRIPT:__

      "DMR association per gene promoter" was changed to "Gene promoter proximal DMRs"

      Fig. S3B: "DMR in promoter" was changed to "promoter proximal DMR(s)"

      "by DNA methylation changes in promoters" was changed to "by DNA methylation changes in promoter proximity"

      "regulated by promoter methylation" was changed to "regulated by promoter-proximal methylation"

      "analysis of the promoter DMRs" was changed to "analysis of the promoter-proximal DMRs"

      "between promoter methylation" was changed to "between promoter proximal methylation"

      Cytoscape was used to analyse negatively or positively correlated DMR DEG pairs. ClueGO (v2.5.6) analysis was conducted using all DEG associated with a promoter proximal DMR (+/- 6 kb from TSS) and the Spearman correlation coefficient 0.5 (112).

      • Lines 299-301 - I'm not sure the graph in Fig S3A supports the conclusion that there was a preferential negative relationship between DNAm and gene expression. It looks like there are a substantial number of cases where a positive relationship is observed, and this needs to be acknowledged.

      Response:

      In this part, we refer to Fig S3C. In the left panel, downregulated genes clearly show higher counts for the hypermethylated DMRs, whereas the hypomethylated DMRs are enriched at upregulated genes (right panel), indicating a preference for negative correlation: lower methylation, higher gene expression. If there were no preference, we would expect a 50:50 ratio of hypo- and hypermethylated DMRs, and we observed a 77:23 ratio. Nevertheless, we agree that there is a substantial number of cases (n=151) with a high positive correlation, which we now highlight in the text. For clarity, we also modified the figure legend to indicate that a stacked histogram is represented in the panel.
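
      The comparison with a 50:50 expectation can be made concrete with a short calculation. The sketch below uses the counts quoted elsewhere in this response (492 negatively and 151 positively correlated pairs) and an exact two-sided binomial test. The test is our illustration, not an analysis the authors report, and it assumes the pairs are independent, which may not hold when several DMRs map to the same gene.

```python
from math import comb


def binomial_two_sided_p(successes, n):
    """Exact two-sided binomial test against p = 0.5 (symmetric null)."""
    extreme = max(successes, n - successes)
    # Upper tail P(X >= extreme), doubled because the null is symmetric.
    tail = sum(comb(n, k) for k in range(extreme, n + 1))
    return min(1.0, 2 * tail / 2 ** n)


negative, positive = 492, 151  # counts quoted in the response
total = negative + positive

print(f"negative: {negative / total:.1%}, positive: {positive / total:.1%}")
print(f"two-sided p against 50:50: {binomial_two_sided_p(negative, total):.3g}")
```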

      __CHANGE IN THE MANUSCRIPT:__

      L303: Interestingly, 23.5% of the identified DMR DEG pairs (n=151) showed a positive correlation between gene expression and DNA methylation.

      *Figure legend in Fig. S3C was changed to: C Stacked histogram showing location of hyper- and hypomethylated DMRs relative to the TSS of DEGs in downregulated (left) and upregulated (right) genes. *

      • Line 307 - what are the "analysed DEGs"? Are they the methylation associated genes?

      Response:

      Those are the DEGs we identified in RNA-seq analysis. To clarify, we changed the text to "identified DEGs".

      __CHANGE IN THE MANUSCRIPT:__

      • "analysed DEGs" was changed to "identified DEGs"

      • Line 307-309 - "Among the analyzed DEGs, 76.5% (492) displayed a negative correlation (16.8% of the total DEGs), indicating a possible direct regulation by DNA methylation, while 23.5% (151) showed a positive correlation between gene expression and DNA methylation" - are the authors suggesting the positive correlation doesn't indicate direct regulation?

      __Response: __

      Thank you for highlighting this point. We did not intend to suggest that negative correlation indicates direct regulation, while positive correlation suggests a lack thereof. To clarify that point, we have reformulated this sentence.

      __CHANGE IN THE MANUSCRIPT:__

      Among the identified DEGs, 76.5% (n=492) displayed a negative correlation (16.8% of the total DEGs), consistent with a repressive role of promoter DNA methylation. Interestingly, 23.5% of the identified DEGs (n=151) showed a positive correlation between gene expression and DNA methylation.

      • Line 313 - why did the authors focus on only negatively correlated genes to identify their top dysregulated pathway of IFN signalling? Why not do pathway analysis on the DNAm associated genes separately to identify DNAm associated pathways?

      Response:

      We also performed a pathway enrichment analysis using the positively correlated genes but did not identify any significantly enriched pathways, processes or terms. When we examined the top hit of the gene set enrichment analysis, the interferon signaling pathway, we observed only negatively correlated DMR-gene associations (Fig. 5B). Therefore, we decided to use only the negatively correlated DMRs, as using all correlated genes would give a higher background and dilute our results.

      CHANGE IN THE MANUSCRIPT:

      Cytoscape was used to analyse negatively or positively correlated DMR DEG pairs. ClueGO (v2.5.6) analysis was conducted using all DEG associated with a promoter proximal DMR (+/- 6 kb from TSS) and the Spearman correlation coefficient 0.5 (113).

      • A comparison of the gene expression data with previous data in AT2 cell/single cell data would strengthen the gene expression section.

      __Response: __

      We compared our gene expression signatures with the study of Fujino et al., who profiled sorted AT2 cells (EpCAMhighPDPNlow) from COPD/controls using expression arrays (PMID: 23117565). Consistent with our study, the authors also observed the upregulation of interferon signalling (among other pathways) in COPD AT2s. However, no raw data was available in the published manuscript for a more in-depth analysis.

      Several recent scRNA-seq studies identified transcriptional signatures of COPD and control cells (e.g., PMIDs: 36108172, 35078977, 36796082, 39147413). However, most studies did not match the smoking status of the control and COPD donors and looked at the whole lung tissue, with limited power to detect gene expression changes in distal alveolar cells. It is difficult to directly compare our data to the gene expression data from non-smokers vs COPD patients, as cigarette smoking profoundly remodels the epigenome and transcriptional signatures of cells. In addition, differences in technologies and depth of sequencing make such comparisons challenging. However, one study (PMID: 36108172) performed scRNA-seq analysis on 3 non-smokers, 4 ex-smokers and 7 COPD ex-smoker lungs. Despite relatively limited coverage of epithelial cells in the dataset ([...]

      We also compared the main AT2 IFN signature identified in the integration of our DNA methylation in promoter-proximal regions and RNA-seq with a recent study (published after the submission of our manuscript, PMID: 39147413) that profiled EpCAMpos cells from COPD and control lungs (non-smokers) using scRNA-seq. We observed an upregulation of our IFN signature genes in AT2 in COPD (specifically in AT2-c and rbAT2 subsets), suggesting that similar signatures were observed in this dataset as well. However, ex-smokers were not included in this study, making direct comparisons difficult. We have now included the panels shown below as __Figure S4E and S4F__:

      Figure S4E and F: E. Expression values for the indicated genes of the IFN pathway from an external scRNA-seq dataset of AT2 cells from COPD patients and healthy controls (74). Y-axis shows log-normalized gene expression levels. F. Combined gene set score of the genes shown in (E) in different subsets of AT2 cells from (74). The IFN signature genes were identified in our integrative analysis of TWGBS and RNA-seq in sorted AT2 cells.

      CHANGES IN THE MANUSCRIPT:

      However, 5-AZA is a global demethylating agent, and the observed effects may not be direct. To validate the epigenetic regulation of central AT2 pathways further, we took advantage of locus-specific epigenetic editing technology (73). We focused on the IFN pathway because it was the most significantly enriched Gene Ontology (GO) term in our integrative analysis of TWGBS and RNA-seq data. Several IFN pathway members had associated hypomethylated DMRs within promoter-proximal regions and concomitant increased gene expression (Fig. 4C and Fig.S2C). Additionally, we confirmed the elevated expression of IFN-related genes with associated DMRs identified in our study in AT2 cells and AT2 cell subclusters from a recently published scRNA-seq cohort (74) (Fig. S4E-F).

      (Methods) Validation of IFN gene upregulation in a published scRNA-seq dataset

      scRNA-seq data from (74), generously provided by M. Köningshoff, were processed using the default Seurat workflow (117). Expression of IFN-related genes was extracted and plotted as log-normalised gene expression levels in AT2 cells from control and COPD donors. Seurat's AddModuleScore() function was used to compute a gene set score for a custom IFN program using the genes listed in Fig. S4E and to analyse the IFN gene set scores in AT2 cell subclusters identified in (74). Briefly, average gene expression scores were computed for the gene set of interest, and the expression of control features (randomly selected) was subtracted as described in (118).
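
      The scoring idea described above (average expression of the gene set minus the average expression of randomly selected control features) can be illustrated with a minimal Python sketch. This is not Seurat's implementation: AddModuleScore() additionally bins genes by average expression and draws controls from matching bins, which is omitted here. The function name, the dictionary input format and the toy values are hypothetical and serve illustration only.

```python
import random
from statistics import mean


def gene_set_score(expr, gene_set, n_controls=100, seed=0):
    """Per-cell score: mean expression of the gene set minus the mean
    expression of randomly selected control genes.

    expr maps gene name -> list of log-normalised values (one per cell).
    Simplification: controls are drawn uniformly at random, without the
    expression-matched binning that Seurat applies.
    """
    rng = random.Random(seed)
    targets = [g for g in gene_set if g in expr]
    if not targets:
        raise ValueError("none of the gene set members are present in expr")
    target_set = set(targets)
    pool = sorted(g for g in expr if g not in target_set)
    controls = rng.sample(pool, min(n_controls, len(pool)))
    n_cells = len(expr[targets[0]])
    scores = []
    for cell in range(n_cells):
        target_mean = mean(expr[g][cell] for g in targets)
        control_mean = mean(expr[g][cell] for g in controls) if controls else 0.0
        scores.append(target_mean - control_mean)
    return scores


# Toy input: two gene set members and two background genes in three cells.
toy_expr = {
    "IRF9": [2.0, 1.0, 0.0],
    "STAT2": [2.0, 1.0, 0.0],
    "BG1": [0.5, 0.5, 0.5],
    "BG2": [0.5, 0.5, 0.5],
}
toy_scores = gene_set_score(toy_expr, ["IRF9", "STAT2"], n_controls=2)
```

      In practice the control set would be much larger than in this toy input, so that the subtracted background is stable across cells.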

      Fig. S4 E and F. E. Expression values for the indicated genes of the IFN pathway from an external scRNA-seq dataset of AT2 cells from COPD patients and healthy controls (74). Y-axis shows log-normalized gene expression levels. F. Combined gene set score of the genes shown in (E) in different subsets of AT2 cells from (74). The IFN signature genes were identified in our integrative analysis of TWGBS and RNA-seq in sorted AT2 cells.

      • The paragraph starting on line 173 feels a little redundant when we know there is RNA available to test if the differential DNAm links to altered gene expression - this selection of example regions/genes would be better placed after the gene expression has been reported, at which point you could say whether the linked genes displayed altered transcription.

      Response:

      The current structure (with DNA methylation, followed by RNA-seq and integration) is intentional and serves several important purposes. As this is the first genome-wide high-resolution COPD DNA methylation study of AT2, we aimed to describe the methylation landscape independently of gene expression (noting the limitation of current understanding of how DNA methylation regulates expression). This early focus on DMRs lays clear groundwork by highlighting potential regulatory elements and pathways that could be disrupted, independent of or even before corroborative transcriptional data. Additionally, positioning these examples early in the narrative helps to frame subsequent gene expression analyses. Once RNA data are introduced later, the reader can directly compare the methylation patterns with transcriptional outcomes, thereby enhancing the overall story. In other words, by first showcasing disease-relevant methylation changes, we underscore a hypothesis that these epigenetic modifications are functionally meaningful. The later integration of gene expression data then serves as a confirmatory or complementary layer, rather than the sole basis for inferring biological significance. This is important as we still do not fully understand the function of DNA methylation outside promoters, and its role is also important for splicing, 3D genome organisation, non-coding RNA regulation, enhancer regulation, etc.

      • Similarly, the TF enrichment analysis is great but maybe would have added value to be done on DNA regions later shown to be linked to differential expression - was there different enrichment at DNA regions that are vs are not associated with altered expression? And could you test in vitro whether changing methylation of DNA (maybe a blunt tool like 5-aza would be ok) alters TF binding (cut+run/ChIP?). Furthermore, it would be interesting to understand the TF sensitivity analysis within the context of positive versus negative DNA methylation:gene expression correlations.

      Response:

      As suggested by the reviewer, we now performed the TF enrichment analysis using the DMRs with a high correlation (|correlation coefficient|>0.5) between methylation and expression (Figure S3D) and expanded the method section to include TF analysis. We observed ETS domain motifs enriched at hypomethylated regions. They prefer unmethylated DNA (MethylMinus) and are therefore expected to bind with higher affinity to the respective DMRs in COPD. We agree with the reviewer that further verifying altered TF binding using cut&run or ChIP assays would be very interesting, but it is out of the scope of this manuscript. Such analysis is technically very challenging to perform with low numbers of primary AT2 cells and will be the focus of our follow-up mechanistic studies.
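
      The selection step described above (keeping DMRs whose methylation correlates with expression at |correlation coefficient| > 0.5) can be sketched as follows. This is a minimal illustration under stated assumptions, not our analysis code: it assumes Spearman correlation with average ranks for ties, and the function names and the input layout (one methylation vector and one expression vector per DMR-gene pair, ordered by donor) are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one vector is constant
    return cov / math.sqrt(vx * vy)


def highly_correlated(pairs, cutoff=0.5):
    """Return the names of DMR-gene pairs with |rho| above the cutoff."""
    return {
        name
        for name, (methylation, expression) in pairs.items()
        if abs(spearman(methylation, expression)) > cutoff
    }
```

      Pairs with an undefined coefficient (for example, a gene with constant expression) are dropped by the comparison, because a NaN never exceeds the cutoff.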

      CHANGE IN THE MANUSCRIPT:

      Additionally, motif analysis of DMRs that were highly correlated (|Spearman correlation coefficient| > 0.5) with DEGs revealed a prominent enrichment of the cognate motif for ETS family transcription factors, such as ELF5, SPIB, ELF1 and ELF2 at hypomethylated DMRs (Fig. S3D). Interestingly, SPIB was shown to facilitate the recruitment of IRF7, activating interferon signaling (71), and our WGBS data uncovers SPIB motifs at hypomethylated DMRs, which aligns with its binding preferences at unmethylated DNA (methyl minus, Fig. S3D).

      Figure S3D: Enrichment of methylation-sensitive binding motifs at hypo- (right) and hypermethylated (left) DMRs, using DMRs with a high correlation (|Spearman correlation coefficient| > 0.5) between methylation and gene expression. Methylation-sensitive motifs were derived from Yin et al (64). Transcription factors, whose binding affinity is impaired upon methylation of their DNA binding motif, are shown in red (Methyl Minus), and transcription factors, whose binding affinity upon CpG methylation is increased, are shown in blue (Methyl Plus).

      (Methods) To obtain information about methylation-dependent binding for transcription factor motifs which are enriched at DMRs, the results of a recent SELEX study (64) were integrated into the analysis. They categorised transcription factors based on the binding affinity of their corresponding DNA motif to methylated or unmethylated motifs. Those whose affinity was impaired by methylation were categorised as MethylMinus, while those whose affinity increased were categorised as MethylPlus. A motif database of 1,787 binding motifs with associated methylation dependency was constructed. The log odds detection threshold was calculated for the HOMER motif search as follows. Bases with a probability > 0.7 got a score of log(base probability/0.25); otherwise, the score was set to 0. The final threshold was calculated as the sum of the scores of all bases in the motif. Motif enrichment analysis was carried out against a sampled background of 50,000 random regions with matching GC content using the findMotifsGenome.pl script of the HOMER software suite, omitting CG correction and setting the generated SELEX motifs as the motif database.
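
      The threshold rule stated above can be written out as a short Python sketch. It assumes a natural logarithm and a motif given as a list of per-position probabilities in A, C, G, T order; both are assumptions made for illustration, and the function name is hypothetical. Because the four probabilities at a position sum to 1, at most one base per position can exceed 0.7, so checking the largest probability is equivalent to the rule as stated.

```python
import math


def detection_threshold(motif, min_prob=0.7, background=0.25):
    """Sum of log(p / background) over positions whose most likely base
    has probability above min_prob; other positions contribute 0.

    motif is a list of positions, each a list of four base probabilities.
    """
    total = 0.0
    for position in motif:
        top = max(position)
        if top > min_prob:
            total += math.log(top / background)
    return total


# Toy motif: one fully conserved position, one uninformative position,
# and one position dominated by C.
toy_motif = [
    [1.0, 0.0, 0.0, 0.0],
    [0.25, 0.25, 0.25, 0.25],
    [0.1, 0.8, 0.05, 0.05],
]
toy_threshold = detection_threshold(toy_motif)
```

      For the toy motif the threshold is log(4) + log(3.2), since only the first and third positions pass the 0.7 cutoff.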

      Methods: • The authors should include more detail of the TWGBS rather than directing the reader to a previous publication. Also DNA concentration post bisulfite conversion would be a useful metric to provide.

      Response:

      Following the suggestion, we have now expanded the details of TWGBS in the methods part of the manuscript. Due to limited space, we did not include a detailed protocol but instead referred to a published step-by-step protocol (55). Of note, we do not measure DNA concentration post-bisulfite conversion but consistently use the starting input of 30 ng of genomic DNA across all samples.

      CHANGE IN THE MANUSCRIPT:

      (Methods): 15 pg of unmethylated lambda phage DNA was spiked in as a control for bisulfite conversion. Tagmentation was performed in TAPS buffer using an in-house purified Tn5 assembled with load adapter oligos (55) at 55 °C for 8 min. Tagmentation was followed by purification using AMPure beads, oligo replacement and gap repair as described (55). Bisulfite treatment was performed using EZ DNA Methylation kit (Zymo) following the manufacturer's protocol.

      The T-WGBS library preparations were performed for all donors in parallel and sequenced in a single batch to minimize batch effects and technical variability.

      • Differential DNA methylation analysis: It is stated that DNA regions had to contain 3 CpG sites but was this within a defined DNA size range?

      Response:

      The maximum distance between individual CpGs within DMR was set to 300 bp. To clarify, we added that information to the methods part.
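
      The DMR criteria in the methods (at least 10% methylation difference, at least 3 CpGs, and at most 300 bp between CpGs) can be expressed as a simple filter. The sketch below is illustrative only and is not the DMR caller used in the study: it assumes that the 300 bp limit applies to neighbouring CpGs within a region and that the methylation difference is given as a fraction; the function and parameter names are hypothetical.

```python
def passes_dmr_filter(cpg_positions, mean_meth_diff,
                      min_diff=0.10, min_cpgs=3, max_gap=300):
    """Check a candidate region against the stated DMR criteria.

    cpg_positions: genomic coordinates (bp) of the CpGs in the region.
    mean_meth_diff: methylation difference between groups, as a fraction.
    """
    positions = sorted(cpg_positions)
    if len(positions) < min_cpgs:
        return False
    if abs(mean_meth_diff) < min_diff:
        return False
    # Assumption: the distance limit is applied to adjacent CpGs.
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return all(gap <= max_gap for gap in gaps)
```

      For example, a region with CpGs at positions 100, 250 and 400 and a 15% methylation difference would pass, whereas the same region with the last CpG at position 700 would fail the distance criterion.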

      CHANGE IN THE MANUSCRIPT:

      "regions with at least 10% methylation difference and containing at least 3 CpGs with a maximum distance of 300 bp between them."

      • Reference genome only provided for RNAseq not TWGBS?

      Response: We used hg19 as the reference genome. The information on the reference genome for DNA methylation analysis was provided in the methods, L574 of the original manuscript: "The reads were aligned to the transformed strands of the hg19 reference genome using BWA MEM".

      • The tables do not appear in the PDF and I struggled to tally to the "Dataset" files provided if that is what they were referring to?

      Response:

      Full tables (uploaded as Datasets in the manuscript central due to their size) were uploaded together with the manuscript files. They are quite large and will not convert to pdf, so they may not have been included in the merged pdf file. We assume that they should be available to the reviewers with the other files and will clarify that with the editorial staff in the resubmission cover letter.

      • For the gene expression analysis, can it be made clearer that a full analysis was done on COPD I samples. It is a little confusing to the reader as this was not done for DNAm so might be assumed the same targeted analysis on only genes found to be differentially expressed between control and COPD II-IV, but that cannot be the case as an overlap of COPD I vs COPD II-IV genes is provided. For this overlap, do genes show the same effect direction?

      Response:

      To clarify, for the RNA-seq analysis, we performed DEG analysis for no-COPD versus COPD II-IV, as well as no-COPD versus COPD I. We then took all differentially expressed genes (presented in the Venn diagram) and plotted them for all samples as a heatmap. To split the genes into groups displaying similar effect directions, we applied a clustering approach and identified 3 main signatures. Cluster 3 primarily comprises genes unique to COPD I samples, which are associated with the adaptive immune system and hemostasis (Fig. 4E). In the other two clusters, we mainly observe a transitioning pattern from control to severe COPD samples, correlating with the FEV1 values of the patients. This has now been clarified in the manuscript.

      • Replication is difficult on these studies as the samples are so difficult to come by. Also limited by sample size for the same reason. It doesn't mean the study is not worth doing and the data are still valuable. However, it may be pertinent to include technical validation of a few regions of interest, acknowledge the limitation (alongside strengths) in the discussion, and perhaps provide actual p value rather than blanket

      Response:

      We thank the reviewer for acknowledging the replication challenges for studies working with sparse human material and hard-to-purify cell populations. Following the reviewer's suggestion, we have now included a strengths and limitations section in the discussion where we summarised the points highlighted by both reviewers.

      Regarding technical validation, we would like to note that the whole genome bisulfite sequencing (WGBS) technology, as well as the tagmentation-based WGBS (T-WGBS), have been validated in the past few years in several publications (e.g., PMID: 24071908) and shown to yield reliable DNA methylation quantification in comparison to other technologies (PMID: 27347756). For us, technical validation using alternative methods (e.g. bisulfite sequencing or pyrosequencing) is difficult as it requires significantly more input DNA than the low-input T-WGBS we have performed and obtaining sufficient amounts of material from primary human AT2 cells (especially from severe COPD) is not possible with the size of tissue we can access. However, while establishing the T-WGBS for this project, we initially validated our approach using Mass Array, a sequencing-independent method. For this, we performed T-WGBS on the commercially available smoker and COPD lung fibroblasts and selected 9 regions with different methylation levels for validation using a Mass Array. We obtained an excellent correlation between both methods, providing technical validation of T-WGBS and our analysis workflow. This validation was published in our earlier manuscript (PMID: 37143403), but we provided the data below for convenience.

      Scatter plots showing correlation of average methylation obtained with T-WGBS and Mass Array from COPD and smoker fibroblasts. Each dot represents one region with varying methylation levels. The blue diagonal represents the linear regression. Shaded areas are confidence intervals of the correlation coefficient at 95%. Correlation coefficients and P values were calculated by the Pearson correlation method.

      To enable further validation and follow-up by the community, we included the full list of DMRs, associated p-values and additional information for DNA methylation analysis (DMR width, n.CpGs, MethylDiff, etc) in Table 3 (Table_3_wgbs_dmr_info.xlsx) and the information about DEGs from RNA-seq in Table 6 (Table_6_RNAseq_DEG_info.xlsx).

      • It isn't clear to me if DNA and RNA are from the same cells? The results say "cells matching those used for T-WGBS" but the methods suggest separate extractions so not the same cells? If they are not the same cells a comment on the implications of this should be included in the discussion for example, potentially some differences in cell type composition, storage time etc.

      Response:

      Lung tissue samples were freshly cryopreserved, and H&E slides derived from exemplary pieces of the tissue were analyzed. Once we had a group of at least 3 samples comprising one non-COPD and 2 COPD samples, we processed them in parallel to limit sorting variation between control and disease samples. The sorted cells were counted, aliquoted and pelleted at 4 °C before flash freezing and storing at -80 °C. The storage time of the cell pellets varied between the donors. RNA and DNA were isolated from cell pellets collected from the same FACS sorting experiment; therefore, we do not expect differences in cell type composition. In addition, RNA and DNA isolation were performed for all sorted pellets in parallel. All library preparations for TWGBS and RNA-seq were performed for all donors in parallel and sequenced in a single batch to minimise batch effects and technical variability. This has now been clarified in the methods part of the manuscript.

      CHANGE IN THE MANUSCRIPT:

      To minimize potential technical bias, samples from no COPD and COPD donors were processed in parallel in groups of 3 (one no COPD and 2 COPD samples).

      RNA and genomic DNA for RNA-seq and TWGBS were isolated from identical aliquots of sorted cell pellets.

      Genomic DNA was extracted from 1-2x10^4 sorted alveolar epithelial cells isolated from cryopreserved lung parenchyma from 11 different donors in parallel using QIAamp Micro Kit

      The TWGBS library preparations were performed for all donors in parallel and sequenced in a single batch to minimize batch effects and technical variability.

      RNA was isolated from flash-frozen pellets of 2x10^4 sorted AT2 cells from 11 different donors in parallel.

      The RNA-seq library preparation for all donors was performed in parallel and all samples were sequenced in a single batch to minimize batch effects and technical variability.

      • Line 193 the authors say "Since DMRs were overrepresented at cis-regulatory sites...." - "cis" needs to be defined. If you link DNAm regions to gene via "closest gene" does this not automatically mean your outputs will be cis? Just needs better definition/explanation.

      Response:

      The term "cis‐regulatory sites" in our manuscript is intended to denote regulatory elements (such as enhancers, promoters, and other nearby control regions) that reside on the same chromosome and close to the genes they regulate. While it's true that linking a DMR to its closest gene captures a cis association, our phrasing emphasises that the DMRs are enriched specifically at these functional regulatory elements (Fig. 2E) rather than being randomly distributed. This usage aligns with established conventions in the field. To avoid any misunderstandings, we have now changed the term to gene regulatory sites.

      CHANGE IN THE MANUSCRIPT:

      We changed "cis-regulatory sites" to "gene regulatory sites".

      Minor comments:

      Line 157: "we identified site-specific differences....". Change to region specific?

      Response:

      This has now been corrected as suggested.

      Line 102-103: needs a reference for the statement "Alterations in DNA methylation patterns have been implicated......"

      Response:

      Following the reviewer's suggestion, we added the relevant references (34-36) to this statement.

      Line 266 - what does "strong dysregulation" mean? Large fold change, very significant?

      Response:

      We removed the word "strong" from this sentence.

      Lines 423-425 - statement needs a reference

      Response:

      Following the reviewer's suggestion, we added the relevant reference to this statement.

      Line 428 - word missing between "epigenetic , we"?

      Response:

      This has now been corrected. The text reads: "Through treatment with a demethylating drug and targeted epigenetic editing, we demonstrated the ability to modulate..."

      Prior studies are well referenced, text and figures are clear and accurate.

      Reviewer #2 (Significance (Required)):

      This study has several strengths:

      1) Sample collection and characterisation. AT2 cells are incredibly hard to come by and the authors should be commended for generating the samples. However, proximity to cancer is always a potential issue, especially in epigenetic studies. Is it feasible to include any analysis to show the samples derived from those with cancer don't drive the changes observed? Even a high level PCA or an edit of fig 2A with non-cancer in a different colour in supplemental - looks like there is one outlier, is that a non-cancer? Or a correlation of change in beta between control and cancer/COPD and control and non-cancer:COPD (for want of a better phrase!). Just an indicator that the non-cancer COPD samples are not driving differences.

      Response:

      We thank the reviewer for highlighting the value of generating data from hard-to-work-with AT2 populations and bringing up the important point of cancer proximity, which we considered very carefully when designing our study. To match our samples across the cohort, all the no-COPD, COPD I, and two of the COPD II-IV distal lung samples were obtained from cancer resections. In addition to other characteristics, like age, BMI and smoking status, we also matched the donors by cancer type (all profiled donors had squamous cell carcinoma). We collected lung tissue as far away from the carcinoma as possible and sent representative pieces for histological analysis by an experienced lung pathologist to confirm the absence of visible tumours. In addition, to ensure that our data represents COPD-relevant signatures, we intentionally included samples from three COPD donors undergoing lung resections (without a cancer background) in the profiling.

      Following the reviewer's suggestion, to investigate the potential impact of non-cancer samples on driving the observed differences, we carefully checked the PCAs for both DNA methylation and RNA-seq. We could not identify a clear separation of no-cancer COPD samples from the cancer COPD samples (or other cancer samples) in any examined PCs, indicating no confounding effect of cancer samples. We observed that one sample contributing to PC2 is a non-cancer sample, but this was a rather sample-specific effect, as the other two non-cancer samples clustered together with the other severe COPD samples with a cancer background. Notably, in our DNA methylation data, we do not observe typical features of cancer methylomes, like global loss of DNA methylation or aberrant methylation of CpG islands (e.g., in tumour suppressor genes) (see Fig. 2A), further suggesting that we do not "pick up" confounding cancer signatures in our data.

      Following the comments from both reviewers, to clarify that point, we added the information about cancer and non-cancer samples to the PCA figures for DNA methylation (new Fig. 2B) and RNA-seq (new Fig. 3A) data in the revised manuscript, as shown below

      CHANGE IN THE MANUSCRIPT:

      COPD samples from donors with a cancer background clustered together with the COPD samples from lung resections, confirming that we detected COPD-relevant signatures (Fig. 2B).

      Fig. 2B. Principal component analysis (PCA) of methylation levels at CpG sites with > 4-fold coverage in all samples. COPD I and COPD II-IV samples are represented in light and dark green triangles, respectively, and no COPD samples as blue circles. COPD samples without a cancer background are displayed with a black contour. The percentage indicates the proportion of variance explained by each component.

      Unsupervised principal component analysis (PCA) on the top 500 variable genes revealed a clear influence of the COPD phenotype in separating no COPD and COPD II-IV samples, as previously observed with the DNA methylation analysis, irrespective of the cancer background of COPD samples (Fig.3A, Fig. S2B).

      Principal component analysis (PCA) of 500 most variable genes in RNA-seq analysis. PCA 1 and 2 are shown in Fig. 3A, PCA 1 and 4 in Fig. S2B. COPD I and COPD II-IV samples are represented in light and dark green triangles, respectively, and no COPD samples as blue circles. COPD samples without a cancer background are displayed with a black contour. The percentage indicates the proportion of variance explained by each component.

      2) This is the first time DNAm has been profiled in AT2 cells. It is incredibly difficult, valuable and novel data that will increase the field's capability technically, their understanding of functional mechanisms and potential translation considerably. Its audience will be primarily translational respiratory however the fundamental science aspect of gene expression regulation by DNA methylation will have wider reach across developmental and disease science.

      Response:

      We thank the reviewer for recognising the uniqueness and novelty of our study and highlighting the value and potential impact of our datasets for the lung field.

      3) the functional analysis using targeted CRISPR-Cas9 is very well done and adds impact.

      Response:

      We thank the reviewer for recognising the strengths and added value of the functional analysis using epigenetic editing.

      Potential weaknesses/areas for development

      I feel the main weakness is in the section integrating DNA methylation and gene expression. The rationale for a focus on various aspects, for example inversely related DNAm/gene expression pairs, the IFN pathway and IRF9, are not clear. Also further understanding of the differences between DNAm associated genes and non-DNAm associated genes could be expanded, at the pathway level, TF regulation level, effect size level (are DNAm associated changes to gene expression larger, enriched for earlier differential expression)

      Response:

      Our rationale for focusing on the inversely related DNAm/gene expression pairs in promoter-proximal regions is purely data-driven, as they represent the biggest group in our data (Fig. 4A-B). Among those negatively correlated genes, we observed the strongest enrichment for the IFN pathway (Fig. C), making it an obvious, data-driven target for further studies. The negative correlation of expression and methylation for IFN pathway genes could be validated in 5-AZA assays in A549 cells (Fig. 5A). Next, we performed an interaction network analysis showing IRF9 and STAT2 as master regulators (Fig. 5B) of the negatively correlated IFN genes. As IRF9 itself displayed a negative correlation between DNA methylation and expression (Fig. 5C), we used the associated DMR for further epigenetic editing (Fig. 5D-E). We performed the additional requested analyses of the enhancer-associated changes and genes, as described above. We fully agree with the reviewer that our data sets are a great resource and can be further used to elaborate on other relationships of DNA methylation and RNA expression or other pathways, but this is out of the scope of this study. To enable further studies by the research community, we provide all necessary information about DMRs and DEGs in the associated supplementary tables and the raw data through the EGA, as well as the CRISPRa editing assay.

      The authors could comment on potential masking of differences between 5hmC and mC and the implications it may have

      Response:

      We thank the reviewer for bringing up this important point. Indeed, bisulfite sequencing cannot differentiate between methylated and hydroxymethylated cytosines; hence, some of the methylated sites may be hydroxymethylated. However, the overall levels of hydroxymethylation in differentiated adult tissues are very low (except for the brain), orders of magnitude lower compared to DNA methylation. Following the reviewer's suggestion, we have added a sentence in the limitation section of the discussion to clarify that point.

      CHANGE IN THE MANUSCRIPT:

      In addition, while WGBS provides unprecedented resolution and high coverage of the DNA methylation sites across the genome, it does not allow distinguishing 5-methylcytosine from 5-hydroxymethylcytosine. Therefore, we cannot exclude that some methylated sites we detected are 5-hydroxymethylated. However, the 5-hydroxymethylcytosine is present at very low levels in the lung tissue (97).

      Furthermore, while the rationale for looking at DMRs is clear, especially given the sample number, I am interested to understand what proportion of the assayed CpGs "fit" within the cut off stipulations of the DMR analysis - that is, are there potentially COPD effects at sparse CpG regions/individual CpG sites that are not being identified. A comment on this would be useful and seems the strength of profiling genome wide. I'm happy genome wide is beneficial it just feels a little circular that the authors have chosen whole genome to avoid the bias of the Illumina array and a focus on promoters, but have primarily reported promoter DNAm. This caught my attention again in the discussion where the authors state that cis-regulatory regions were also identified in their fibroblast data .....is this finding a factor of the analysis performed? (also a comparison of regions identified in AT2 cells versus fibroblasts would be really interesting for a future paper)

      Response:

      We decided to focus our analysis on regions rather than individual CpG sites when looking at differential methylation, as DNA methylation is spatially correlated, and methylation changes in larger regions are more likely to have a biological function. Extending the analysis to single CpG sites would require a higher number of samples for a reliable analysis compared to the DMR analysis (as mentioned by the reviewer).

      Of note, we addressed the platform comparison between Illumina array technology and WGBS in our previous fibroblast study (PMID: 37143403), where we compared our WGBS data with the published 450k array data of COPD parenchymal fibroblasts (Clifford et al., 2018). We observed only a marginal overlap between the CpGs from our DMRs and the CpGs probes available on the array (which was due to the differences in technologies used and the limited coverage of the 450K array in comparison to our genome-wide approach, in which we covered 18 million CpGs). Out of the 6279 DMRs identified in our fibroblast study, only 1509 DMRs overlapped with at least one CpG probe on the 450K array, and after removing low-quality CpGs from the array data, only 1419 DMRs were left. This comparison highlighted the increased resolution of the WGBS compared to Illumina arrays.

      The reasons why we focused on promoter-proximal DMRs are the following: 1) the assignment of the enhancer elements in AT2 to the corresponding gene is still too inaccurate in the absence of AT2-specific enhancer chromatin maps; 2) regulation at enhancers by DNA methylation might be more complex and might change (increase or attenuate) binding affinities of certain transcription factors (Fig. 2H), which might lead to gene expression changes; or 3) methylation changes might be an indirect effect of differential TF binding (PMID: 22170606). However, we agree with the reviewer that despite these limitations, expanding the analysis beyond promoters adds value to the manuscript; hence, as described above, we expanded the analysis of non-promoter regions, including enhancers, in the revised manuscript.

      We thank the reviewer for the suggestion to compare the regions identified in AT2 cells and fibroblasts in a future paper.

      My expertise: Respiratory, cell biology, epigenetics.

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

      Learn more at Review Commons


      Referee #2

      Evidence, reproducibility and clarity

      Summary:

      This study aims to understand the molecular mechanisms underlying dysfunction in AT2 cells in COPD, by profiling bulk genome wide DNA methylation using Tagmentation-based whole-genome bisulfite sequencing (T-WGBS) and RNA sequencing in selectively sorted primary AT2 cells. The study stands out in its sequencing breadth and use of an incredibly difficult cell population, and has the potential to add substantially to our mechanistic understanding of epigenetic contributions to COPD. A further highlight is the concluding aspect of the study where the authors undertook targeted modification of specific CpG methylation, providing direct, site-specific evidence for transcriptional regulation by CpG methylation.

      Major comments:

      The authors clearly show that there is DNA methylation alteration in AT2 cells from COPD individuals that links functionally to gene expression at some level. However, I think the statement "to identify genome-wide changes associated with COPD development and progression..." and similar other references to disease development understanding is not accurate given the DNA methylation primary comparison is between control and moderate to severe COPD, with no temporal detail or evidence that they drive progression rather than are a result of COPD development. The paragraph starting on line 186 where this is addressed to some extent is quite vague and doesn't really provide confidence that DNAm dysregulation occurs at an early stage in this context. This can be addressed by changing the focus/style of the text.

      Results comments and suggestions:

      For the integrated analysis, there is a focus on DMRs in promoters with very little analysis on other regions. The paragraph starting on line 317 describes some analysis on enhancers but is very brief, doesn't include information on how many/which DMRs were included, making it hard to interpret the impact of the 147 DMRs and 93 genes identified - is this nearly all DMRs and genes analysed or very few? A comparison to the promoter analysis would be of interest. Especially as the targeted region followed up with lovely functional assessment in the last sections is a gene body DMR, not a promoter DMR.

      • Lines 299-301 - I'm not sure the graph in Fig S3A supports the conclusion that there was a preferential negative relationship between DNAm and gene expression. It looks like there are a substantial number of cases where a positive relationship is observed, and this needs to be acknowledged.

      • Line 307 - what are the "analysed DEGs"? Are they the methylation associated genes?

      • Line 307-309 - "Among the analyzed DEGs, 76.5% (492) displayed a negative correlation (16.8% of the total DEGs), indicating a possible direct regulation by DNA methylation, while 23.5% (151) showed a positive correlation between gene expression and DNA methylation" - are the authors suggesting the positive correlation doesn't indicate direct regulation?

      • Line 313 - why did the authors focus on only negatively correlated genes to identify their top dysregulated pathway of IFN signalling? Why not do pathway analysis on the DNAm associated genes separately to identify DNAm associated pathways?

      • A comparison of the gene expression data with previous data in AT2 cell/single cell data would strengthen the gene expression section.

      • The paragraph starting on line 173 feels a little redundant when we know there is RNA available to test whether the differential DNAm links to altered gene expression. This selection of example regions/genes would be better placed after the gene expression has been reported, at which point you could say whether the linked genes displayed altered transcription.

      • Similarly, the TF enrichment analysis is great but might have added more value if done on DNA regions later shown to be linked to differential expression: was there different enrichment at DNA regions that are vs are not associated with altered expression? And could you test in vitro whether changing methylation of DNA (maybe a blunt tool like 5-aza would be ok) alters TF binding (CUT&RUN/ChIP)? Furthermore, it would be interesting to understand the TF sensitivity analysis within the context of positive versus negative DNA methylation:gene expression correlations.

      Methods:

      • The authors should include more detail on the T-WGBS rather than directing the reader to a previous publication. Also, DNA concentration post bisulphite conversion would be a useful metric to provide.

      • Differential DNA methylation analysis: It is stated that DNA regions had to contain 3 CpG sites but was this within a defined DNA size range?

      • Reference genome only provided for RNAseq, not T-WGBS?

      • The tables do not appear in the PDF, and I struggled to tally them to the "Dataset" files provided, if that is what they were referring to.

      • For the gene expression analysis, can it be made clearer that a full analysis was done on COPD I samples? It is a little confusing to the reader as this was not done for DNAm, so one might assume the same targeted analysis was done on only those genes found to be differentially expressed between control and COPD II-IV, but that cannot be the case as an overlap of COPD I vs COPD II-IV genes is provided. For this overlap, do genes show the same effect direction?

      • Replication is difficult in these studies as the samples are so difficult to come by, and sample size is limited for the same reason. That doesn't mean the study is not worth doing, and the data are still valuable. However, it may be pertinent to include technical validation of a few regions of interest, acknowledge the limitation (alongside strengths) in the discussion, and perhaps provide actual p values rather than a blanket p < 0.1, which seems very lenient although the results may all be highly significant (this may already be in the tables I wasn't able to find).

      • It isn't clear to me if DNA and RNA are from the same cells. The results say "cells matching those used for T-WGBS" but the methods suggest separate extractions, so not the same cells? If they are not the same cells, a comment on the implications of this should be included in the discussion, for example potential differences in cell type composition, storage time, etc.

      • Line 193 the authors say "Since DMRs were overrepresented at cis-regulatory sites...." - "cis" needs to be defined. If you link DNAm regions to genes via "closest gene", does this not automatically mean your outputs will be cis? This just needs better definition/explanation.

      Minor comments:

      • Line 157: "we identified site-specific differences....". Change to region specific?

      • Line 102-103: needs a reference for the statement "Alterations in DNA methylation patterns have been implicated......"

      • Line 266 - what does "strong dysregulation" mean? Large fold change, very significant?

      • Lines 423-425 - statement needs a reference

      • Line 428 - word missing between "epigenetic , we"?

      • Prior studies are well referenced; text and figures are clear and accurate.

      Significance

      This study has several strengths:

      1) Sample collection and characterisation. AT2 cells are incredibly hard to come by and the authors should be commended for generating the samples. However, proximity to cancer is always a potential issue, especially in epigenetic studies. Is it feasible to include any analysis to show that the samples derived from those with cancer don't drive the changes observed? Even a high-level PCA or an edit of Fig 2A with non-cancer in a different colour in the supplement would help (it looks like there is one outlier; is that a non-cancer sample?). Or a correlation of change in beta between control and cancer/COPD and control and non-cancer:COPD (for want of a better phrase!). Just an indicator that the non-cancer COPD samples are not driving differences.

      2) This is the first time DNAm has been profiled in AT2 cells. It is incredibly difficult, valuable and novel data that will considerably increase the field's technical capability, its understanding of functional mechanisms and its potential for translation. Its audience will be primarily translational respiratory; however, the fundamental science aspect of gene expression regulation by DNA methylation will have wider reach across developmental and disease science.

      3) The functional analysis using targeted CRISPR-Cas9 is very well done and adds impact.

      Potential weaknesses/areas for development:

      I feel the main weakness is in the section integrating DNA methylation and gene expression. The rationale for a focus on various aspects, for example inversely related DNAm/gene expression pairs, the IFN pathway and IRF9, is not clear. Also, the differences between DNAm-associated genes and non-DNAm-associated genes could be explored further at the pathway level, TF regulation level, and effect size level (are DNAm-associated changes to gene expression larger, or enriched for earlier differential expression?). The authors could also comment on potential masking of differences between 5hmC and mC and the implications it may have.

      Furthermore, while the rationale for looking at DMRs is clear, especially given the sample number, I am interested to understand what proportion of the assayed CpGs "fit" within the cut-off stipulations of the DMR analysis; that is, are there potentially COPD effects at sparse CpG regions/individual CpG sites that are not being identified? A comment on this would be useful and seems to be the strength of profiling genome-wide. I'm happy that genome-wide profiling is beneficial; it just feels a little circular that the authors have chosen whole genome to avoid the bias of the Illumina array and a focus on promoters, but have primarily reported promoter DNAm. This caught my attention again in the discussion, where the authors state that cis-regulatory regions were also identified in their fibroblast data: is this finding a factor of the analysis performed? (Also, a comparison of regions identified in AT2 cells versus fibroblasts would be really interesting for a future paper.)

      My expertise: Respiratory, cell biology, epigenetics.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      In this manuscript, Gruber et al perform serial EM sections of the antennal lobe and reconstruct the neurites innervating two types of glomeruli: one that is narrowly tuned to geosmin and one that is broadly tuned to other odours. They quantify and describe various aspects of the innervations of olfactory sensory neurons (OSNs), uniglomerular projection neurons (uPNs), and the multiglomerular local interneurons (LNs) and PNs (mPNs). They find that narrowly tuned glomeruli had stronger connectivity from OSNs to PNs and LNs, and considerably more connections between sister OSNs and sister PNs than the broadly tuned glomeruli. They also had less connectivity with the contralateral glomeruli. These observations are suggestive of strong feed-forward information flow with minimal presynaptic inhibition in narrowly tuned glomeruli, which might be ecologically relevant, for example, while making quick decisions such as avoiding a geosmin-laden landing site. In contrast, information flow in more broadly tuned glomeruli shows much more lateralisation of connectivity to the contralateral glomerulus, as well as to other ipsilateral glomeruli. 

      The data are well presented, the manuscript clearly written, and the results will be useful to the olfaction community. I wonder, given the hemibrain and FAFB datasets exist, whether the authors have considered verifying whether the trends they observe in connectivity hold across three brains? Is it stereotypic? 

      We appreciate the reviewer’s positive view of our study and their thoughtful and relevant comment on the issue of individual variation. We agree that this is a very important question and note that it was also raised by the second Reviewer. It reflects both our limited understanding of the range of individual variation in synaptic connectivity (whether in flies, humans, or other species) and the challenge of determining which of the differences observed in our study are stereotypical features of each glomerulus type. Undoubtedly, this criticism addresses a crucial problem of practically all connectome studies so far, for which there is no immediate solution. This type of study requires so much time, effort and money that increasing the number of samples is seldom feasible. The Reviewer wonders whether we could compare our data with those made available by two of the largest connectome studies of Drosophila. This appeared to us to be a very good idea and we tried to follow the advice but, unfortunately, it was impracticable for the reasons we explain below. The hemibrain data cannot be used for this purpose because it does not contain the full glomerulus DA2 (Schlegel et al., 2021). A different problem hindered us from using the FAFB dataset, the other dataset mentioned by the Reviewer. In this case the three glomeruli were sectioned and reconstructed, but the dataset lacks an annotated list of all synaptic connections corresponding to each glomerulus. Such annotation (a compendium of all synaptic connections inside each glomerulus, indicating for each connection which type of neuron provides the presynaptic site and which the postsynaptic site) is essential for direct comparison with our data. It is important to keep in mind that the current analytical tools available for the use of these datasets (e.g., NeuPrint, FlyWire and CATMAID) do not offer the ability to extract data on synapses exclusively from the glomerular volume of DA2 or DL5. 
In this case, it is certainly theoretically possible to obtain the data by doing the annotation ourselves. However, such a study would demand so much time, effort and financial resources that we believe it would not be justified solely to increase the number of individuals from one to two. Instead, our manuscript includes a comparison of the OSN connectivity in VA1v and DL5 using the hemibrain dataset published by Schlegel et al. (2021) (see revised manuscript: lines 311–315; 431–434; 558–562; 602–606).

      Beyond the opinion, which we share in full with the Reviewer, that a comparison including three flies would be better than a comparison made with one glomerulus of each type, we are still challenged by the question of which, if any, of the differences are stereotypic. Clarifying which differences between particular glomeruli in features such as those discussed in our study are stereotypical, and which simply fall within the normal range of individual variation, is basically a statistical problem. A first attempt at a comprehensive comparison focusing on intra- and inter-individual variability was recently made by comparing two connectome datasets from two different Drosophila individuals (Dorkenwald et al., 2024; Schlegel et al., 2024). At present, it is still unclear how many samples are needed to make a statistically robust comparison of olfactory synaptic circuits in adult flies: perhaps 3, 6, or even 18 individuals?

      Reviewer #2 (Public Review):

      The chemoreceptor proteins expressed by olfactory sensory neurons differ in their selectivity such that glomeruli vary in the breadth of volatile chemicals to which they respond. Prior work assessing the relationship between tuning breadth and the demographics of principal neuron types that innervate a glomerulus demonstrated that narrowly tuned glomeruli are innervated by more projection neurons (output neurons) and fewer local interneurons relative to more broadly tuned glomeruli. The present study used high-resolution electron microscopy to determine which synaptic relationships between principal cell types also vary with glomerulus tuning breadth using a narrowly tuned glomerulus (DA2) and a broadly tuned glomerulus (DL5). The strength of this study lies in the comprehensive, synapse-level resolution of the approach. Furthermore, the authors implement a very elegant approach of using a 2-photon microscope to score the upper and lower bounds of each glomerulus, thus defining the bounds of their restricted regions of interest. There were several interesting differences including greater axo-axonic afferent synapses and dendrodendritic output neuron synapses in the narrowly tuned glomerulus, and greater synapses upon sensory afferents from multiglomerular neurons and output neuron autapses in the broadly tuned glomerulus. The study is limited by a few factors. There was a technical need to group all local interneurons, centrifugal neurons, and multiglomerular projection neurons into one category ("multiglomerular neurons") which complicates any interpretations as even multiglomerular projection neurons are very diverse. Additionally, there were as many differences between the two narrowly tuned glomeruli as there were comparing the narrowly and broadly tuned glomeruli. Architecture differences may therefore not reflect differences in tuning breadth, but rather the ecological significance of the odors detected by cognate sensory afferents. 
Finally, some synaptic relationships are described as differing and others as being the same between glomeruli, but with only one sample from each glomerulus, it is difficult to determine when measures differ when there is no measure of inter-animal variability. If these caveats are kept in mind, this work reveals some very interesting potential differences in circuit architecture associated with glomerular tuning breadth.

      This work establishes specific hypotheses about network function within the olfactory system that can be pursued using targeted physiological approaches. It also identifies key traits that can be explored using other high-resolution EM datasets and other glomeruli that vary in their tuning selectivity. Finally, the laser "branding" technique used in this study establishes a reduced-cost procedure for obtaining smaller EM datasets from targeted volumes of interest by leveraging the ability to transgenically label brain regions in Drosophila.

      CLASSIFICATION OF NEURONAL TYPES

      We agree that grouping diverse types of interneurons into a single category (referred to as MGNs) limits the ability to make interpretations about synaptic similarities and differences between specific neuronal types. This was, however, an unavoidable compromise resulting from our decision to generate a comprehensive, synapse-level reconstruction of the restricted regions encompassing the DA2 and DL5 glomeruli. As both reviewers have noted, this approach offers significant value and we hope the Editor will also recognize that this limitation does not prevent readers from gaining important and novel insights into the synaptic circuitry of these two glomeruli.  

      Similar to the approach taken by Tobin et al. (2017), we prioritized producing a densely reconstructed neuropile, in which no synapses were omitted (Tobin et al., 2017). The downside of this method is that not all synaptic connections could be reliably assigned to specific neuronal types, with about 12% remaining unassigned. We anticipate that future research, supported by advances in semi-automated tracing methods, improved imaging technologies, and increased personnel resources, will allow not only for the generation of more complete connectomes of the entire brain (Scheffer et al., 2020; Zheng et al., 2018), but also for the accurate reconstruction and classification of individual synapses, even in highly complex regions such as the olfactory glomeruli. We also expect that a second complete connectome of a male Drosophila will soon become available, which will provide valuable opportunities for comparisons across individuals and between male and female brains in future studies.

      INTERGLOMERULAR DIFFERENCES

      Thank you for this insightful comment. It is indeed true that despite both DA2 and VA1v being narrowly tuned glomeruli, they exhibit considerable differences in specific connectivity features (e.g., relative synaptic strengths above certain thresholds) and that those differences can be as pronounced as those observed between DA2 and the broadly tuned DL5. For this reason, comparing each individual glomerulus to every other is not a practical or informative approach. To derive robust interpretations, we focused instead on whether two glomeruli that share a particular functional characteristic—namely, being narrowly tuned for single odorants—also share connectivity patterns that distinguish them from a broadly tuned reference glomerulus.

      Our results support this. Furthermore, additional connectomics data reinforce our conclusions.

      For example, OSN-OSN connectivity is stronger in the two narrowly tuned glomeruli (DA2 and VA1v) relative to the broadly tuned glomerulus (DL5). While these pairwise differences alone are not conclusive, the finding that the two narrowly tuned glomeruli studied here share features that distinguish them from the broadly tuned glomerulus supports our interpretation. We found further support for this idea in the data reported by Schlegel et al. (2021). In that dataset, other narrowly tuned glomeruli (DA1, DL3, and DL4) also exhibit stronger OSN-OSN connectivity than other broadly tuned glomeruli (DM1 or DM4).

      We do not deny that there are many differences between any given pair of glomeruli, regardless of whether they are narrowly or broadly tuned. Instead, we propose that our findings on circuit features indicate that most of the observed differences actually group the two narrowly tuned glomeruli together relative to the broadly tuned glomerulus. A more concise summary is now provided in the newly added Figure 8. We also added explanatory lines of text at the beginning of the chapter ‘specific features of narrowly tuned glomerular circuits’.

      ECOLOGICAL SIGNIFICANCE

      This is an interesting point. However, it is difficult to disentangle the "ecological significance" of processed odorants from the "tuning breadth" of a glomerulus. In the Drosophila olfactory system, glomerular circuits that respond to ecologically important odorants—such as those involved in reproduction or danger—tend to be more narrowly tuned. Moreover, while we refer to odorants with specific ecological significance as those linked to survival or reproductive behaviors, defining the significance of an odorant with precision is inherently challenging, as it can vary depending on context and environmental conditions.

      What both circuits share is their narrow tuning breadth. We therefore propose that the common circuit features of VA1v and DA2, highlighted in this study, are functionally related to the fact that each circuit processes single odorants. Consequently, their specificity is most likely determined at the level of the receptor. 

      INDIVIDUAL VARIABILITY

      We agree that accounting for inter-animal variability would strengthen the study. However, we are confident that even a modest statistically sound assessment of this variability would require a larger sample size, certainly more than just two or three flies, which is presently not feasible.

      We refer the reviewer to our response to Reviewer #1 regarding this important issue.

      Initial insights into variability between flies have been provided through comparative analyses of the two most comprehensive female Drosophila melanogaster connectomes, the FAFB and hemibrain datasets (Schlegel et al., 2024). For more detailed quantitative comparisons regarding inter-animal variability, please refer to our response to the second major point raised by Reviewer #2. As highlighted by Schlegel et al. (2024), making definitive statements about the stereotypy of neuron numbers, unitary cell-cell connections (edges), or synaptic strengths (weights) remains a complex challenge.

      While appreciating the rigour of this work, we were surprised to notice the omission of a comparison of their observations with the two other existing datasets. This would not only have addressed the technical limitation of this particular study - the inability to identify specific neuron types due to imaging a small part of the brain - but would also have shed light on inter-animal variability.

      We strongly recommend that the authors do make this comparison - the datasets are currently extremely user friendly and so we don't estimate the replication of their key findings will be too onerous. This will be particularly important to resolve the issue of having to classify all multiglomerular local interneurons and multiglomerular projection neurons broadly into "MGN". Such a comparison will dramatically strengthen this study that poses very interesting questions but, in its current form, has this striking shortcoming. 

      INDIVIDUAL VARIABILITY AS EXPRESSED HERE:

      Earlier on we were of the same opinion that the Reviewer expresses here but, unfortunately, it was not possible to follow this advice. As far as possible, we have compared some of our results to the values of the two datasets that the Reviewer refers to, but the absence of glomerulus DA2 in one of the datasets and the absence of synapse annotation for all the relevant glomeruli in the other dataset prevented us from making a full comparison. Moreover, we believe that the problem of individual variation most probably cannot be solved by extending the comparison to one or two more flies.

      Reviewer #1 (Recommendations for The Authors): 

      The lines 270 - 282 confused me in the backdrop of Figure 3B. 

      The concern may stem from our inclusion of a comparison between the uPNs of glomerulus DA2 and the single uPN of glomerulus DL5 in the statistical analysis presented in Figure 3. This comparison was included to ensure a comprehensive representation of the data, highlighting the variability across all major cell groups. We have clarified this rationale in the revised manuscript (see lines 274-282).

      Reviewer #2 (Recommendations for The Authors): 

      I commend the authors for taking such a thorough approach to advance an interesting topic in olfaction. The following suggestions are intended to strengthen this study: 

      Major points: 

      A color-blind-friendly palette should be used for all figures. Currently, five of seven figures use red and green, and in particular, Figure 5 will be uninterpretable for red/green color-blind readers. 

      We are thankful for this important comment. We changed the color palette as suggested by the reviewer, replacing red with magenta, and changed the figure legend accordingly.

      This level of analysis is extremely resource and time-consuming, so even obtaining this information at this resolution is an impressive achievement. However, this study would be well served by strategically supplementing the analysis of this dataset with information from other publicly available connectomics datasets. For instance, some interpretations are limited because there is information from only a single DL5 and DA2 glomerulus. Any claims in which one glomerulus has more, less, or the same of a metric must be tempered because without replicates, there are no measures of inter-animal variability. As an example, on lines 386-387 the authors state "The relative synaptic strength between MGN>uPN was stronger in DA2 (12%) than DL5 (10%)". It is difficult to assess whether this represents a difference that is outside of the range of inter-animal variability inherent to the olfactory system. Taking select measures from the Hemibrain and FAFB (via FlyWire) datasets could help strengthen these claims. 

      We fully agree with the Reviewer’s opinion that since our data is from one glomerulus of each type “It is difficult to assess whether this represents a difference that is outside of the range of inter-animal variability inherent to the olfactory system.” This is a weakness of practically all connectome studies based on electron microscopy in both Drosophila and other animals. We cannot be sure that measurements from the Hemibrain and FAFB datasets could help strengthen our claims, because the magnitude of the range of individual variation is presently not known and solving this problem will most probably require more than one or two more flies. In any case, it is not possible to follow this advice and compare our data with that of the hemibrain because the DA2 was not included in that study. We ask the Reviewer to read our more detailed explanation in our response to Reviewer 1.

      In the particular case commented by the Reviewer above, the relative difference in synaptic strength exceeds 20%. Whether such a difference has functional relevance remains an open question but Schlegel et al. (2024) support our interpretation. They showed that synaptic weights with differences larger than 20% tend to be consistent across individuals, with strong correlations within and between animals (Pearson’s R = 0.97 and R = 0.8; Fig. 4).

      Grouping all local interneurons, centrifugal neurons and multiglomerular PNs into one category limits the ability to make interpretations about similarities or differences in the synaptic relationships involving MGNs. The authors could get an estimate of the number of multiglomerular PNs in DL5, VA1v, and DA2 from Hemibrain and FlyWire platforms to get a better sense of differences between glomeruli in the MGN category. 

      We agree that grouping a variety of interneurons into a single category (called MGNs) limits the ability to make interpretations about similarities or differences in the synaptic relationships involving different neurons. This was the unavoidable price to be paid once we decided to register a “comprehensive, synapse-level resolution” map of these two glomeruli. It appears to us that both reviewers have clearly recognized the intrinsic value of this approach and we hope that the Editor will share this opinion. 

      Consistent with the assumptions of Tobin et al. (2017), our hypothesis on LN connectivity differences is based on the fact that they are the most numerous and broadly arborizing neurons of the class that we call multiglomerular neurons in the AL (Chou et al., 2010; Lin et al., 2012; Tanaka et al., 2012). Recent connectome studies confirm this feature across all glomeruli (Bates et al., 2020; Horne et al., 2018; Scheffer et al., 2020; Schlegel et al., 2021; Zheng et al., 2018).

      In response to the reviewer’s question, we conducted a case-specific reanalysis of the data from Horne (2018), which provides comprehensive connectivity information for the VA1v glomerulus. This allowed us to quantify the proportional contributions of LNs (n = 56) and mPNs (n = 13) to all MGN connections (MGN-MGN, MGN>OSN, MGN>uPN, uPN>MGN, OSN>MGN).

      Our analysis showed that 84% of MGN output originates from LNs. 57% of the input to MGN comes from LNs and 43% from mPNs, largely due to strong OSN>mPN input. Thus, for the filtered MGN connections relevant to distinguishing narrowly from broadly tuned circuits (e.g., MGN>OSN, uPN>MGN; see Fig. 8), LNs are the dominant contributors in VA1v. (These data are not included in the resubmitted manuscript.) This supports our interpretation that LNs are responsible for the majority of MGN connections underlying the observed differences between glomeruli.

      For instance, prior work has reported fewer local interneurons innervating DA2, but in this study there was an unexpected result that there was greater MGN innervation density and synapse # for DA2 relative to DL5. This discrepancy could be due to differences in the number of multiglomerular PNs innervating each glomerulus, which would be obscured when these PNs are combined with local interneurons in the MGN category. 

      We agree that the greater MGN innervation density in DA2 in our study could reflect a stronger contribution from mPNs. However, innervation density alone does not indicate how many mPNs actually innervate DA2 or DL5. Alternatively, increased innervation and/or synaptic frequency of local interneurons (LNs) could also account for this observation. In our view, neuron number does not necessarily correlate with branching complexity or synaptic density. 

      For example, the dendritic length of the single uPN in glomerulus DL5 is approximately equal to the combined dendritic length of the multiple uPNs of the DA2. Similarly, Tobin et al. (2017) reported that when comparing uPNs in glomerulus DM6 between the left and right brain hemispheres, they found variability in cell number but not in dendritic length. More recently, the FAFB and hemibrain datasets showed a similar pattern in another neuronal type. A substantial variation in cell number was observed for Kenyon cells between the two Drosophila individuals, but this cell type consistently makes and receives, in both individuals, similar presynapses and post-synapses (Schlegel et al., 2024).

      On line 33 the authors cannot claim that DA2-OSNs experience less presynaptic inhibition based on the data in this study. Even without the limitations of the MGN category (described above), presynaptic inhibition depends on more than just the number of synapses, rather it is affected by GABA B receptor expression levels and the second messenger components downstream of this receptor. Physiological experiments are needed to justify this claim, so I recommend adjusting accordingly.

      We agree with the Reviewer and have adjusted the text on line 33 and in the main body of the text by referring to this finding as “presynaptic input”, which is what we have quantified, instead of “less presynaptic inhibition”.

      Figures 5 and 6 seek to distill the wealth of information from this study into broad take-home points for the reader, while still providing a good amount of detail. I think a final more concise graphic summary (similar to the graphical abstract or Figure 6 of Grabe et al 2016) depicting the most critical differences between glomeruli would further clarify the broad findings of this study. 

      We appreciate this comment and we have added a “graphic summary” as the Reviewer proposed. We made a new figure that becomes Figure 8 and summarizes our results and highlights differences between narrowly and broadly tuned glomeruli in a more concise graphical abstract format.

      Minor points: 

      Much of the manuscript provides details about synapse fractions or % synapses for a given synaptic relationship. Please ensure that it is clear which principal cell types are being described, as it can be easy to get lost.  - Should line 284 say "...than DL5 as it has been reported that DA2 is innervated by fewer LNs..."?

      We appreciate the reviewer’s comment and we have corrected this sentence that now reads as follows: (see text: beginning at line 290).  

      Taisz et al.  has been published, so the citation should be updated. 

      We have updated the corresponding citation.  

      On line 233, the authors ascribe the small electron-dense vesicles as likely housing sNPF released by MGNs. However, Carlsson et al. (2010) demonstrated that sNPF is released by OSNs, which was further functionally characterized by Root et al. (2011) and Ko et al. (2014). In terms of MGNs that release neuropeptides, Carlsson et al. 2010 demonstrated that local interneurons immunolabel for tachykinin, myoinhibitory peptide, and allatostatin-A, while two extrinsic neurons release SIFamide. In theory, aminergic neurons could also have small electron-dense vesicles, but this can be variable. 

      The Reviewer is completely right in this criticism. The MGNs certainly include neurons that have been reported to contain neuropeptides other than sNPF. We have corrected this sentence and it now reads as follows (page 7, line 236): “Interestingly, besides the abundant clear small vesicles..

      On line 636, the Berck and Schlegel studies demonstrated that panglomerular local interneurons synapse upon OSN, but not that they induce presynaptic inhibition (which was demonstrated in the studies cited in the next sentence). I recommend adjusting this sentence.

      We agree and we have corrected the text following the Reviewer's advice. It now reads as follows (page 19, line 663): “We also observed that OSNs received less MGN feedback.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Recommendation For the Authors):

      Thanks to the authors for addressing my suggestions. I think these modifications have improved the clarity of the data and the overall presentation of the manuscript. The methods are now more clearly explained, and the additional details help make the results easier to interpret. Where addressing the comment wasn't feasible, the authors gave reasonable explanations. Overall, the revisions strengthen the paper, and I have no further concerns.

      Thank you for your recommendations, which have significantly improved our paper.

      Reviewer #2 (Recommendation For the Authors):

      The additional work conducted by the authors is greatly appreciated. All concerns (and beyond) have been thoroughly addressed by the authors and I am thankful for their consideration and attention to detail. Only one possible issue with the revisions is described below for consideration:

      Regarding the CFU counts and/or axis labels in Figure S3B, some of the listed "CFU per 1 mL" values (in both the figure itself and File S2B) are extraordinarily high. For example, the greatest CFU for PA14 observed in Figure 4E is ~1x10^9. However, PA14 at 0 ug/mL Ceftazidime reaches nearly 1x10^16 in Figure S3B. From what I can tell, this should be beyond the capacity of bacteria in this space by several orders of magnitude. (E.g., a cubic centimeter [~1 mL] is ~1x10^12 cubic micrometers. At their smallest dimensions and volume, a maximum of ~1x10^13 cells could theoretically fit in this space assuming no liquid and perfect organization.) Similarly, both "AMM" and "AMM (+PA14)" consistently reach CFUs between 1x10^12 and 1x10^14 in this assay. Are the authors confident in the values and/or depiction of CFUs for this figure? It seems like this could be a labeling or dilution-counting issue.
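
      The reviewer's packing estimate can be reproduced with a short calculation. This is only a sketch: the density of 10 cells per cubic micrometer (about 0.1 cubic micrometers per cell) is an assumption inferred from the reviewer's ~1x10^13 ceiling, not a measured value, and the function name is hypothetical.

```python
# Back-of-envelope packing bound for ~1 mL, following the reviewer's figures.
UM_PER_CM = 10**4                    # 1 cm = 10^4 micrometers
volume_um3 = UM_PER_CM**3            # 1 cm^3 (~1 mL) expressed in cubic micrometers

# Assumption: at most 10 cells per cubic micrometer (~0.1 um^3 per cell),
# the density implied by the reviewer's ~1e13 ceiling; real cells are larger.
MAX_CELLS_PER_UM3 = 10

max_cells = volume_um3 * MAX_CELLS_PER_UM3


def exceeds_packing_bound(cfu_per_ml: float) -> bool:
    """Return True if a reported CFU/mL value is above the packing ceiling."""
    return cfu_per_ml > max_cells


# Values quoted in the comment: ~1e9 (Figure 4E) and ~1e16 (Figure S3B).
for reported in (1e9, 1e16):
    print(f"{reported:.0e} CFU/mL exceeds bound: {exceeds_packing_bound(reported)}")
```

      Under these assumptions, the Figure 4E value sits well below the ceiling, while the Figure S3B value is about three orders of magnitude above it.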

      Thank you for your positive remarks on our revised manuscript and for your constructive comments that have strengthened our work.

      We agree with the concern regarding the CFU counts in Figure S3B. The very high values (>10<sup>12</sup> CFU) reflect a technical enumeration artifact that, due to the nature of the assay, cannot be fully avoided. The origin of these inflated counts is described in more detail below:

      Following competition assays between Pseudomonas aeruginosa and Stenotrophomonas maltophilia in liquid culture with antibiotics, we enumerate survivors for each species by colony forming unit (CFU) counts. Because two different bacterial species must be quantified from mixed cultures, we use a gentamicin resistance marker carried by one species at a time.

      Each condition is therefore enumerated twice, as we alternate which species harbors the gentamicin cassette.

      During coculture in antibiotics and minimal medium, clinical isolates of P. aeruginosa and S. maltophilia, like those used here, can transiently increase their tolerance to antibiotics, including aminoglycosides. This reduces the effectiveness of gentamicin selection at the plating step necessary for CFU enumeration. For the data presented in Figure S3B, in a subset of high-OD<sub>600</sub> conditions in the competition assay, this tolerance produces artificially inflated CFU values that exceed the biological carrying capacity during the CFU enumeration step.

      We evaluated alternative enumeration strategies (e.g., fluorescent protein markers with a nonselective medium), but these proved unsuitable for these strains due to differences in growth rates and media compatibility, introducing other large biases. Given these constraints, selective plating remains the only feasible approach for this work, and the associated artifact cannot be eliminated entirely.

      Importantly, transient resistance (tolerance), although common, is not a universal occurrence (e.g., we did not observe it when we performed the experiments shown in Figure 4E). When it does arise, it occurs reproducibly under the same experimental high-OD<sub>600</sub> conditions and does not obscure any of the relative comparisons that underpin our conclusions.

      For transparency, we have retained the measured values in Figure S3B and we note in the legend that counts above ~10<sup>12</sup> CFU represent a technical overestimation due to transient gentamicin tolerance. Counts below 10<sup>12</sup> CFU are accurately enumerated.

      Reviewer #3 (Recommendation For the Authors):

      All concerns have been satisfied and the manuscript is ready for publishing.

      Thank you for your recommendations, which have significantly improved our paper.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      The study would benefit from presenting raw data in some cases, such as MIC values and SDS-PAGE gels, by clarifying the number of independent experiments used, as well as further clarification on statistical significance for some of the data.

      All original data used to generate Fig. 1, Fig. 4E, Fig. S3 and Fig. S4A are presented in File S2. Tab (A) is dedicated to data used for Fig. 1 and Fig. S4A, while tabs (B) and (C) show the data used for Fig. 4E and S3, respectively. This information is indicated in the legends of the relevant figures.

      All experiments in this study were performed in three independent (biological) experiments (with the exception of the complementation data shown in Fig. S1 and Fig. S5, which were performed in two independent (biological) experiments). The number of biological and technical replicates for each experiment is stated in the figure legends, as well as in the “Statistical analysis of experimental data” part of the “Materials and Methods” section of the paper. Specifically, for antibiotic MIC assays we have not performed statistical analyses as per recommended practice. The reason for this is stated in the following section from the “Statistical analysis of experimental data” part of the “Materials and Methods” section of the paper (lines 699-711 of the revised manuscript):

      “Antibiotic MIC values were determined in biological triplicate, except for MIC values recorded for dsbA complementation experiments in our E. coli K-12 inducible system that were carried out in duplicate. All ETEST MICs were determined as a single technical replicate, and all BMD MICs were determined in technical triplicate. All recorded MIC values are displayed in the relevant graphs; for MIC assays where three or more biological experiments were performed, the bars indicate the median value, while for assays where two biological experiments were performed the bars indicate the most conservative of the two values (i.e., for increasing trends, the value representing the smallest increase and for decreasing trends, the value representing the smallest decrease). We note that in line with recommended practice, our MIC results were not averaged. This should be avoided because of the quantized nature of MIC assays, which only inform on bacterial survival for specific antibiotic concentrations and do not provide information for antibiotic concentrations that lie in-between the tested values.”
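
      The summary rule quoted above can be illustrated with a small sketch. The function name, the `baseline` argument (the MIC of the comparison strain, used to decide which of two values represents the smaller change) and the example values are all hypothetical. Using `statistics.median_low` is also an assumption, chosen so that the summary is always one of the recorded MIC values and never an average of two.

```python
import math
import statistics


def summarise_mic(mics, baseline):
    """Summarise biological replicates of an MIC without averaging them.

    mics:     recorded MIC values (one per biological experiment)
    baseline: MIC of the comparison strain (hypothetical argument)
    """
    if len(mics) >= 3:
        # Three or more experiments: report the median recorded value.
        return statistics.median_low(mics)
    if len(mics) == 2:
        # Two experiments: report the most conservative value, i.e. the one
        # with the smallest fold change (up or down) relative to the baseline.
        return min(mics, key=lambda m: abs(math.log2(m / baseline)))
    raise ValueError("at least two biological experiments are expected")


# Hypothetical examples (MICs in ug/mL, baseline of 64 ug/mL).
print(summarise_mic([8, 16, 8], baseline=64))   # triplicate: median
print(summarise_mic([4, 16], baseline=64))      # duplicate, decreasing trend
print(summarise_mic([128, 256], baseline=64))   # duplicate, increasing trend
```

      In each case the returned number is one of the measured concentrations, which respects the quantized nature of MIC assays described in the quoted text.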

      Reviewer #2 (Public review):

      While Figure 5E demonstrates a protective effect of DsbA-dependent β-lactamase, the omission of CFU data for S. maltophilia makes it difficult to assess the applicability of the polymicrobial strategy. Since S. maltophilia is pre-cultured prior to the addition of P. aeruginosa and antibiotics, it is unclear whether the protective effect is dependent on high S. maltophilia CFU. It is also unclear what the fate of the S. maltophilia dsbA dsbL mutant is under these conditions. If DsbA-deficient S. maltophilia CFU is not impacted, then this treatment will result in the eradication of only one of the pathogens of interest. If the mutant is lost during treatment, then it is not clear whether the loss of protection is due specifically to the production of non-functional β-lactamase or simply the absence of S. maltophilia.

      We have simultaneously tracked the abundance of P. aeruginosa and S. maltophilia strains in our cross-protection experiment for select antibiotic concentrations. To be able to perform this experiment, we had to label two extremely-drug-resistant strains of S. maltophilia with an antibiotic resistance marker that allowed us to quantify them in mixtures with P. aeruginosa. Our results can be found in Fig. S3 of our revised manuscript and, in a nutshell, show that ceftazidime treatment leads to eradication of both P. aeruginosa and S. maltophilia when disulfide bond formation is impaired in S. maltophilia.

      The following text was added to address the questions of the reviewer:

      “Due to the naturally different growth rates of these two species (S. maltophilia grows much slower than P. aeruginosa) especially in laboratory conditions, the protocol we followed [1] requires S. maltophilia to be grown for 6 hours prior to co-culturing it with P. aeruginosa. To ensure that at this point in the experiment our two S. maltophilia strains, with and without dsbA, had grown comparatively to each other, we determined their cell densities (Fig. S3A). We found that S. maltophilia AMM dsbA dsbL had grown at a similar level as the wild-type strain, and both were at a higher cell density [~10<sup>7</sup> colony forming units (CFUs)] compared to the P. aeruginosa PA14 inoculum (5 x 10<sup>4</sup> CFUs)” (lines 353-361 of the revised manuscript).

      “To ensure that ceftazidime treatment leads to eradication of both P. aeruginosa and S. maltophilia when disulfide bond formation is impaired in S. maltophilia, we monitored the abundance of both strains in each synthetic community for select antibiotic concentrations (Fig. S3B). In this experiment we largely observed the same trends as in Fig. 4E. At low antibiotic concentrations, for example 4 μg/mL of ceftazidime, S. maltophilia AMM is fully resistant and thrives, thus outcompeting P. aeruginosa PA14 (dark pink and dark blue bars in Fig. S3B). The same can also be seen in Fig. 4E, whereby decreased P. aeruginosa PA14 CFUs are recorded. By contrast S. maltophilia AMM dsbA dsbL already displays decreased growth at 4 μg/mL of ceftazidime because of its non-functional L1-1 enzyme, allowing comparatively higher growth of P. aeruginosa (light pink and light blue bars in Fig. S3B). Despite the competition between the two strains, P. aeruginosa PA14 benefits from S. maltophilia AMM’s high hydrolytic activity against ceftazidime, which allows it to survive and grow in high antibiotic concentrations even though it is not resistant (see 128 μg/mL; dark pink and dark blue bars in Fig. S3B). In stark opposition, without its disulfide bond in S. maltophilia AMM dsbA dsbL, L1-1 cannot confer resistance to ceftazidime, resulting in killing of S. maltophilia AMM dsbA dsbL and, consequently, also of P. aeruginosa PA14 (see 128 μg/mL; light pink and light blue bars in Fig. S3B).

      The data presented here show that, at least under laboratory conditions, targeting protein homeostasis pathways in specific recalcitrant pathogens has the potential to not only alter their own antibiotic resistance profiles (Fig. 3 and 4A-D), but also to influence the antibiotic susceptibility profiles of other bacteria that co-occur in the same conditions (Fig. 5). Admittedly, the conditions in a living host are too complex to draw direct conclusions from this experiment. That said, our results show promise for infections, where pathogen interactions affect treatment outcomes, and whereby their inhibition might facilitate treatment” (lines 381-406 of the revised manuscript).

      The alleged clinical relevance and immediate, theoretical application of this approach should be properly contextualized. At multiple junctures, the authors state or suggest that interactions between S. maltophilia and P. aeruginosa are known to occur in disease or have known clinical relevance related to treatment failure and disease states. For instance, the citations provided for S. maltophilia protection of P. aeruginosa in the CF lung environment both describe simplified laboratory experiments rather than clinical or in vivo observations. Similarly, the citations provided for both the role of S. maltophilia in treatment failure and CF disease severity do not support either claim. The role of S. maltophilia in CF is currently unsettled, with more recent work reporting conflicting results that support S. maltophilia as a marker, rather than cause, of severe disease. These citations also do not support the suggestion that S. maltophilia specifically contributes to treatment failure. While it is reasonable to pursue these ideas as a hypothesis or potential concern, there is no evidence provided that these specific interactions occur in vivo or that they have clinical relevance.

      Thank you for your comment. You are entirely correct. We have amended the text throughout our revised manuscript to avoid overstating the role of S. maltophilia in CF infections and to reference additional relevant works in the literature. Please find below representative examples of such passages:

      “On the other hand, CF microbiomes are increasingly found to encompass S. maltophilia [2-4], a globally distributed opportunistic pathogen that causes serious nosocomial respiratory and bloodstream infections [5-7]. S. maltophilia is one of the most prevalent emerging pathogens [6] and it is intrinsically resistant to almost all antibiotics, including β-lactams like penicillins, cephalosporins and carbapenems, as well as macrolides, fluoroquinolones, aminoglycosides, chloramphenicol, tetracyclines and colistin. As a result, the standard treatment option for lung infections, i.e., broad-spectrum β-lactam antibiotic therapy, is rarely successful in countering S. maltophilia [7,8], creating a definitive need for approaches that will be effective in eliminating both pathogens” (lines 33-41 of the revised manuscript).

      “Of the organisms studied in this work, S. maltophilia deserves further discussion because of its unique intrinsic resistance profile. The prognosis of CF patients with S. maltophilia lung carriage is still debated [4,9-16], largely because studies with extensive and well-controlled patient cohorts are lacking. This notwithstanding, the therapeutic options against this pathogen are currently limited to one non-β-lactam antibiotic-adjuvant combination, which is not always effective, trimethoprim-sulfamethoxazole [17-20], and a few last-line β-lactam drugs, like the fifth-generation cephalosporin cefiderocol and the combination aztreonam-avibactam. Resistance to commonly used antibiotics causes many problems during treatment and, as a result, infections that harbor S. maltophilia have high case fatality rates [7]. This is not limited to CF patients, as S. maltophilia is a major cause of death in children with bacteremia [5]” (lines 440-450 of the revised manuscript).

      Reviewer #3 (Public review):

      The impact of the work can be strengthened by demonstrating increased efficacy of antibiotics in mouse models or wound models for Pseudomonas infections. Worm models are relevant, but still distant from investigations in animal models.

      Thank you for this comment. We appreciate the sentiment, and we would have liked to be able to perform experiments in a murine model of infection. There are several reasons that made this not possible, and as a result we used G. mellonella as an informative preliminary in vivo infection model. The DSB proteins have been shown to play a central role in bacterial virulence. Because of this, our P. aeruginosa and S. maltophilia mutant strains are not efficient in establishing an infection, even in a wound model. This could be overcome had we been able to use the chemical inhibitor of the DSB system in vivo; however, this is also not possible, because the chemical compound that we use to inhibit the function of DsbA acts on DsbB. Inhibition of DsbB blocks the re-oxidation of DsbA and leads to its accumulation in its inactive reduced form. However, the action of the inhibitor can be bypassed through re-oxidation and re-activation of DsbA by small-molecule oxidants such as L-cystine, which are abundant in rich growth media or animal tissues. This makes the inhibitor only suitable for in vitro assays that can be performed in minimal media, where the presence of small-molecule oxidants can be strictly avoided, but entirely unsuitable for an insect or a vertebrate animal model.

      Reviewer #1 (Recommendation For the Authors):

      (1) The analysis of the role of DsbA in the assembly of cysteine-containing β-lactamases is a significant finding. However, in addition to showing the MIC fold difference, I think, it would be important to show the raw data for the actual MIC values obtained for each β-lactamase enzyme/antibiotic combination and in both strains (+ and - dsbA).

      Also, can the authors clarify whether these experiments were conducted on 3 independent samples (there seems to be some contradicting information in the paper and the supplementary figures). If possible, I would also recommend showing in the figure whether the MIC differences observed were statistically significant.

      All original data used to generate Fig. 1, Fig. 4E, Fig. S3 and Fig. S4A are presented in File S2. Tab (A) is dedicated to data used for Fig. 1 and Fig. S4A, while tabs (B) and (C) show the data used for Fig. 4E and S3, respectively. This information is indicated in the legends of the relevant figures.

      All experiments in this study were performed in three independent (biological) experiments (with the exception of the complementation data shown in Fig. S1 and Fig. S5, which were performed in two independent (biological) experiments). The number of biological and technical replicates for each experiment is stated in the figure legends, as well as in the “Statistical analysis of experimental data” part of the “Materials and Methods” section of the paper. Specifically, for antibiotic MIC assays we have not performed statistical analyses as per recommended practice. The reason for this is stated in the following section from the “Statistical analysis of experimental data” part of the “Materials and Methods” section of the paper (lines 699-711 of the revised manuscript):

      “Antibiotic MIC values were determined in biological triplicate, except for MIC values recorded for dsbA complementation experiments in our E. coli K-12 inducible system that were carried out in duplicate. All ETEST MICs were determined as a single technical replicate, and all BMD MICs were determined in technical triplicate. All recorded MIC values are displayed in the relevant graphs; for MIC assays where three or more biological experiments were performed, the bars indicate the median value, while for assays where two biological experiments were performed the bars indicate the most conservative of the two values (i.e., for increasing trends, the value representing the smallest increase and for decreasing trends, the value representing the smallest decrease). We note that in line with recommended practice, our MIC results were not averaged. This should be avoided because of the quantized nature of MIC assays, which only inform on bacterial survival for specific antibiotic concentrations and do not provide information for antibiotic concentrations that lie in-between the tested values.”

      (2) For Figure 2A, can the authors provide the full Westerns and ideally the SDS-PAGE gel corresponding to the Westerns where the β-lactamases and the control DNA-K were detected?

      Thank you for this comment. Full immunoblots and SDS PAGE analysis of the immunoblot samples for total protein content are shown in File S3 of our revised manuscript.

      (3) For the enzymatic assays, was the concentration of enzyme used "normalised" based on the amount detected in the westerns where possible, or was only the total amount of protein considered? When similar amounts of enzyme were added, was the activity still compromised?

      The β-lactam hydrolysis assay was normalized based on the weight of the cell pellets (wet cell pellet mass) of the tested strains. This means that for each enzyme expressed in cells with and without DsbA, strains were normalized to the same weight-to-volume ratio, and thus strains expressing the same enzyme were only compared to each other.

      Because enzyme degradation in the absence of DsbA is a key factor underlying the effects we describe for most of the tested β-lactamases (see Fig. 2A and S4A; no protein band is detected for 5 of the 7 enzymes in the dsbA mutant), it was not possible to normalize our samples based on enzyme levels detected by immunoblot. Normalization based on enzyme amounts would be feasible had we purified each β-lactamase after expression in the two different strain backgrounds (+/- dsbA) assuming sufficient protein amounts could be isolated from the dsbA mutant strain. Nonetheless, we feel that such a comparison would be misleading, since enzyme degradation likely plays the biggest role in the lack of activity observed for most of these enzymes in the absence of DsbA.

      (4) Not sure whether Fig 3 is very informative. Perhaps it could be redesigned to better encapsulate the findings in this manuscript (combine Figures 3 and 6 into one). I would also include the chemical structure of the inhibitors used and perhaps include how they block the system by binding to DsbB.

      Thank you for this comment. Fig. 3 was combined with Fig. 6 of the submitted manuscript. The new model figure is Fig. 5 in our revised manuscript.

      The inhibitor compound used in our study has been extensively characterized in a previous publication [21]. Considering that this inhibitor is not the main focus of our paper, we have avoided showing its chemical structure in any of the main display items. That said, its structure can be found in File S5 of our revised manuscript, which contains the quality control information on this compound. As suggested, we included the following sentence to describe the mode of action of this inhibitor: “Compound 36 was previously shown to inhibit disulfide bond formation in P. aeruginosa via covalently binding onto one of the four essential cysteine residues of DsbB in the DsbA-DsbB complex [21]” (lines 309-311 of the revised manuscript).

      (5) Figure 4: Similar to my comment above, showing in the figure whether the differences observed in Figure 4, particularly A-C, are statistically significant (i.e. Galleria survival difference in the presence and absence of dsbA) would be beneficial.

      As mentioned in our answer to comment 1 above, we have not performed statistical analyses for antibiotic MIC assays because, in line with recommended practice, our MIC results were not averaged (Fig. 3A,B,D,E of our revised manuscript). This should be avoided because of the quantized nature of MIC assays, which only inform on bacterial survival for specific antibiotic concentrations and do not provide information for antibiotic concentrations that lie in-between the tested values. Statistical analysis of G. mellonella survival data (Fig. 3C,F of our revised manuscript) was performed and is described fully in the legend of Fig. 3, as well as in the “Statistical analysis of experimental data” part of the “Materials and Methods” section of the paper (lines 729-738 of the revised manuscript). Finally, the statistical analyses for the most important comparisons in panels (C) and (F) of Fig. 3 are also marked directly on the figure.

      (6) Were the authors able to test the redox state of DsbA upon addition of the DsbB inhibitor to further demonstrate that the effects observed were indeed due to the obstruction of the Dsb machinery and not due to off-target effects?

      Thank you for the opportunity to clarify this. In previous work from our lab, we have used a DSB system inhibitor termed “compound 12” in [22] with activity against DsbB proteins from Enterobacteria. In our previous study [23] we, indeed, tested the redox state of DsbA in the presence of this inhibitor compound. We could not perform the same experiment here with “compound 36” from [21], because we do not have an antibody against the DsbA protein of S. maltophilia. That said, we have carried out experiments that confirm that our results are due to specific inhibition of the DSB system and not because of off-target effects. In particular, we show that the gentamicin MIC values of S. maltophilia AMM remain unchanged in the presence of the inhibitor and treatment of S. maltophilia AMM dsbA dsbL with the compound does not affect its colistin MIC value (Fig. S2E and lines 317-320 of the revised manuscript).

      (7) Given the remarkable effects shown by the DsbB inhibitor, did the authors use this compound to assess whether inhibition of the Dsb system with small molecules would block cross-resistance in S. maltophilia - P. aeruginosa mixed communities (Fig 5D)?

      Unfortunately, this was not possible. The decrease in the ceftazidime MIC value of S. maltophilia AMM in the presence of the DSB inhibitor compound is more modest than the effects we observed when the dsbA dsbL mutant is used (compare Fig. 4D (left) with Fig. 4A of the revised manuscript). This means that in the presence of the DSB inhibitor there are still sufficient amounts of functional β-lactamase present and we expect that they would contribute to cross-protection of P. aeruginosa. While the use of the DSB inhibitor does have a drastic impact on the colistin resistance profile of S. maltophilia AMM (Fig. 4D of the revised manuscript), unlike β-lactamases, which act as common goods, MCR enzymes act solely on the lipopolysaccharide of their producer and do not contribute to bacterial interactions, precluding the use of colistin for a cross-protection experiment.

      Reviewer #2 (Recommendation For the Authors):

      (1) The acronym used for synthetic cystic fibrosis sputum medium (lines 523, 531, 535, 601, and 603) is defined in the manuscript as 'SCF', but the common formulation is 'SCFM', including in the provided citation. Suggest changing to SCFM for consistency.

      Thank you for this comment. This has been amended throughout our revised manuscript.

      (2) In Figure 1, while the legend states that "No changes in MIC values are observed for strains harboring the empty vector control (pDM1)[...]" (lines 729-30), the median of ceftazidime in the pDM1 control appears to indicate a 2-fold decrease in MIC. This would not seem to significantly impact the other results since the MIC decreases observed for other conditions are all 3-fold or greater, but this should be addressed and/or explained in the text.

      You are correct. Thank you for the opportunity to clarify this. Since MIC assays have a degree of variability, we have only followed decreases in MIC values that are greater than 2-fold. For most of our controls, the recorded MIC fold changes are below 2-fold. The only exception to this is the ceftazidime MIC drop of the empty-vector control, showing a 2-fold change, which we do not consider significant.

      To ensure that this is clear in our text and figure legends the following changes were made:

      The clause “only differences larger than 2-fold were considered” was added to the text (lines 110-111 of the revised manuscript).

      We amended the legend of Fig. 1 accordingly: “No changes in MIC values are observed for the aminoglycoside antibiotic gentamicin (white bars) confirming that absence of DsbA does not compromise the general ability of this strain to resist antibiotic stress. Minor changes in MIC values (≤ 2-fold) are observed for strains harboring the empty vector control (pDM1) or those expressing the class A β-lactamases L2-1 and LUT-1, which contain two or more cysteines (Table S1), but no disulfide bonds (top row)”.
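
      The threshold described in this response can be sketched as a simple check. The function names and the MIC values in the example are hypothetical; only the "larger than 2-fold" rule comes from the text above.

```python
FOLD_THRESHOLD = 2.0  # from the response: only differences larger than 2-fold count


def fold_decrease(mic_reference: float, mic_mutant: float) -> float:
    """Fold decrease in MIC of the mutant relative to the reference strain."""
    return mic_reference / mic_mutant


def is_considered(mic_reference: float, mic_mutant: float) -> bool:
    """A decrease is followed up only if it is strictly greater than 2-fold."""
    return fold_decrease(mic_reference, mic_mutant) > FOLD_THRESHOLD


# Hypothetical MICs in ug/mL.
print(is_considered(8, 4))    # exactly 2-fold, like the empty-vector control
print(is_considered(12, 4))   # 3-fold decrease
```

      With a strict inequality, a drop of exactly 2-fold, such as the one seen for the empty-vector control, falls within assay variability and is not treated as an effect.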

      (3) Similarly, in Fig S1E, there appears to be only partial complementation for BPS-1m. Do the authors hypothesize that this observation is related to a folding defect, rather than degradation of protein, as described for BPS-1m for Figure 2?

      Thank you for the opportunity to clarify this. You are correct that we only achieve partial complementation for the E. coli strain expressing the BPS-1m enzyme from the Burkholderia complex. Despite the fact that the gene for this enzyme was codon optimized, we observed that its expression in E. coli is sub-optimal and incurs fitness effects. In fact, to record the data presented in our manuscript the E. coli strains had to be transformed anew every time. Considering that the related enzyme BPS-6 does not present any of these challenges, we attribute the partial complementation to technical difficulties with the expression of the bps-1m gene in E. coli. 

      We clarified this by adding the following clause to our manuscript: “we only achieve partial complementation for the dsbA mutant expressing BPS-1m, which we attribute to the fact that expression of this enzyme in E. coli is sub-optimal” (lines 132-134 of the revised manuscript).

      (4) Lines 204-206: "[...]we deleted the principal dsbA gene, dsbA1 (pathogenic bacteria often encode multiple DsbA analogues [24,25]), in several multidrug-resistant (MDR) P. aeruginosa clinical strains (Table S2)". That multiple DsbA analogues are often encoded is good information to provide, but it was unclear from quickly looking at the citations whether Pa is counted among these. Is it expected that all oxidative protein folding in Pa functions through DsbA1? Conveying this information, if possible, may make the impact of the results in this model clearer.

      Thank you for this comment. To address it we added the following text to our manuscript:

      “To determine whether the effects on β-lactam MICs observed in our inducible system (Fig. 1 and [23]) can be reproduced in the presence of other resistance determinants in a natural context with endogenous enzyme expression levels, we deleted the principal dsbA gene, dsbA1, in several multidrug-resistant (MDR) P. aeruginosa clinical strains (Table S2). Pathogenic bacteria often encode multiple DsbA analogues [24,25] and P. aeruginosa is no exception. It encodes two DsbAs, but DsbA1 has been found to catalyze the vast majority of the oxidative protein folding reactions taking place in its cell envelope [26]” (lines 172-178 of the revised manuscript).

      (5) Regarding the clinical Pa isolates G4R7 and G6R7, have the authors performed any phenotypic testing on these strains to identify differences that might explain the substantial difference in piperacillin MIC? I.e., can these isolates be distinguished by growth rate, genetic markers or expression levels, early or late infection, mucoidy, etc. This is not essential for the current work, but could weigh on the efficacy of this treatment strategy for AIM1-expressing clinical isolates. (E.g., the G4R7 dsbA1 strain exhibits a piperacillin MIC still ~2-fold higher than WT G6R7).

      Thank you for the opportunity to clarify this. For clinical strains used in our study, we have evaluated their antibiotic resistance profiles, but we have not performed any additional phenotypic characterization. There are many reasons that contribute to differences in antibiotic resistance, starting simply from β-lactamase expression levels and extending to organismal effects, like the ones mentioned by the reviewer. Such characterization would fall outside the scope of our paper, especially since we sensitize our tested P. aeruginosa clinical isolates for the majority of the β-lactam antibiotics tested.

      We acknowledged this by adding the following sentence to our revised manuscript: 

      “Despite the fact that P. aeruginosa G4R7 dsbA1 was not sensitized for piperacillin-tazobactam, possibly due to the high level of piperacillin-tazobactam resistance of the parent clinical strain, our results across these two isolates show promise for DsbA as a target against β-lactam resistance in P. aeruginosa” (lines 191-194 of the revised manuscript).

      (6) Lines 180-2: "This shows that without their disulfide bonds, these proteins are unstable and are ultimately degraded by other cell envelope proteostasis components [33]". While it is clear that protein is significantly lost in all cases except for BPS-1m in 2A, the dsbA pDM1bla constructs in 2B appear to all retain non-trivial (>10-fold) nitrocefin hydrolysis activity compared to the dsbA pDM1 control. This does not impact the other results in 2B, but it would seem that a loss-of-function folding defect, as described subsequently for BPS-1m, is also part of the explanation for the observed MIC decreases, and this was not necessarily clear from the quoted passage. This could simply be clarified in the final sentence - that both mechanisms are potentially in play - if the authors agree with that interpretation.

      You are correct, thank you for your comment. We amended the text in our revised manuscript as follows: 

      “The data presented so far (Fig. 1 and 2) demonstrate that disulfide bond formation is essential for the biogenesis (stability and/or protein folding) and, in turn, activity of an expanded set of clinically important β-lactamases, including enzymes that currently lack inhibitor options” (lines 158-161 of the revised manuscript).

      (7) While it is clear from Figure S2 that the various dsb mutants do not have a general growth defect or collateral sensitivity to another antibiotic, it does not appear that there is an analogous control for the DSB inhibitor demonstrating no growth/toxic effects at the concentration used. This could be provided similarly to Figure S2, using gentamicin as a control antibiotic.

      We have carried out experiments that confirm that our results are due to specific inhibition of the DSB system and not because of off-target effects. In particular, we show that the gentamicin MIC values of S. maltophilia AMM remain unchanged in the presence of the inhibitor and treatment of S. maltophilia AMM dsbA dsbL with the compound does not affect its colistin MIC value (Fig. S2E and lines 317-320 of the revised manuscript).

      (8) Complementation is appropriately provided for experiments with E. coli, but is not provided for P. aeruginosa or S. maltophilia. It should be straightforward to complement in Pa, but is also probably less critical considering the evidence from E. coli. However, since the Sm mutant is a gene cluster with two genes, it would seem more imperative to complement this strain. This reviewer is not familiar enough with Sm to know if complementation is routine or feasible with this organism; if not, the controls for the DSB inhibitor should at least be provided.

      As mentioned in our response to comment 7 above, we have carried out experiments that confirm that our DSB inhibitor results are due to specific inhibition of the DSB system and not because of off-target effects.

      Moreover, in response to this comment, we have further demonstrated that our results are due to the specific interaction of DsbA with β-lactamase enzymes by complementing dsbA deletions in representative clinical strains of multidrug-resistant Pseudomonas aeruginosa and extremely-drug-resistant Stenotrophomonas maltophilia. We would like to note here that gene complementation in clinical isolates remains very rare in the literature due to their high levels of resistance and limited genetic tractability. Most of the few complementation examples reported for these two organisms are limited to strains that, although pathogenic, are commonly used in the lab, or to complementation efforts in non-clinical strain systems (for example use of P. aeruginosa PA14 for complementation, instead of the focal clinical isolate).

      We tested three different complementation strategies, two of which ended up being unsuccessful. After approximately 9 months of work, we succeeded in complementing a representative clinical strain for each organism (P. aeruginosa CDC #769 dsbA1 and S. maltophilia AMM dsbA dsbL) by inserting the dsbA1 gene from P. aeruginosa PAO1 into the Tn7 site on the chromosome. Both clinical strains show full complementation for every antibiotic tested; our complementation results can be found in Fig. S2B,D of the revised manuscript.

      The following text was added for P. aeruginosa clinical isolates:

      “We have demonstrated the specific interaction of DsbA with the tested β-lactamase enzymes in our E. coli K-12 inducible system using gentamicin controls (Fig. 1 and File S2A) and gene complementation (Fig. S1). To confirm the specificity of this interaction in P. aeruginosa, we performed representative control experiments in one of our clinical strains, P. aeruginosa CDC #769. We first tested the general ability of P. aeruginosa CDC #769 dsbA1 to resist antibiotic stress by recording MIC values against gentamicin, and found it unchanged compared to its parent (Fig. S2A). Gene complementation in clinical isolates is especially challenging and rarely attempted due to the high levels of resistance and lack of genetic tractability in these strains. Despite these challenges, to further ensure the specificity of the interaction of DsbA with tested β-lactamases in P. aeruginosa, we have complemented dsbA1 from P. aeruginosa PAO1 into P. aeruginosa CDC #769 dsbA1. We found that complementation of dsbA1 restores MICs to wild-type values for both tested β-lactam compounds (Fig. S2B) further demonstrating that our results in P. aeruginosa clinical strains are not confounded by off-target effects” (lines 226-239 of the revised manuscript).

      The following text was added for S. maltophilia clinical isolates: 

      “Since the dsbA and dsbL are organized in a gene cluster in S. maltophilia, we wanted to ensure that our results reported above were exclusively due to disruption of disulfide bond formation in this organism. First, we recorded gentamicin MIC values for S. maltophilia AMM dsbA dsbL and found them to be unchanged compared to the gentamicin MICs of the parent strain (Fig. S2C). This confirms that disruption of disulfide bond formation does not compromise the general ability of this organism to resist antibiotic stress. Next, we complemented S. maltophilia AMM dsbA dsbL. The specific oxidative roles and exact regulation of DsbA and DsbL in S. maltophilia remain unknown. For this reason and considering that genetic manipulation of extremely-drug-resistant organisms is challenging, we used our genetic construct optimized for complementing P. aeruginosa CDC #769 dsbA1 with dsbA1 from P. aeruginosa PAO1 (Fig. S2B) to also complement S. maltophilia AMM dsbA dsbL. We based this approach on the fact that DsbA proteins from one species have been commonly shown to be functional in other species [27-30]. Indeed, we found that complementation of S. maltophilia AMM dsbA dsbL with P. aeruginosa PAO1 dsbA1 restores MICs to wild-type values for both ceftazidime and colistin (Fig. S2D), conclusively demonstrating that our results in S. maltophilia are not confounded by off-target effects” (lines 282-297 of the revised manuscript).

      (9) In Figure 5E, the growth inhibition and loss of Pa CFU in 4 ug/mL ceftazidime for the Sm co-culture condition, which is subsequently lost in the Sm dsbA dsbL co-culture, does not appear to be discussed. As Pa is shown to grow fine in monoculture at this concentration, this result should be discussed in relation to the co-culture dynamics. Is it expected or observed that WT Sm is out-competing Pa under this condition and growing to a high CFU/mL? This would seem to have parallels to citation 49.

      As requested by this reviewer (see comment 10 below), we simultaneously tracked the abundance of P. aeruginosa and S. maltophilia strains in our cross-protection experiment. During this process we probed the abundances of the two organisms at 4 µg/mL of ceftazidime. Our results can be seen in Fig. S3B of the revised manuscript. The reviewer is correct and these effects are due to competition between P. aeruginosa and S. maltophilia with the latter being able to reach very high CFUs in this antibiotic concentration. 

      The following text on co-culture dynamics was added to our revised manuscript: 

      “At low antibiotic concentrations, for example 4 μg/mL of ceftazidime, S. maltophilia AMM is fully resistant and thrives, thus outcompeting P. aeruginosa PA14 (dark pink and dark blue bars in Fig. S3B). The same can also be seen in Fig. 4E, whereby decreased P. aeruginosa PA14 CFUs are recorded. By contrast S. maltophilia AMM dsbA dsbL already displays decreased growth at 4 μg/mL of ceftazidime because of its non-functional L1-1 enzyme, allowing comparatively higher growth of P. aeruginosa (light pink and light blue bars in Fig. S3B)” (lines 384-390 of the revised manuscript).

      (10) The data presented in Figure 5E would be augmented by the inclusion of, for at least a few representative cases, the Sm CFUs relative to the Pa CFUs. In describing the protective effects of Sm on Pa for imipenem treatment, the authors of citation 12 note that the effect was dependent on Sm cell density. This raises the immediate question of whether the protection observed in this work is similarly dependent on cell density of Sm. It is unclear if the authors expect Sm to persist under these conditions, and it seems Sm CFU should be expected to be relatively high considering it is pre-incubated for 6 hours prior to the assay. What is the physiological state of these cells, and how are they affected by ceftazidime? While many other variables are likely relevant to the translation of this protection, the relative abundance and localization of Sm and Pa commonly observed in CF patients, as well as the effective concentration of antibiotic observed in vivo, is likely worth consideration.

      As mentioned in our response to comment 9 above, we have simultaneously tracked the abundance of P. aeruginosa and S. maltophilia strains in our cross-protection experiment for select antibiotic concentrations. To be able to perform this experiment, we had to label two extremely-drug-resistant strains of S. maltophilia with an antibiotic resistance marker that allowed us to quantify them in mixtures with P. aeruginosa. Our results can be found in Fig. S3 of our revised manuscript and, in a nutshell, show that ceftazidime treatment leads to eradication of both P. aeruginosa and S. maltophilia when disulfide bond formation is impaired in S. maltophilia.

      The following text was added to address the questions of the reviewer:

      “Due to the naturally different growth rates of these two species (S. maltophilia grows much slower than P. aeruginosa) especially in laboratory conditions, the protocol we followed [1] requires S. maltophilia to be grown for 6 hours prior to co-culturing it with P. aeruginosa. To ensure that at this point in the experiment our two S. maltophilia strains, with and without dsbA, had grown comparatively to each other, we determined their cell densities (Fig. S3A). We found that S. maltophilia AMM dsbA dsbL had grown at a similar level as the wild-type strain, and both were at a higher cell density [~10⁷ colony forming units (CFUs)] compared to the P. aeruginosa PA14 inoculum (5 × 10⁴ CFUs)” (lines 353-361 of the revised manuscript).

      “To ensure that ceftazidime treatment leads to eradication of both P. aeruginosa and S. maltophilia when disulfide bond formation is impaired in S. maltophilia, we monitored the abundance of both strains in each synthetic community for select antibiotic concentrations (Fig. S3B). In this experiment we largely observed the same trends as in Fig. 4E. At low antibiotic concentrations, for example 4 μg/mL of ceftazidime, S. maltophilia AMM is fully resistant and thrives, thus outcompeting P. aeruginosa PA14 (dark pink and dark blue bars in Fig. S3B). The same can also be seen in Fig. 4E, whereby decreased P. aeruginosa PA14 CFUs are recorded. By contrast S. maltophilia AMM dsbA dsbL already displays decreased growth at 4 μg/mL of ceftazidime because of its non-functional L1-1 enzyme, allowing comparatively higher growth of P. aeruginosa (light pink and light blue bars in Fig. S3B). Despite the competition between the two strains, P. aeruginosa PA14 benefits from S. maltophilia AMM’s high hydrolytic activity against ceftazidime, which allows it to survive and grow in high antibiotic concentrations even though it is not resistant (see 128 μg/mL; dark pink and dark blue bars in Fig. S3B). In stark opposition, without its disulfide bond in S. maltophilia AMM dsbA dsbL, L1-1 cannot confer resistance to ceftazidime, resulting in killing of S. maltophilia AMM dsbA dsbL and, consequently, also of P. aeruginosa PA14 (see 128 μg/mL; light pink and light blue bars in Fig. S3B).

      The data presented here show that, at least under laboratory conditions, targeting protein homeostasis pathways in specific recalcitrant pathogens has the potential to not only alter their own antibiotic resistance profiles (Fig. 3 and 4A-D), but also to influence the antibiotic susceptibility profiles of other bacteria that co-occur in the same conditions (Fig. 5). Admittedly, the conditions in a living host are too complex to draw direct conclusions from this experiment. That said, our results show promise for infections, where pathogen interactions affect treatment outcomes, and whereby their inhibition might facilitate treatment” (lines 381-406 of the revised manuscript).

      (11) Regarding the role of microbial interactions in CF and other disease/infection contexts, the authors should temper their descriptions in accordance with citations provided. As an example, lines 96-99: "For example, in the CF lung, highly drug-resistant S. maltophilia strains actively protect susceptible P. aeruginosa from β-lactam antibiotics [12], and ultimately facilitate the evolution of β-lactam resistance in P. aeruginosa [14]."

      Neither citation provided here attests to Sm protection of Pa "in the CF lung". Both papers use a simplified in vitro co-culture model to assess Sm protection of Pa from antibiotics and the evolution of Pa antibiotic resistance in the presence or absence of Sm, respectively. In the latter case, it should also be noted that while the authors observed somewhat faster Pa resistance evolution in one co-culture condition, they did not observe it in the other, and that resistance evolution in general was observed regardless of co-culture condition. There are also statements in the ultimate and penultimate paragraphs of the Discussion section that repeat these points. The authors could re-frame this aspect of their investigation as part of a working hypothesis related to potential interactions of these pathogens, and should appropriately caveat what is and is not known from in vitro and in vivo/clinical work.

      Thank you for your comment. You are entirely correct. We have amended the text throughout our revised manuscript to avoid overstating these findings and to be clear about the fact that they originate from experimental studies. Please find below representative examples of such passages:

      “In particular, some antibiotic resistance proteins, like β-lactamases, which decrease the quantities of active drug present, function akin to common goods, since their benefits are not limited to the pathogen that produces them but can be shared with the rest of the bacterial community. This means that their activity enables pathogen cross-resistance when multiple species are present [1,31], something that was demonstrated in recent work investigating the interactions between pathogens that naturally co-exist in CF infections. More specifically, it was shown that in laboratory co-culture conditions, highly drug-resistant S. maltophilia strains actively protect susceptible P. aeruginosa from β-lactam antibiotics [1]. Moreover, this cross-protection was found to facilitate, at least under specific conditions, the evolution of β-lactam resistance in P. aeruginosa [32]” (lines 47-57 of the revised manuscript).

      “The antibiotic resistance mechanisms of S. maltophilia impact the antibiotic tolerance profiles of other organisms that are found in the same infection environment. S. maltophilia hydrolyses all β-lactam drugs through the action of its L1 and L2 β-lactamases [7,8]. In doing so, it has been experimentally shown to protect other pathogens that are, in principle, susceptible to treatment, such as P. aeruginosa [1]. This protection, in turn, allows active growth of otherwise treatable P. aeruginosa in the presence of complex β-lactams, like imipenem [1], and, at least in some conditions, increases the rate of resistance evolution of P. aeruginosa against these antibiotics [32]” (lines 332-340 of the revised manuscript).

      (12) Regarding the role of S. maltophilia in CF disease, the authors should either discuss clinical associations more completely or note the conflicting data on its role in disease. As an example, lines 84-87: "As a result, the standard treatment option, i.e., broad-spectrum β-lactam antibiotic therapy, constitutes a severe risk for CF patients carrying both P. aeruginosa and S. maltophilia [10,11], creating an urgent need for antimicrobial approaches that will be effective in eliminating both pathogens."

      It is unclear how this treatment results in a "severe risk" for CF patients colonized by both Sm and Pa. Citation 10 suggests an association between anti-pseudomonal antibiotic use and increased prevalence of Sm, but neither citation supports a worsening clinical outcome from this treatment. Citation 10 further notes that clinical scores between Sm-positive and control cohorts could not be distinguished statistically. Citation 11 is a review that makes note of this conflicting data regarding Sm, including reference to a more recent (at the time) result using multivariate analysis showing no independent effect of Sm on survival.

      The above point similarly applies to other statements in the manuscript, for example at lines 266-267: "Considering the contribution of S. maltophilia strains to treatment failure in CF lung infections [8,10,11][...]" As well as lines 79-80: "Pulmonary exacerbations and severe disease states are also associated with the presence of S. maltophilia [8]"

      Again, the provided citations do not support the implication that Sm specifically 'contributes to treatment failure in CF lung infections' or that Sm is specifically associated with severe disease states. In addition to the previously discussed citations, citation 8 describes broad "pulmotypes" composed of 10 species/genera that could be associated with particular clinical (e.g., exacerbation) or treatment (e.g., antibiotic therapy) characteristics, but these cannot, without further analysis, be associated with, or causally linked to, a specific pathogen. While pulmotype 2 in citation 8 was associated with a more severe clinical state and appeared to have the highest relative abundance of Sm compared to other pulmotypes, Sm was not identified (Figure 4A) as an independent factor that distinguishes between moderate and severe disease, unlike Pa and some anaerobes (4F-H). The authors also observed that decreasing relative abundance of Pa, in particular, is correlated with subsequent exacerbation, but did not correlate this with the presence of any other species or genera. Again, this should be re-framed with the appropriate caveat that this is a hypothesis with possible clinical significance.

      Several suggested papers are included below on Sm association with clinical characteristics to incorporate into the manuscript if the authors choose to do so:

      https://doi.org/10.1177/14782715221088909

      https://doi.org/10.1016/j.prrv.2010.07.003

      https://doi.org/10.1016/j.jcf.2013.05.009

      https://doi.org/10.1002/ppul.23943

      https://doi.org/10.1002/14651858.CD005405.pub2

      https://doi.org/10.1164/rccm.2109078

      http://dx.doi.org/10.1136/thx.2003.017707

      https://erj.ersjournals.com/content/23/1/98.short

      Thank you for your comment. You are entirely correct. We have amended the text throughout our revised manuscript to avoid overstating the role of S. maltophilia in CF infections and to reference additional relevant works in the literature. Please find below representative examples of such passages:

      “On the other hand, CF microbiomes are increasingly found to encompass S. maltophilia [2-4], a globally distributed opportunistic pathogen that causes serious nosocomial respiratory and bloodstream infections [5-7]. S. maltophilia is one of the most prevalent emerging pathogens [6] and it is intrinsically resistant to almost all antibiotics, including β-lactams like penicillins, cephalosporins and carbapenems, as well as macrolides, fluoroquinolones, aminoglycosides, chloramphenicol, tetracyclines and colistin. As a result, the standard treatment option for lung infections, i.e., broad-spectrum β-lactam antibiotic therapy, is rarely successful in countering S. maltophilia [7,8], creating a definitive need for approaches that will be effective in eliminating both pathogens” (lines 33-41 of the revised manuscript).

      “Of the organisms studied in this work, S. maltophilia deserves further discussion because of its unique intrinsic resistance profile. The prognosis of CF patients with S. maltophilia lung carriage is still debated [4,9-16], largely because studies with extensive and well-controlled patient cohorts are lacking. This notwithstanding, the therapeutic options against this pathogen are currently limited to one non-β-lactam antibiotic-adjuvant combination, trimethoprim-sulfamethoxazole [17-20], which is not always effective, and a few last-line β-lactam drugs, like the fifth-generation cephalosporin cefiderocol and the combination aztreonam-avibactam. Resistance to commonly used antibiotics causes many problems during treatment and, as a result, infections that harbor S. maltophilia have high case fatality rates [7]. This is not limited to CF patients, as S. maltophilia is a major cause of death in children with bacteremia [5]” (lines 440-450 of the revised manuscript).

      Reviewer #3 (Recommendation For the Authors):

      (1) The referencing of supplemental figures does not follow a sequential order. For example, Figure S2 appears in the text before S1. The sequential ordering of figure numbers improves the readability and can be considered while editing the manuscript for revision.

      Thank you for this comment. This is amended in our revised manuscript and supplemental figures and files are cited in order.

      (2) It will be useful to provide a brief description of Ambler classes since these are important to study design (for a broader audience).

      Thank you for this suggestion. This has been added and can be found in lines 91-101 of the revised manuscript.

      (3) The rationale for using K12 strain for E. coli should be provided. It appears that is a model system that is well established in their lab, but a scientific rationale can be listed. Maybe this strain does not have any lactamases in its genome other than the one being expressed as compared to pathogenic E. coli?

      Thank you for this suggestion. This has been added and can be found in lines 104-106 of the revised manuscript.

      (4) The authors used a worm model to test their observations, which is relevant. Given the significant implications of their work in overcoming resistance to clinically used antibiotics and availability of already generated dsbA mutants in clinical strains, it will be useful to investigate survival in animal models or at least wound models of Pseudomonas infections. The reviewer does not deem this necessary, but it will significantly increase the impact of their seminal work.

      Thank you for this comment. We appreciate the sentiment, and we would have liked to be able to perform experiments in a murine model of infection. There are several reasons that made this not possible, and as a result we used G. mellonella as an informative preliminary in vivo infection model. The DSB proteins have been shown to play a central role in bacterial virulence. Because of this our P. aeruginosa and S. maltophilia mutant strains are not efficient in establishing an infection, even in a wound model. This could be overcome had we been able to use the chemical inhibitor of the DSB system in vivo; however, this is also not possible. This is due to the fact that the chemical compound that we use to inhibit the function of DsbA acts on DsbB. Inhibition of DsbB blocks the re-oxidation of DsbA and leads to its accumulation in its inactive reduced form. However, the action of the inhibitor can be bypassed through reoxidation and re-activation of DsbA by small-molecule oxidants such as L-cystine, which are abundant in rich growth media or animal tissues. This makes the inhibitor only suitable for in vitro assays that can be performed in minimal media, where the presence of small-molecule oxidants can be strictly avoided, but entirely unsuitable for an insect or a vertebrate animal model.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Dixit, Noe, and Weikl apply coarse-grained and all-atom molecular dynamics to determine the response of the mechanosensitive proteins Piezo 1 and Piezo 2 to tension. Cryo-EM structures in micelles show a high curvature of the protein whereas structures in lipid bilayers show lower curvature. Is the zero-stress state of the protein closer to the micelle structure or the bilayer structure? Moreover, while the tension sensitivity of channel function can be inferred from the experiment, molecular details are not clearly available. How much does the protein's height and effective area change in response to tension? With these in hand, a quantitative model of its function follows that can be related to the properties of the membrane and the effect of external forces.

      Simulations indicate that in a bilayer the protein relaxes from the highly curved cryo-EM dome (Figure 1). 

      Under applied tension, the dome flattens (Figure 2) including the underlying lipid bilayer. The shape of the system is a combination of the membrane mechanical and protein conformational energies (Equation 1). The membrane's mechanical energy is well-characterized. It requires only the curvature and bending modulus as inputs. They determine membrane curvature and the local area metric (Equation 4) by averaging the height on a grid and computing second derivatives (Equations 7, 8) consistent with known differential geometric formulas. 
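
      The kind of calculation summarized here can be illustrated with a minimal sketch. It is not the authors' code: the function names, the central-difference stencils, and the Gaussian test shape are assumptions made for this example, and the manuscript's Equations 4, 7, and 8 may differ in detail. The sketch takes a height field h(x, y) on a square lattice, evaluates the mean curvature M and the area metric in the Monge parametrization, and integrates the bending energy 2κ∫M² dA and the excess area.

```python
import numpy as np


def curvature_and_metric(h, a):
    """Mean curvature M and area metric sqrt(g) on the interior lattice sites.

    h : 2D array of membrane midplane heights (nm), indexed as h[ix, iy]
    a : lattice constant (nm)
    """
    c = h[1:-1, 1:-1]
    # First derivatives: central differences.
    hx = (h[2:, 1:-1] - h[:-2, 1:-1]) / (2 * a)
    hy = (h[1:-1, 2:] - h[1:-1, :-2]) / (2 * a)
    # Second derivatives: three-point stencils and central mixed difference.
    hxx = (h[2:, 1:-1] - 2 * c + h[:-2, 1:-1]) / a**2
    hyy = (h[1:-1, 2:] - 2 * c + h[1:-1, :-2]) / a**2
    hxy = (h[2:, 2:] - h[2:, :-2] - h[:-2, 2:] + h[:-2, :-2]) / (4 * a**2)
    g = 1 + hx**2 + hy**2
    # Mean curvature of a surface z = h(x, y) in Monge parametrization.
    M = ((1 + hy**2) * hxx - 2 * hx * hy * hxy + (1 + hx**2) * hyy) / (2 * g**1.5)
    return M, np.sqrt(g)


def bending_energy(h, a, kappa=1.0):
    """Helfrich bending energy 2*kappa*integral(M^2 dA), zero spontaneous curvature."""
    M, sqrt_g = curvature_and_metric(h, a)
    return 2.0 * kappa * np.sum(M**2 * sqrt_g) * a**2


def excess_area(h, a):
    """Membrane area minus projected area, integral((sqrt(g) - 1) dx dy)."""
    _, sqrt_g = curvature_and_metric(h, a)
    return np.sum(sqrt_g - 1.0) * a**2


if __name__ == "__main__":
    a = 1.0  # nm, assumed lattice constant for this toy example
    x = np.arange(-40.0, 40.0 + a, a)
    X, Y = np.meshgrid(x, x, indexing="ij")
    h = 3.0 * np.exp(-(X**2 + Y**2) / (2 * 6.0**2))  # hypothetical dome
    print("bending energy / kappa:", bending_energy(h, a))
    print("excess area (nm^2):", excess_area(h, a))
```

      Only the bending modulus κ and the shape enter the membrane energy in this sketch, which is the point made in the summary; boundary sites are dropped because the stencils need neighbors on both sides.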

      The bending energy can be limited to the nano dome but this implies that the noise in the membrane energy is significant. Where there is noise outside the dome there is noise inside the dome. At the least, they could characterize the noisy energy due to inadequate averaging of membrane shape. 

      My concern for this paper is that they are significantly overestimating the membrane deformation energy based on their numerical scheme, which in turn leads to a much stiffer model of the protein itself.

      We agree that “thermal noise” is intrinsic to MD simulations, as in “real” systems, leading to thermally excited shape fluctuations of membranes and conformational fluctuations of proteins. However, for our coarse-grained simulations, the thermally excited membrane shape fluctuations can be averaged out quite well, and the resulting average shapes are smooth, see e.g. the shapes and lines of the contour plots in Fig. 1 and 2. For our atomistic simulations, the averaged shapes are not as smooth, see Fig. 3a and the lines of the contour plots in Fig. 3b. Therefore, we do not report bending energies for the nanodome shapes determined from atomistic simulations, because bending energy calculations are sensitive to remaining “noise” on small scales (due to the scale invariance of the bending energy), in contrast to calculations of excess areas, which we state now on lines 620ff.
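
      The different noise sensitivity of the two quantities can be illustrated with a toy calculation. The sketch below is not taken from the manuscript: it assumes a hypothetical Gaussian dome (height 3 nm, width 6 nm), uncorrelated noise with a standard deviation of 0.02 nm on every lattice site, and the small-gradient approximation, in which the bending energy is (κ/2)∫(∇²h)² dA and the excess area is (1/2)∫|∇h|² dA.

```python
import numpy as np


def small_gradient_energies(h, a, kappa=1.0):
    """Bending energy (units of kappa) and excess area (nm^2), small-gradient limit."""
    c = h[1:-1, 1:-1]
    # Five-point Laplacian and central-difference gradient on interior sites.
    lap = (h[2:, 1:-1] + h[:-2, 1:-1] + h[1:-1, 2:] + h[1:-1, :-2] - 4 * c) / a**2
    hx = (h[2:, 1:-1] - h[:-2, 1:-1]) / (2 * a)
    hy = (h[1:-1, 2:] - h[1:-1, :-2]) / (2 * a)
    e_bend = 0.5 * kappa * np.sum(lap**2) * a**2
    d_area = 0.5 * np.sum(hx**2 + hy**2) * a**2
    return e_bend, d_area


a = 1.0                      # nm, assumed lattice constant
h0, sigma = 3.0, 6.0         # nm, hypothetical dome height and width
noise_std = 0.02             # nm, assumed residual roughness after averaging

x = np.arange(-30.0, 30.0 + a, a)
X, Y = np.meshgrid(x, x, indexing="ij")
smooth = h0 * np.exp(-(X**2 + Y**2) / (2 * sigma**2))

rng = np.random.default_rng(0)
noisy = smooth + rng.normal(0.0, noise_std, size=smooth.shape)

e_smooth, a_smooth = small_gradient_energies(smooth, a)
e_noisy, a_noisy = small_gradient_energies(noisy, a)

rel_bend = (e_noisy - e_smooth) / e_smooth
rel_area = (a_noisy - a_smooth) / a_smooth
print(f"relative change in bending energy: {rel_bend:.2f}")
print(f"relative change in excess area:    {rel_area:.3f}")
```

      With these stencils, uncorrelated noise of standard deviation ε adds on average about 10κε²/a² per lattice site to the bending energy but only ε²/2 per site to the excess area, so the same residual roughness inflates the bending energy far more strongly. The numbers depend entirely on the assumed noise level and are not an estimate of the roughness in the simulations.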

      For our coarse-grained simulations, we now corroborate our bending energy calculations based on averaged 3d shapes by comparing to bending energy values obtained from highly smoothened 2d mean curvature profiles (see Fig. 1c for mean curvature profiles in tensionless membranes). We discuss this in detail from line 323 on, starting with:

      “To corroborate our bending energy calculations for these averaged three-dimensional nanodome shapes, we note that essentially identical bending energies can be obtained from the highly smoothened mean curvatures M of the two-dimensional membrane profiles. …”

      Two things would address this: 

      (1) Report the membrane energy under different graining schemes (e.g., report schemes up to double the discretization grain). 

      There are two graining schemes in the modeling, and we have followed the reviewer’s recommendation regarding the second scheme. In the first, more central graining scheme, we use quadratic membrane patches with a sidelength of about 2 nm to determine membrane midplane shapes and lipid densities of each simulation conformation. This graining scheme has also been previously employed in Hu, Lipowsky, Weikl, PNAS 38, 15283 (2013) to determine the shape and thermal roughness of coarse-grained membranes. A sidelength of 2 nm is necessary to have sufficiently many lipid headgroups in the upper and lower leaflet in the membrane patches for estimating the local height of these leaflets, and the local membrane midplane height as average of these leaflet heights (see subsection “Membrane shape of simulation conformation” in the Methods section for details).  However, we strongly believe that doubling the sidelength of membrane patches in this discretization is not an option, because a discretization length of 4 nm is too coarse to resolve the membrane deformations in the nanodome, see e.g. the profiles in Fig. 1b. Moreover, any “noise” from this discretization is rather completely smoothened out in the averaging process used in the analysis of the membrane shapes, at least for the coarse-grained simulations. This averaging process requires rotations of membrane conformations to align the protein orientations of the conformations (see subsection “Average membrane shapes and lipid densities” for details). Because of these rotations, the original discretization is “lost” in the averaging, and a continuous membrane shape is generated. To calculate the excess areas and bending energies for this smooth, continuous membrane shape, we use a discretization of the Monge plane into a square lattice with lattice parameter 1 nm. 
As a response to the referee’s suggestion, we now report that the results for the excess area do not change significantly when doubling this lattice parameter to 2 nm. On line 597, we write:

      “For a lattice constant of a=2 nm, we obtain extrapolated values of the excess area Delta A from the coarse-grained simulations that are 2 to 3% lower than the values for a=1 nm, which is small compared to statistical uncertainties with relative errors of around 10%.”

      On lines 614ff, we now state that the bending energy results are about 10% to 13% lower for a=2 nm, likely because of the lower resolution of the curvature in the nanodome compared to a=1 nm, rather than incomplete averaging and remaining roughness of the coarse-grained nanodome shapes.

      (2) For a Gaussian bump with sigma=6 nm I obtained a bending energy of 0.6 kappa, so certainly in the ballpark with what they are reporting but significantly lower (compared to 2 kappa, Figure 5 lower left). It would be simpler to use the Gaussian approximation to their curves in Figure 3 - and I would argue more accurate, especially since they have not reported the variation of the membrane energy with respect to the discretization size and so I cannot judge the dependence of the energy on discretization. I view reporting the variation of the membrane energy with respect to discretization as being essential for the analysis if their goal is to provide a quantitative estimate for the force of Piezo. The Helfrich energy computed from an analytical model with a membrane shape closely resembling the simulated shapes would be very helpful. According to my intuition, finite-difference estimates of curvatures will tend to be overestimates of the true membrane deformation energy because white noise tends to lead to high curvature at short-length scales, which is strongly penalized by the bending energy. 

      Instead of Gaussian bumps, we now also calculate the membrane bending energy from the two-dimensional, continuous mean curvature profiles (see Fig. 1c). These mean curvature profiles are highly smoothened (see figure caption for details). Nonetheless, we obtain essentially the same bending energies as in our discrete calculations of averaged, smoothened three-dimensional membrane shapes, see new text on lines 326ff. We believe that this agreement corroborates our bending energy calculations. We still focus on values obtained for three-dimensional membrane shapes, because of incomplete rotational symmetry. The three-dimensional membrane shapes exhibit variations with the three-fold symmetry of the Piezo proteins, see Figure 2a and b.

      We agree that the bending energy of thermally rough membranes depends on the discretization scheme, because the discretization length of any discretization scheme leads to a cut-off length for fluctuation modes in a Fourier analysis. But again, we average out the thermal noise, for reasons given in the Results section, and analyse smooth membrane shapes.  
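The discretization question in this exchange can be illustrated with a minimal sketch. It is not the authors' analysis code and it does not reproduce their Eqs. 7 to 9 or the reviewer's calculation. It assumes an idealized, noise-free Gaussian bump, the small-gradient (linearized) form of the Helfrich bending energy, E = (kappa/2) ∫ (∇²z)² dA, and a standard five-point Laplacian. The bump height `H0 = 3 nm` is a hypothetical value chosen for illustration; only sigma = 6 nm comes from the reviewer's comment. Under these assumptions the continuum result is E = pi * kappa * H0² / sigma², so the finite-difference values for a = 1 nm and a = 2 nm can be compared against it.

```python
import numpy as np


def gaussian_bump(h0, sigma, a, half_width):
    """Height field z(x, y) of a Gaussian bump on a square lattice with spacing a (nm)."""
    n = int(round(half_width / a))
    coords = np.arange(-n, n + 1) * a
    x, y = np.meshgrid(coords, coords, indexing="ij")
    return h0 * np.exp(-(x**2 + y**2) / (2.0 * sigma**2))


def bending_energy_over_kappa(z, a):
    """Small-gradient Helfrich energy in units of kappa: 0.5 * sum((lap z)^2) * a^2."""
    # Five-point finite-difference Laplacian on the interior lattice sites.
    lap = (z[2:, 1:-1] + z[:-2, 1:-1] + z[1:-1, 2:] + z[1:-1, :-2]
           - 4.0 * z[1:-1, 1:-1]) / a**2
    return 0.5 * np.sum(lap**2) * a**2


def excess_area(z, a):
    """Excess area (nm^2) in Monge parametrization: sum(sqrt(1 + |grad z|^2) - 1) * a^2."""
    zx, zy = np.gradient(z, a)  # central differences in the interior
    return np.sum(np.sqrt(1.0 + zx**2 + zy**2) - 1.0) * a**2


H0 = 3.0     # nm, hypothetical bump height (assumption, not from the source)
SIGMA = 6.0  # nm, width quoted in the reviewer's comment

# Continuum small-gradient result for a Gaussian bump, in units of kappa.
analytic_energy = np.pi * H0**2 / SIGMA**2

results = {}
for a in (1.0, 2.0):
    z = gaussian_bump(H0, SIGMA, a, half_width=5.0 * SIGMA)
    results[a] = (bending_energy_over_kappa(z, a), excess_area(z, a))
    print(f"a = {a:.0f} nm: E/kappa = {results[a][0]:.3f}, "
          f"excess area = {results[a][1]:.2f} nm^2")
print(f"continuum small-gradient E/kappa = {analytic_energy:.3f}")
```

For a smooth shape like this one, coarsening the lattice lowers both estimates, because the finite-difference operators underestimate derivatives of short-wavelength components. For thermally rough shapes the dependence would instead be dominated by the cut-off of fluctuation modes, which is the reviewer's concern.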

      The fitting of the system deformation to the inverse time appears to be incredibly ad hoc ... Nor is it clear that the quantified model will be substantially changed without extrapolation. The authors should either justify the extrapolation more clearly (sorry if I missed it!) or also report the unextrapolated numbers alongside the extrapolated ones. 

      We report the values of the excess area and bending energy in the different time intervals of our analysis as data points in Fig. 4 with supplement. We find it important to report the time dependence of these quantities, because the intended equilibration of the membrane shapes in our simulations is not “complete” within a certain time window of the simulations. So, just “cutting” the first 20 and 50% of the simulation trajectories, and analysing the remaining parts as “equilibrated” does not seem to be a reasonable choice here, at least for the membrane properties, i.e. for the excess area and bending energy. We agree that the linear extrapolation used in our analysis is a matter of choice. At least for the coarse-grained simulations, the extrapolated values of excess areas and bending energies are rather close to the values obtained in the last time windows (see Figure 4). 
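The extrapolation discussed here can be sketched as a straight-line fit of interval values against inverse time, with the intercept at 1/t → 0 taken as the extrapolated value. This is a minimal sketch under stated assumptions: the time assigned to each interval is taken to be the interval midpoint, and the data are synthetic values generated for illustration. Neither choice is confirmed by the source, and the function name `extrapolate_inverse_time` is hypothetical.

```python
import numpy as np


def extrapolate_inverse_time(t, y):
    """Fit y = y_inf + c / t by linear least squares in 1/t.

    Returns (y_inf, c), where y_inf is the intercept at 1/t -> 0.
    """
    inv_t = 1.0 / np.asarray(t, dtype=float)
    slope, intercept = np.polyfit(inv_t, np.asarray(y, dtype=float), 1)
    return intercept, slope


# Five equal time intervals of a trajectory (total length is an illustrative value).
total_time = 8.0  # microseconds
edges = np.linspace(0.0, total_time, 6)
t_mid = 0.5 * (edges[:-1] + edges[1:])  # assumption: use interval midpoints

# Synthetic interval averages of an observable that relaxes as 1/t (not real data).
y_inf_true, c_true = 40.0, 12.0
y_intervals = y_inf_true + c_true / t_mid

y_inf, c = extrapolate_inverse_time(t_mid, y_intervals)
print(f"last interval value: {y_intervals[-1]:.2f}")
print(f"extrapolated value:  {y_inf:.2f}")
```

Reporting the value from the last interval next to the intercept, as the reviewer requests, shows how much of the final estimate comes from the extrapolation itself.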

      In summary, this paper uses molecular dynamics simulations to quantify the force of the Piezo 1 and Piezo 2 proteins on a lipid bilayer using simulations under controlled tension, observing the membrane deformation, and using that data to infer protein mechanics. While much of the physical mechanism was previously known, the study itself is a valuable quantification. I identified one issue in the membrane deformation energy analysis that has large quantitative repercussions for the extracted model. 

      Reviewer #2 (Public review): 

      Summary: 

      In this study, the authors suggest that the structure of Piezo2 in a tensionless simulation is flatter compared to the electron microscopy structure. This is an interesting observation and highlights the fact that the membrane environment is important for Piezo2 curvature. Additionally, the authors calculate the excess area of Piezo2 and Piezo1, suggesting that it is significantly smaller compared to the area calculated using the EM structure or simulations with restrained Piezo2. Finally, the authors propose an elastic model for Piezo proteins. Those are very important findings, which would be of interest to the mechanobiology field. 

      Whilst I like the suggestion that the membrane environment will change Piezo2 flatness, could this be happening because of the lower resolution of the MARTINI simulations? In other words, would it be possible that MARTINI is not able to model such curvature due to its lower resolution? 

      Related to my comment above, the authors say that they only restrained the secondary structure using an elastic network model. Whilst I understand why they did this, Piezo proteins are relatively large. How can the authors know that this type of elastic network model restraints, combined with the fact that MARTINI simulations are perhaps not very accurate in predicting protein conformations, can accurately represent the changes that happen within the Piezo channel during membrane tension? 

      These questions regarding the reliability of the Martini model are very reasonable and are the reason why we also include results from atomistic simulations, at least for Piezo 2, and compare the results. In the Martini model, secondary structure constraints are standard. In addition, constraints on the tertiary structure (e.g. via an elastic network model) are also typically used in simulations of soluble, globular proteins. However, such tertiary constraints would make it impossible to simulate the tension-induced flattening of the Piezo proteins. So instead, as we write on lines 427ff, “we relied on the capabilities of the Martini coarse-grained force field for modeling membrane systems with TM helix assemblies (Sharma and Juffer, 2013; Chavent et al., 2014; Majumder and Straub, 2021).” In these references, Martini simulations were used to study the assembly of transmembrane helices, leading to agreement with experimentally observed structures. As we state in our article, our atomistic simulations corroborate the Martini simulations, with the caveats that are now more extensively discussed in the new last paragraph of the Discussion section starting on line 362.

      Modelling of Piezo1 seems to be based on homology to Piezo2. However, the authors need to further evaluate their model, e.g. how it compares with an Alphafold model. 

      We understand the question, but see it beyond the scope of our article, also because of the computational demand of the simulations. The question is: Do coarse-grained simulations of Piezo1 based on an Alphafold model as starting structure lead to different results? It is important to note that we only model the rather flexible 12 TM helices at the outer ends of the Piezo 1 monomers via homology modeling to the Piezo 2 structure, which includes these TM helices. For the inner 26 TM helices, including the channel, we use the high-quality cryo-EM structure of Piezo 1. Alphafold may be an alternative for modeling the outer 12 helices, but we don’t think this would lead to statistically significant differences in simulations – e.g. because of the observed overall agreement of membrane shapes in all our Piezo 1 and Piezo 2 simulation systems.

      To calculate the tension-induced flattening of the Piezo channel, the authors "divide all simulation trajectories into 5 equal intervals and determine the nanodome shape in each interval by averaging over the conformations of all independent simulation runs in this interval.". However, probably the change in the flattening of Piezo channel happens very quickly during the simulations, possibly within the same interval. Is this the case? and if yes does this affect their calculations? 

      Unfortunately, the flattening is not sufficiently quick, so is not complete within the first time windows, see data points in Figure 4. We therefore report the time dependence with the plots in Figure 4 and extrapolate, see also our response above to reviewer 1.

      Finally, the authors use a specific lipid composition, which is asymmetric. Is it possible that the asymmetry of the membrane causes some of the changes in the curvature that they observe? Perhaps more controls, e.g. with a symmetric POPC bilayer are needed to identify whether membrane asymmetry plays a role in the membrane curvature they observe. 

      Because of the rather high computational demands, such controls are beyond our scope. We don’t expect statistically significant differences for symmetric POPC/cholesterol bilayers. On lines 229ff, we now state:

      “Our modelling assumes that any spontaneous curvature from asymmetries in the lipid composition is small compared to the curvature of the nanodome and, thus, negligible, which is plausible for the rather slight lipid asymmetry of our simulated membranes (see Methods).”

      Reviewer #3 (Public review): 

      Strengths: 

      This work focuses on a problem of deep significance: quantifying the structure-tension relationship and underlying mechanism for the mechanosensitive Piezo 1 and 2 channels. This objective presents a few technical challenges for molecular dynamics simulations, due to the relatively large size of each membrane-protein system. Nonetheless, the technical approach chosen is based on the methodology that is, in principle, established and widely accessible. Therefore, another group of practitioners would likely be able to reproduce these findings with reasonable effort. 

      Weaknesses: 

      The two main results of this paper are (1) that both channels exhibit a flatter structure compared to cryo-EM measurements, and (2) their estimated force vs. displacement relationship. Although the former correlates at least quantitatively with prior experimental work, the latter relies exclusively on simulation results and model parameters. 

      Below is a summary of the key points we recommend addressing in a revised version of the manuscript: 

      (1) The authors should report and discuss controls for the membrane energy calculations, specifically by increasing the density of the discretization graining. We also suggest validating the bending modulus used in the energy calculations for the specific lipid mixture employed in the study. 

      We have addressed both points, see our response to the reviewer’s comments for further details.

      (2) The authors should consider and discuss the potential limitations of the coarse-grained simulation force field and clarify how atomistic simulations validate the reported results, with a more detailed explanation of the potential interdependencies between the two. 

      We now discuss the caveats in the comparison of coarse-grained and atomistic simulations in more detail in a new paragraph starting on line 362.

      (3) The authors should provide further clarification on other points raised in the reviewers' comments, for instance, the potential role of membrane asymmetry. 

      We have done this – see above. We now further explain on lines 437ff why we use an asymmetric membrane. On lines 230ff, we discuss that any spontaneous membrane curvature due to lipid asymmetry is likely small compared to the nanodome curvature and, thus, negligible.

      Reviewer #1 (Recommendations for the authors): 

      (1) Report discretization dependence of the membrane energy (up to double the density of the current discretization graining). 

      We have added several text pieces in the paragraph “Excess area and bending energy” starting on line 583 in which we state how the results depend on the lattice constant a of the calculations.

      (2) Evaluate an analytical energy of a membrane bump with a shape similar to the simulation. This would be free of all sampling and discretization artifacts and would thus be an excellent lower bound of the energy. 

      We have done this for the curvature profile in Figure 1c and corresponding curvature profiles of the shape profiles in Figure 2d, see new text on lines 326ff.

      Minor: 

      (1)  The lipid density (Figure 1 right, 2c, 3c) is not interesting nor is it referred to. It can be dropped. 

      We think the lipid density maps are important for two reasons: First, they show the protein shape obtained after averaging conformations, as low-lipid-density regions. Second, the lipid densities are used in the calculation of the bending energies, to limit the bending energy calculations to the membrane in the nanodome, see Eq. 9. We therefore prefer to keep them.

      (2) Figure 7 is attractive but not used in a meaningful way. I suggest inserting the protein graphic from Figure 7 into Figure 1 with the 4-helix bundles numbered alongside the structure. Figure 7 could then be dropped. 

      Figure 7 is a figure of the Methods section. We need it to illustrate and explain aspects of the setup (numbering of helices, missing loops) and analysis (numbering scheme of 4-TM helix units).

      (3) Some editing of the use of the English language would be helpful. "Exemplary" is a bit of a funny word choice, it implies that the conformation is excellent, and not simply representative. I'd suggest "Representative conformation". 

      We agree and have replaced “exemplary” by “representative”.

      (4) Typos: 

      Equation 4 - Missing parentheses before squared operator inside the square root. 

      We have corrected this mistake.

      Reviewer #2 (Recommendations for the authors): 

      This study focuses mainly on Piezo2; the authors do not perform any atomistic simulations of Piezo1, and the coarse-grained simulations for Piezo1 are shorter. As a result, their analysis for Piezo2 seems more complete. It would be good if the authors did similar studies with Piezo1 as with Piezo2. 

      We agree that atomistic simulations of Piezo 1 would be interesting, too. However, because the atomistic simulations are particularly demanding, this is beyond our scope.

      Reviewer #3 (Recommendations for the authors): 

      (1) At line 63, a very large tension from the previous work by De Vecchis et al is reported (68 mN/m). The authors are sampling values up to about 21 mN/m, which is considerably smaller. However, these values greatly exceed what typical lipid membranes can sustain (about 10 mN/m) before rupturing. When mentioning these large tensions, the authors should emphasize that these values are not physiologically significant, because they would rupture most plasma membranes. That said, their use in simulation could be justified to magnify the structural changes compared to experiments. 

      We agree that our largest membrane tension values are unphysiological. However, we see a main novelty and relevance of our simulations in the fact that we obtain a response of the nanodome in the physiological range of membrane tensions, see e.g. the 3rd sentence of the abstract. Yes, we include simulations at tensions of 21 mN/m, but most of our simulated tension values are in the range from 0 to 10 mN/m (see e.g. Fig. 3e), in contrast to previous simulation studies.

      (2) At line 78 and in the Methods, only the reference paper is for the CHARMM protein force field, but not for the lipid force field. 

      We have added the reference Klauda et al., 2010 for the CHARMM36 lipid force field in both spots.

      (3) (Line 83) Acknowledging that the authors needed to use the structure from micelles (because it has atomic resolution), how closely do their relaxed Piezo structures compare with the lower-resolution data from the MacKinnon and Patapoutian papers? 

      There are no structures reported in these papers to compare with, only a clear flattening as stated.  

      (4) (Line 99) The authors chose a slightly asymmetric lipid membrane composition to capture some specific plasma-membrane features. However, they do not discuss which features are described by this particular composition, which doesn't include different acyl-chain unsaturations between leaflets. Further, they do not seem to comment on whether there is enrichment of certain lipid species coupled to curvature, or whether there is any "scrambling" occurring when the dome section and the planar membrane are stitched together in the preparation phase (Figure 8). 

      Enrichment of lipids in contact with the protein is addressed in the reference Buyan et al., 2020, based on Martini simulations with Piezo 1. We have a different focus, but still wanted to keep an asymmetric membrane as in essentially all previous simulation studies as now stated also on lines 439ff, to mimic the native Piezo membrane environment. There is no apparent “scrambling” in the setup of our membrane systems. We also did not explore any coupling between curvature and lipid composition, but will publish the simulation trajectories to enable such studies.  

      (5) (Caption of Figure 2). Please comment briefly in the text why the tensionless simulation required a longer simulation run (e.g. larger fluctuations?) 

      We added as explanation on line 500: “ … to explore the role of the long-range shape fluctuations in tensionless membranes for the relaxation into equilibrium”. The relaxation time of membrane shape fluctuations strongly increases with the wavelength, which is only limited by the simulation box size in the absence of tensions. However, also for 8 microsecond trajectories, we do not observe complete equilibration and therefore decided to extrapolate the excess area and bending energy values obtained for different time intervals of the trajectories.

      (6) (Caption of Figure 3). Please clarify in the Methods how the atomistic simulations were initialized: were they taken from independent CG simulation snapshots? If not, the use of the adjective "independent" would be questionable given the very short atomistic simulation time length. 

      We now added that the production simulations started from the same structure. On lines 386, we now discuss the starting structure of the atomistic simulations in more detail.

      (7) (Line 202). The approach of discretizing the bilayer shape is reasonable, but no justification was provided for the 1-nm grid spacing. In my opinion, there should be a supporting figure showing how the bending energy varies with the grid spacing. 

      We now report also the effect of a 2-nm grid spacing on the results, see new text passages on page 18, and provide an explanation for the smaller 1-nm grid spacing on lines 587ff, where we write:

      “This lattice constant [a = 1 nm] is chosen to be smaller than the bin width of about 2nm used in determining the membrane shape of the simulation conformations, to take into account that the averaging of these membrane shapes can lead to a higher resolution compared to the 2 nm resolution of the individual membrane shapes.”

      (8) (Line 211). The choice by the authors to use a mixed lipid composition complicates the task of defining a reasonable bending modulus. Experimentally and in atomistic simulations, lipids with one saturated tail (like POPC or SOPC) are much stiffer when they are mixed with cholesterol (https://doi.org/10.1529/biophysj.105.067652, https://doi.org/10.1103/PhysRevE.80.021931, https://doi.org/10.1093/pnasnexus/pgad269). On the other hand, MARTINI seems to predict a slight *softening* for POPC mixed with cholesterol (https://doi.org/10.1038/s41467-023-43892-x). Further complicating this matter, mixtures of phospholipids with different preferred curvatures are predicted to be softer than pure bilayers (e.g. https://doi.org/10.1021/acs.jpcb.3c08117), but asymmetric bilayers are stiffer than symmetric ones in some circumstances (https://doi.org/10.1016/j.bpj.2019.11.3398). 

      This issue can be quite thorny: therefore, my recommendation would be to either: (a) directly compute k for their lipid composition, which is straightforward when using large CG bilayers (as was done in Fowler et al, 2016), but it would also require more advanced methods for the atomistic ones; (b) use a reasonable *experimental* value for k, based on a similar enough lipid composition. 

      We now justify in somewhat more detail why we use an asymmetric membrane, but agree that this complicates the bending energy estimates. We only aim to estimate the bending energy in the Martini 2.2 force field, because our elasticity model is based on and, thus, limited to results obtained with this force field. We have included the two further references using the Martini 2.2 force field suggested by the reviewer on line 213, and discuss now in more detail how the bending rigidity estimate enters and affects the modeling, see lines 226ff.

      (9) (Line 224). Does this closing statement imply that all experimental work from ex-vivo samples describe Piezo states under some small but measurable tension? 

      We compare here to the cryo-EM structure in detergent micelles. So, there is no membrane tension, there may be a surface tension of the micelle, but we assume here that Piezo proteins are essentially force free in detergent micelles. Membrane embedding, in contrast, leads to strong forces on Piezo proteins already in the absence of membrane tension, because of the membrane bending energy.

      (10) (Line 304). The Discussion concludes with a reasonable point, albeit on a down note: could the authors elaborate on what kind of experimental approach may be able to verify their modeling results? 

      Very good question, but this is somewhat beyond our expertise. We don’t have a clear recommendation – it is complicated. What can be verified is the flattening, i.e. the height and curvature of the nanodome in lower-resolution experiments. We see our results in line with these experiments, see Introduction. 

      (11) (Line 331). The very title of the Majumder and Straub paper addresses the problem of excessive binding strength between protein beads in the MARTINI force field, which should be mentioned. Figure 3(d) shows that the atomistic systems have larger excess areas than the CG ones. This could be related to MARTINI's "stickiness", or just statistical sampling. Characterizing the grid spacing (see point 7 above) might help illuminate this. 

      We discuss now the larger excess area values of the atomistic simulations on lines 381ff.  

      (12) (Lines 367, 375). Are the harmonic restraints absolute position restraints or additional bonds?

      Note also that the schedule at which the restraints are released (10-ns intervals) is relatively quick. Does the membrane have enough time to equilibrate the number of lipids in each leaflet? 

      These are standard, absolute position restraints. The 10-ns intervals may be too short to fully equilibrate the numbers of lipids, we have not explored this. The main point in the setup was to have a reasonable TM helix embedding with a smooth membrane, without any rupturing. This turned out to be tricky, with the procedures illustrated in Figure 8 as solution. If the membrane is smooth, the lipid numbers quickly equilibrate either in the final relaxation or in the initial nanoseconds of the production runs.

      (13) (Line 387) The use of an isotropic barostat for equilibration further impedes the system's ability to relax its structure. I feel that the authors should validate more strongly their protocol to rule out the possibility that incomplete equilibration could bias dynamics towards flatter membranes, which is one of the main results of this paper. 

      We don’t see how choices in the initial relaxation steps could have affected our results, at least for the coarse-grained simulations. There is more and more flattening throughout all simulation trajectories, see e.g. the extrapolations in Figure 4. All initial simulation structures are significantly less flattened than the final structures in the production runs.

      (14) (Line 403). What is the protocol for reducing the membrane size for atomistic simulation? This is even more important to mention than for CG simulations. 

      We just cut lipids beyond the intended box size of the atomistic simulations. As a technical point, we now have also added on line 507 how PIP2 lipids were converted.

      (15) (Line 423). The CHARMM force field requires a cut-off distance of 12 Å for van der Waals forces, with a force-based continuous switching scheme. The authors should briefly comment on this deviation and its possible impact on membrane properties. Quick test simulations of very small atomistic bilayers with the chosen composition could be used as a comparison. 

      We don’t expect any relevant effect on membrane properties within the statistical accuracies of the quantities of interest here (i.e. excess areas).

      (16) (Equation 4). There are some mismatched parentheses: please check. 

      We have corrected this mistake.

      (17) (Equations 7-8). Why did the authors use finite-differences derivatives of z(x,y) instead of using cubic splines and the corresponding analytical derivatives? 

      In our experience, second derivatives of standard cubic splines can be problematic. The continuous membrane shapes we obtain in our analysis are averages of such splines. We find standard finite differences more reliable, and therefore discretize these shapes. Already for the 2d membrane profiles of Figure 1b and 2d, calculating curvatures from interpolations using splines is problematic.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary: As TDP-43 mislocalization is a hallmark of multiple neurodegenerative diseases, the authors seek to identify pathways that modulate TDP-43 levels. To do this, they use a FACS based genome wide CRISPR KD screen in a Halo tagged TDP-43 KI iPSC line. Their screen identifies a number of genetic modulators of TDP-43 expression including BORC which plays a role in lysosome transport.

      Strengths:

      Genome wide CRISPR based screen identifies a number of modulators of TDP-43 expression to generate hypotheses regarding RNA BP regulation and perhaps insights into disease.

      Weaknesses:

      It is unclear how altering TDP-43 levels may relate to disease where TDP-43 is not altered in expression but mislocalized. This is a solid cell biology study, but the relation to disease is not clear without providing evidence of BORC alterations in disease or manipulation of BORC reversing TDP-43 pathology in disease.

      We thank the reviewer for this comment and have expanded the discussion of the role TDP-43 may play in the BORCS8-associated neurodegenerative disorder and of how understanding the effect of lysosome localization on TDP-43 levels may help patients (lines 313-321).

      The mechanisms by which BORC and lysosome transport modulate TDP-43 expression are unclear. Presumably, this may be through altered degradation of TDP protein but this is not addressed.

      We agree with the reviewer that understanding the mechanism by which lysosome transport regulates TDP-43 levels is important and plan to examine this in future studies.

      Previous studies have demonstrated that TDP-43 levels can be modulated by altering lysosomal degradation so the identification of lysosomal pathways is not particularly novel.

      We thank the reviewer for this comment and have updated the text to make this clearer (lines 310-313). What hasn’t been observed previously is a change in lysosome localization affecting TDP-43 levels.

      It is unclear whether this finding is specific to TDP-43 levels or whether lysosome localization may more broadly impact proteostasis in particular of other RNA BPs linked to disease.

      We agree that this is an interesting question and something that should be investigated in future studies.

      Unclear whether BORC depletion alters lysosome function or simply localization.

      We thank the reviewer for this comment. Lysosome function related to protein turnover has not yet been examined in the literature after loss of BORC, but other aspects of lysosome function (including lipid metabolism and autophagic flux) have been shown to be disrupted upon loss of BORC. We have updated the discussion to address this (lines 292-296).

      Reviewer #2 (Public review):

      Summary: The authors employ a novel CRISPRi FACS screen and uncover the lysosomal transport complex BORC as a regulator of TDP-43 protein levels in iNeurons. They also find that BORC subunit knockouts impair lysosomal function, leading to slower protein turnover and implicating lysosomal activity in the regulation of TDP-43 levels. This is highly significant for the field given that a) other proteins could also be regulated in this way, b) understanding mechanisms that influence TDP-43 levels are significant given that its dysregulation is considered a major driver of several neurodegenerative diseases and c) the novelty of the proposed mechanism.

      Strengths:

      The novelty and information provided by the CRISPRi screen. The authors provide evidence indicating that BORC subunit knockouts impair lysosomal function, leading to slower protein turnover and implicating lysosomal activity in the regulation of TDP-43 levels and show a mechanistic link between lysosome mislocalization and TDP-43 dysregulation. The study highlights the importance of localized lysosome activity in axons and suggests that lysosomal dysfunction could drive TDP-43 pathologies associated with neurodegenerative diseases like FTD/ALS. Further, the methods and concepts will have an impact to the larger community as well. The work also sets up for further work to understand the somewhat paradoxical findings that even though the tagged TDP-43 protein is reduced in the screen, it does not alter cryptic exon splicing and there is a longer TDP-43 half-life with BORC KD.

      Weaknesses:

      While the data is very strong, the work requires some additional clarification.

      We thank the reviewer for these comments. Our detailed responses are included below in the “recommendations for authors” section.

      Reviewer #3 (Public review):

      Summary: In this work, Ryan et al. have performed a state-of-the-art full genome CRISPR-based screen of iNeurons expressing a tagged version of TDP-43 in order to determine expression modifiers of this protein. Unexpectedly, using this approach the authors have uncovered a previously undescribed role of the BORC complex in affecting the levels of TDP-43 protein, but not mRNA expression. Taken together, these findings represent a very solid piece of work that will certainly be important for the field.

      Strengths:

      BORC is a novel TDP-43 expression modifier that has never been described before, and it seemingly acts by regulating protein half-life rather than at the transcriptome level. It has long been known that different labs have reported different half-lives for TDP-43 depending on the experimental system, but no work has ever explained these discrepancies. Now, the work of Ryan et al. has for the first time identified one of the factors that could account for these differences and play an important role in disease (although this is left to be determined in future studies).

      The genome-wide CRISPR screening has been shown to yield novel results with high reproducibility and could eventually be used to search for expression modifiers of many other proteins involved in neurodegeneration or other diseases.

      Weaknesses:

      The conclusion that TDP-43 mRNA does not change following BORCS6 KD is based on a single qRT-PCR that does not really cover all possibilities. For example, the total mRNA levels may not change, but the polyA sites may have switched from the highly efficient pA1 to the less efficient and nuclear-retained pA4. There are therefore a few other experiments that could have been performed to make this conclusion more compelling, for example RNAscope experiments to make sure that no change occurred in TDP-43 mRNA localisation in cells.

      We thank the reviewer for this comment. To address this point, we performed an analysis of polyA sites on our RNA sequencing data using REPAC and did not find a change in TDP-43 polyadenylation after BORC KD (Figure S6C). Other transcripts do have altered polyA sites, which are summarized in Figure S6C. We also performed HCR FISH for TARDBP mRNA in TDP-43 and BORC KD neurons. While we did not see a difference in RNA localization (see A below, numbers on brackets indicate p-values), we also were not able to detect a significant difference in total TARDBP mRNA levels upon TDP-43 KD (see B below, numbers on brackets indicate p-values), suggesting that some of the signal detected is non-specific to TARDBP. Because of this, we cannot conclusively say that BORC KD does not alter TARDBP mRNA localization using the available tools.

      Author response image 1.

      Even assuming that the mRNA does not change, no explanation for the change in TDP-43 protein half-life has been proposed by the authors. This will presumably be addressed in future studies: for example, are mutants that lack different domains of TDP-43 equally affected in their half-lives by BORC KD? Alternatively, can mass spectrometry be attempted to see whether TDP-43 PTMs change following BORCS6 KD?

      We agree with the reviewer that these are important experiments that could be done in the future to further examine the mechanism by which loss of BORC alters TDP-43 half-life. We examined our proteomics data for differential phosphorylation and ubiquitination in NT vs BORC KD (Figure S7G-H). We were unable to detect PTMs on TDP-43, so we cannot say if they contribute to the change in TDP-43 half-life we observed.

      Reviewer #1 (Recommendations for the authors):

      Recommendations are detailed in the public review.

      Reviewer #2 (Recommendations for the authors):

      Ryan et al, employ a CRISPRi FACS screen and uncover the lysosomal transport complex BORC as a regulator of TDP-43 protein levels in iNeurons. The authors provide strong evidence indicating that BORC subunit knockouts impair lysosomal function, leading to slower protein turnover and implicating lysosomal activity in the regulation of TDP-43 levels. The authors then provided additional evidence of TDP-43 perturbations under lysosome-inhibiting drug conditions, underscoring a mechanistic link between lysosome mislocalization and TDP-43 dysregulation. The study highlights the importance of localized lysosome activity in axons and suggests that lysosomal dysfunction could drive TDP-43 pathologies associated with neurodegenerative diseases like FTD/ALS. The work is exciting and could be highly informative for the field.

      Concerns: There are some disconnects between the figures and the main text that can benefit from refining of the figures to align better with the main text. This does not require additional experiments other than perhaps Figure 4B. The impact of the work could be further discussed - it is an interesting disconnect between the fact BORC KD causes decreased IF of the Halo-tagged TDP-43 and lysosomal transport, however this reduction does not impact cryptic exon expression and also increases TDP-43 half life (and of other proteins). It is a very interesting and potentially informative part of the manuscript.

      We thank the reviewer for their detailed reading of our manuscript. We have endeavored to better match the figures and the text and have added more discussion of the impact of the work.

      Minor:

      (1) Suggestion: relating to the statement "Gene editing was efficient, with almost all selected clones correctly edited." - please provide values or %.

      We updated the text to remove the statement about the editing efficiency, instead saying we identified a clone that was correct for both sequence and karyotype (lines 83-85).

      (2) Relating to Figure 1A: Please provide clarification regarding tagging strategy with the halotag - e.g. why in front of exon2.

      We updated the figure legend to reflect that the start codon for TDP-43 is in exon 2, hence why we placed the HaloTag there.

      (3) Relating to Figure S1: A and B seems to have been swapped.

      We thank the reviewer for catching this mistake and have fixed the figure/text.

      (4) Relating to Figure 1B: figure legend does not indicate grayscale coloring of TDP-43 signal.

      We have added text in the figure legend to indicate that the Halo signal is shown in grayscale in the left-hand panels.

      (5) Relating to Figure 1C: can the authors clarify abbreviation for 'NT' in text and legend.

      We thank the reviewer for catching this and have indicated in the text and figure legend that NT refers to the non-targeting sgRNA that was used as a control for comparison to the TDP-43 KD sgRNA.

      (6) Relating to figure 2B and S2A: main text mentioned "Non-targeting Guides" however the figure does not show non-targeting guides to confirm.

      We thank the reviewer for catching this oversight. We updated the legends for these figures to indicate that the non-targeting (NT) guides are shown in gray on the rank plot. They cluster towards the middle, more horizontal portion of the graphs, showing that the more vertical sections of the graph are hits.

      (7) Suggestion: To make it easier on the reader, please provide overlap numbers for the following statement ..."In comparing the top GO terms associated with genes that increase or decrease Halo-TDP-43 levels in iNeurons, we found that almost none altered Halo-TDP-43 levels in iPSCs...".

      We thank the reviewer for this comment and have updated the text to indicate that only a single term is shared between the iPSC and iNeuron screens (lines 113-117).

      (8) Relating to the statement "We cloned single sgRNA plasmids for 59 genes that either increased or decreased Halo-TDP-43 in iNeurons but not in iPSCs." Can the authors provide a list of the 59 genes.

      We have included a new column in the supplemental table S1 indicating the result of the Halo microscopy validation to hopefully clarify which genes lead to a validated phenotype and which did not.

      (9) Relating to the statement "To rule out the possibility of neighboring gene or off-target effects of CRISPRi, as has been reported previously15, we examined the impact of BORC knockout (KO) on TDP-43 levels. Using the pLentiCRISPR system, which expresses the sgRNA of interest on the same plasmid as an active Cas916 we found that KO of BORCS7 using two different sgRNAs decreased TDP-43 levels by immunofluorescence (Figure 5C-D)." Please provide clarification as to why BORCS7 was chosen out of all the BORCS? From the data presentation thus far (Figure 4B & 5A), the reader might have anticipated testing BORCS6 for panels 5C-D.

      We thank the reviewer for this comment. We tried a couple of BORCs with the pLentiCRISPR system, but BORCS7 was the only one we were convinced we got functional knockout for based on lysosome localization. We think that either the guides were not ideal for the other BORC components we tried, or we did not get efficient gene editing across the population of cells tested. Because we had previously been working with knock down and CRISPRi guides are not the same as CRISPR knock out guides, we couldn’t use the existing guide sequences we know work well for BORC. Since loss of one BORC gene causes functional loss of the complex and restricts lysosomes to the soma, we did not feel it necessary to assay all 8 genes.

      (10) Relating to the statement "We treated Halo-TDP-43 neurons with various drugs that disrupt distinct processes in the lysosome pathway and asked if Halo-TDP-43 levels changed. Chloroquine (decreases lysosomal acidity), CTSBI (inhibits cathepsin B protease), ammonium chloride (NH4Cl, inhibits lysosome-phagosome fusion), and GPN (ruptures lysosomal membranes) all consistently decreased Halo-TDP-43 levels (Figure 6A-B, S5A-C)" Please provide interpretations for Figures S5A and S5C in text.

      We thank the reviewer for catching this oversight and have updated the text accordingly (lines 183-191).

      (11) Relating to figure 6E: please provide in legend what the different colors used correlate with (i.e. green/brown for BORCS7 KD)?

      We thank the reviewer for pointing this out. These colors were mistakenly left in the figure from a version looking to see if the observed effects were driven by a single replicate rather than a consistent change (each replicate has a slightly different color). As the colors are intermingled and not separated, we concluded the effect was not driven by a single replicate. The colors have been removed from the updated figure for simplicity.

      (12) Relating to the statement "We observed a similar trend for many proteins in the proteome (Figure 8B)" This statement can benefit from stating which trend the authors are referring to, it is currently unclear from the volcano plot shown for Figure 8B.

      We thank the reviewer for catching this and have updated the text accordingly.

      (13) Relating to the statement "For almost every gene, we observed an increase or decrease in Halo-TDP-43 levels without a change in Halo-TDP-43 localization or compartment specific level changes (Figure 4B)." Please provide: (1) the number of genes examined, (2) additional clarification of "localization" and "compartment specific" level changes, (3) some quantification and or additional supporting data of the imaging results. Figures 5A-B presents with the same concern relating to the comment "To determine if results from Halo-TDP-43 expression assays also applied to endogenous, untagged TDP-43 levels, we selected 22 genes that passed Halo validation and performed immunofluorescence microscopy for endogenous (untagged) TDP-43 (Figure 4D-G,5A-B, S4E-F)." please clarify further.

      We thank the reviewer for requesting this clarification. This statement refers to all 59 genes tested by Halo imaging; only one (MFN2) showed any hints of aggregation or changes in localization, while every other gene (58) showed what appeared to be global changes in Halo-TDP-43 levels. We were initially intrigued by the MFN2 phenotype; however, we were unable to replicate it on endogenous TDP-43 and thus concluded that this might be an effect specific to the tagged protein. The images shown in Figure 4B are representative of the changes we observed across all 59 genes tested (if changes were present). From the genes for which we observed a change in Halo-TDP-43 levels by microscopy, we selected a smaller number to move forward to immunofluorescence for TDP-43. We picked a subset of genes from each of the different categories we had identified (mitochondria, m6A, ubiquitination, and some miscellaneous) to validate by immunofluorescence, thinking that genes in the same pathway would act similarly. We have added a column to the supplemental table S1 indicating which genes were tested by immunofluorescence and what the result was. We have also attempted to clarify the results section to make the above clearer.

      (14) Relating to the statement "To determine if results from Halo-TDP-43 expression assays also applied to endogenous, untagged TDP-43 levels, we selected 22 genes that passed Halo validation and performed immunofluorescence microscopy for endogenous (untagged) TDP-43 (Figure 4D-G, 5A-B, S4E-F). Of these, 18 (82%) gene knockdowns showed changes in endogenous TDP-43 levels (Figure 4D-G, S4E-F)." It is difficult to identify the 18 or 22 genes in the figures as described in the main text.

      We added columns to the supplemental table S1 listing the genes and the result in each assay.

      (15) Relating to figures S7A and 8A and the first part of the section "TDP-43, like the proteome, shows longer turnover time in BORC KD neurons" Can the authors provide clarification why the SunTag assay was performed with BORCS6 KD (S7A) but the follow-up experiment (8A) was performed with BORCS7 KD. Does BORCS6 KD show similar results as BORCS7 with the SunTag assay, and does TDP-43 protein abundance with BORCS7 KD show similar results as BORCS6?

      Because loss of any of the 8 BORC genes causes functional loss of BORC and lysosomes to be restricted to the peri-nuclear space, we used BORC KDs interchangeably. Additionally, all BORC KDs had similar effects on Halo-TDP-43 levels.

      Reviewer #3 (Recommendations for the authors):

      Adding more control experiments that TDP-43 mRNA is really not affected following BORC KD

      We performed a FISH experiment to examine TARDBP mRNA localization upon BORC KD but were unable to conclusively say whether BORC KD changes TARDBP mRNA localization (see above). We also analyzed our RNA sequencing experiment for alternative polyadenylation sites upon BORC KD. Results are in Figure S6C.

      Although this could be part of a future study, the authors should try and determine what are the changes to TDP-43 that drive a change in the half-life.

      We agree with the reviewer that these are important experiments and hope to figure this out in the future.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer 1:

      I would suggest that the authors focus on what I think is the main goal of the work, namely, to consider the whole cell contour when characterizing cell shape instead of only some points on the contour. A reference to the connection with Minkowski tensors and the biologically relevant mathematical consequences of this connection would suffice; a detailed definition of the Minkowski tensors does not seem to be necessary. Especially because you do not really use them. You could use the analysis of the simulation data to explain what the γ<sub>p</sub> miss and for which statements they would be sufficient.

      We argue that the explanation of Minkowski tensors is helpful and should remain in the Methods and materials section. There are two reasons: First, our argumentation relies on the robustness and stability properties of Minkowski tensors. Introducing q<sub>p</sub> without the connection to Minkowski tensors would not allow us to make these statements. Second, Minkowski tensors seem not to be well known in the community, otherwise measures like γ<sub>p</sub> would not have been introduced. Furthermore, readers not interested in the technical details could skip this part of the manuscript and go directly to the Results section. Concerning the questions of what the γ<sub>p</sub> miss and for which statements they would be sufficient, the answer from a purely mathematical point of view is rather simple: As γ<sub>p</sub> does not share robustness and stability, it should not be used in any case! The provided results on computational and experimental data demonstrate the consequences of using such measures. In the case of the proposed nematic-hexatic transition in Armengol-Collado et al. (2023) the consequence is severe, as this transition is specific only to the method used but not to the underlying physics. A second aspect, which we now further highlight, is the influence of approximating a cell by a polygon. We demonstrate that this approximation is responsible for a strong hexatic order on the cellular scale in the considered MDCK data from Armengol-Collado et al. (2023).

      It is not clear to me what we should learn about the two tissue models by using q<sub>2</sub> and q<sub>6</sub> to quantify cell shape. Can you clearly formulate one or more conclusions?

      What we can learn from this research is how q<sub>p</sub> depends on the model parameters in the two tissue models:

      increases with higher activity or deformability

      decreases with higher activity or deformability.

      Furthermore, q<sub>2</sub> and q<sub>6</sub> are independent and describe distinct properties. Using these models as a basis to coarse-grain and derive continuous models on the tissue scale, these results indicate that more general p-atic liquid crystal theories should be used and the simplest nematic liquid crystal theories might not be sufficient.

      The experimental data and their analysis does not seem to add anything to the work. Do you report only data from independent measurements, or did you consider all images of a monolayer?

      We now also analyze experimental data from Armengol-Collado et al. (2023), which confirm our findings on the independence of q<sub>2</sub> and q<sub>6</sub> and also confirm that the proposed nematic-hexatic transition is specific to the use of γ<sub>p</sub> for characterizing the shape. Additional experimental data are therefore no longer needed. We thus skip the detailed analysis of these data and only keep the results in Fig 1 and Fig 2 and the corresponding figures in the appendix as illustrating examples.

      L13: ”P-atic liquid crystal theories offer new perspectives on how cells self-organize (...)” This is a difficult entry, because the average reader of eLife might not be familiar with p-atic liquid crystals.

      We agree that p-atic liquid crystals might not be familiar to the average reader. For this reason we introduce orientational order in the introduction with examples demonstrating that not only nematic, but also tetratic and hexatic order have been identified in tissue and introduce the different symmetries. Furthermore, we provide examples for p-atic liquid crystals from other fields and various references. In the conclusion, we also cite models for p-atic liquid crystal theories. Even if the average reader is not familiar with these theories, it should become evident that nematic order might not be sufficient to describe tissue as other symmetries are present as well.

      L32: ”nematic” needs to be introduced.

      Nematic order is already explained as rotational order with a 180° symmetry. The references cited discuss nematic liquid crystals in the context of morphological changes in tissue. We therefore only added a standard textbook as a reference for liquid crystal theories and refrain from introducing it in more detail in the manuscript.

      Figure 1: Why do you show the data for q<sub>3</sub>, q<sub>4</sub>, and q<sub>5</sub>, which you do not really consider in this manuscript? Same for Figure 2. Why not combine the two figures? Furthermore, you show q<sub>p</sub> without having defined them yet.

      We consider all p = 2,3,4,5,6, but focus on p = 2,6 in the main text and p = 3,4,5 in the appendix. Figures 1 and 2 essentially only introduce the subject and help to relate p-atic order to cell shapes and introduce the methodology to analyze the data. Our conclusion is that all p can be important and should be considered in continuous descriptions of tissue.

      Equation 1: The notation is confusing: the domain of integration (C or ∂C) also appears as the variable you integrate.

      The equation is correct. The variable of integration is 1 or H and the domain of integration is C (cell) or ∂C (cell contour).

      L68: ”a snapshot of the considered monolayer of wild-type MDCK cells”. Did you analyse only one monolayer? Please, provide information about the number of monolayers that were imaged and how many cell shapes were analyzed.

      We have analyzed one monolayer and have added the missing information.

      L86: ”field-specific prefactors” I do not understand what is meant by these.

      Different communities, e.g. physics, mathematics, cosmology, .... use different prefactors in the definition. We have removed this statement.

      L89: ”Hadwiger’s characterization theorem”. What is this?

      This mathematical result is important to claim robustness and stability, it can be found in the cited reference.

      L104: ”the essential property is the continuity”. Essential for what?

      Essential ”for our purpose” to characterize the shape of cells by a robust method.

      L120: ”the theory also guarantees robust description of p-atic orientation for p = 3,4,5,6,...” I do not understand what you mean.

      The previous examples only consider p = 2. However, the cited theoretical results also hold for p = 3,4,5,6,...

      Equations (5) and (6): You define ψ<sub>p</sub>(C) twice. Are the definitions equivalent? Why do you need both?

      This is not a different definition; equation (6) is a reformulation that is more useful for our purpose. But we indeed define ϑ<sub>p</sub> twice. We now use a new symbol to distinguish ϑ<sub>p</sub> in Equations 7 and 9.

      Figure 4: ”The visualization uses rotationally-symmetric direction fields (known as p-RoSy fields in computer graphics (Vaxman et al., 2016)).” I guess that you have used these fields already in Figure 1, so why introduce them only now?

      We have moved this comment to Figure 1.

      Figure 6: Using a few discrete values cannot illustrate continuity. Also, the ”jump” in γ<sub>p</sub> results from deleting a vertex, so I doubt that this is a fair comparison. Still, I think that it is important to point out to the reader that the value γ<sub>p</sub> depends on the number of vertices (here, I allow that two edges connected by a vertex are aligned).

      We adjusted the caption to make our point clearer. The last image is a triangle and, according to the definition of γ<sub>p</sub>, is therefore described by only three vertices. So, it is indeed a fair comparison. The reviewer is right that the value of γ<sub>p</sub> depends strongly on the number of vertices used; this is exactly the point that we are trying to make with this figure. Also, adding vertices artificially to make γ<sub>p</sub> continuous leads to more problems, as the values for γ<sub>p</sub> change if we change the number of vertices. But an equilateral triangle should be recognized as an equilateral triangle, no matter if there is an artificial fourth vertex or not. The triangle in our picture and the triangle that the reviewer mentioned (so our triangle with an artificial fourth vertex) both have the shape of an equilateral triangle, yet for one it is |γ<sub>3</sub>| = 1.0 and for the other one it is |γ<sub>3</sub>| = 0.935.

      While we agree with the reviewer's statement about continuity, we did not modify the sentence, as the meaning should be clear.
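
      The two values quoted above can be checked numerically. The following is a minimal sketch, assuming that γ<sub>p</sub> takes the vertex-based form Σ|r<sub>v</sub>|<sup>p</sup> e<sup>ipφ<sub>v</sub></sup> / Σ|r<sub>v</sub>|<sup>p</sup> with r<sub>v</sub> measured from the mean of the vertex positions, and that the artificial fourth vertex sits at the midpoint of one edge. Neither assumption is spelled out in this response, and the function name `gamma_p` is hypothetical.

```python
def gamma_p(vertices, p):
    """Vertex-based shape function (assumed form, see lead-in):
    gamma_p = sum_v |r_v|^p e^{i p phi_v} / sum_v |r_v|^p,
    with r_v measured from the mean of the vertex positions."""
    n = len(vertices)
    cx = sum(x for x, _ in vertices) / n
    cy = sum(y for _, y in vertices) / n
    rel = [complex(x - cx, y - cy) for x, y in vertices]
    # For z = |r| e^{i phi}, the term |r|^p e^{i p phi} is simply z**p.
    numerator = sum(z ** p for z in rel)
    denominator = sum(abs(z) ** p for z in rel)
    return numerator / denominator

s = 3 ** 0.5 / 2
# Equilateral triangle with unit circumradius.
triangle = [(0.0, 1.0), (-s, -0.5), (s, -0.5)]
# Same shape, with an artificial vertex at the midpoint of the bottom edge.
triangle_plus = [(0.0, 1.0), (-s, -0.5), (0.0, -0.5), (s, -0.5)]

g3 = abs(gamma_p(triangle, 3))
g3_plus = abs(gamma_p(triangle_plus, 3))
print(round(g3, 3), round(g3_plus, 3))  # 1.0 0.935
```

      Under these assumptions the extra vertex changes the result in two ways: it adds a term to both sums, and it shifts the vertex mean away from the centre of the triangle.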

      L160: The definition of the center of mass is incorrect as it is not that of an extended object whose contour is defined by a polygon, but only of the set of vertices. In Figure 6 you write ”the choice of the center of mass highly influences the value of γ<sub>p</sub>” - is there really a choice of the center of mass? I thought that it was uniquely defined.

      We here only repeat the definition from Armengol-Collado et al. (2023) in order to be able to directly compare our analyses with the results presented therein. We adjusted the caption to be more clear.
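
      The distinction the reviewer draws can be illustrated with a short sketch that compares the mean of the vertex positions with the centroid of the filled polygon, computed with the standard shoelace formula. The function names and the example polygon (an equilateral triangle with one extra collinear vertex) are illustrative assumptions, not taken from the manuscript.

```python
def vertex_mean(vertices):
    """Mean of the vertex positions (depends on how many vertices are used)."""
    n = len(vertices)
    return (sum(x for x, _ in vertices) / n, sum(y for _, y in vertices) / n)

def area_centroid(vertices):
    """Centroid of the filled polygon via the shoelace formula.
    Vertices must be listed in order around the contour."""
    twice_area = cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # C = 1/(6A) * sum(...), and A = twice_area / 2.
    return (cx / (3.0 * twice_area), cy / (3.0 * twice_area))

s = 3 ** 0.5 / 2
# Equilateral triangle with an extra vertex on the bottom edge.
triangle_plus = [(0.0, 1.0), (-s, -0.5), (0.0, -0.5), (s, -0.5)]

vm = vertex_mean(triangle_plus)
ac = area_centroid(triangle_plus)
print(vm, ac)
```

      In this example the vertex mean moves to y = -0.125, while the area centroid stays at the origin, because a collinear vertex does not change the enclosed region.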

      L166: What is the weighting you refer to in Equation 9?

      We apologize, the reference is to Equation 8. We have modified this.

      L312: ”Quantifying orientational order in biological tissues can be realized by Minkowsky tensors”. As mentioned above, you do not really use them, but use Equation (7), which can be defined without reference to Minkowski tensors.

      Eq. (7) is part of the irreducible representations of the Minkowski tensor. Therefore the sentence is correct.

      L318: I do not quite understand the link between being able (or not) to compare q<sub>p</sub>’s for different values of p and the interpretability of q<sub>2</sub> and q<sub>6</sub>. Also, since you introduce q<sub>p</sub>, how can the question about their comparability be a recurrent challenge? Finally, would you agree that even though a comparison between the absolute values of q<sub>2</sub> and q<sub>6</sub> is inappropriate, one can still meaningfully compare relative changes as a parameter is changed or when comparing cells in different conditions?

      We have modified the sentence. Furthermore, we agree that one can still meaningfully compare relative changes as a parameter is changed, as we do. However, our claim that q<sub>2</sub> and q<sub>6</sub> are independent does not allow one to conclude any kind of nematic-hexatic phase transition. We have now provided further evidence using the published data of Armengol-Collado et al. (2023), which unequivocally supports this statement. We would also like to remark that the detection of a phase transition requires a single order parameter, which cannot exist as q<sub>2</sub> and q<sub>6</sub> are independent.

      We have further explained this in the main text.

      Figure 7: The axes are not labeled.

      We added the labels.

      L359: ”q<sub>2</sub> and q<sub>6</sub> values cluster tightly”, L362 ”q<sub>2</sub> and q<sub>6</sub> values become highly scattered” Please, quantify.

      We kept these formulations but have added statistical measures to these qualitative descriptions, see Supplementary Figures to Fig 7 for the distance correlation and the P-values of the distance correlation. These data support our claim of independence.

      L362: ”each q<sub>2</sub> value spans a broad range of q<sub>6</sub> values and vice versa, demonstrating their independence”. Please, use a quantitative test of statistical independence.

      We have added statistical information by using the distance correlation and statistical tests, see Supplementary Figures to Fig 7. Similar results are obtained for the Pearson correlation and corresponding tests. However, they are not included as the distance correlation is more general.
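
      For readers unfamiliar with the distance correlation, a minimal sketch of the sample statistic (following the standard double-centering construction of Székely and co-workers) and a simple permutation test is given below. This is not the authors' analysis code; the function names, the number of permutations, and the synthetic data are assumptions chosen for illustration.

```python
import math
import random

def _centered_distances(values):
    """Pairwise |a - b| matrix, double-centered (row, column, grand mean)."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row = [sum(r) / n for r in d]
    grand = sum(row) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]

def distance_correlation(x, y):
    """Sample distance correlation of two equally long 1-D samples."""
    a, b = _centered_distances(x), _centered_distances(y)
    n = len(x)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    dvar_x = sum(v * v for r in a for v in r) / n**2
    dvar_y = sum(v * v for r in b for v in r) / n**2
    if dvar_x * dvar_y == 0.0:
        return 0.0  # a constant sample carries no dependence information
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))

def permutation_p_value(x, y, n_perm=200, seed=0):
    """Fraction of shuffled pairings with a statistic at least as large."""
    rng = random.Random(seed)
    observed = distance_correlation(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if distance_correlation(x, y_perm) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Synthetic example: y depends on x, but not linearly.
x = [i / 10 - 1 for i in range(21)]
y = [v * v for v in x]
dcor_xy = distance_correlation(x, y)
dcor_xx = distance_correlation(x, x)
p_xx = permutation_p_value(x, x)
```

      The synthetic example shows why the distance correlation is the more general choice: the Pearson correlation of x and x² vanishes on a symmetric grid, whereas the distance correlation stays clearly above zero.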

      L371: Please, define Q<sub>2</sub> and Q<sub>6</sub> in the main text.

      We have now added the definition to the Materials and methods section.

      L420: A reference seems to be missing.

      Thanks for pointing this out. This was a formatting error; we only wanted to cite Balasubramaniam et al. (2021).

      L425: ”strong dependence of cell shape on cell density”. But q<sub>6</sub> seems to be rather independent of density, see Figure 11. Also, what do you mean by ”strong”? Can you quantify?

      The dependency of the cell shape on the cell density is shown in detail in (Eckert et al., 2023). Furthermore, to describe the cell shape the values for all p are needed. So the change in q<sub>2</sub> already indicates a change in the overall cell shape even as q<sub>6</sub> is barely changing. As we excluded these experimental results now in favor of the experimental data also used in Armengol-Collado et al. (2023), we did not add further evaluations regarding cell density.

      L453 ”These divergences [nonmonotonic dependence of γ<sub>p</sub> on activity or deformability] highlight the limitations of γ<sub>p</sub> in capturing consistent patterns”. I am not sure to follow your argument here.

      Besides the quantitative differences seen in comparing Fig. 1 and Fig 2 with the corresponding figures in the appendix, these results show qualitative differences. Using a method which is not robust and not continuous leads to qualitatively different results. The nonmonotonic dependence of γ<sub>p</sub> is specific to the method but not to the underlying physics.

      Appendix 3 - Figure 20: It is not clear how to compare this figure to Figure 3e of Armengol-Collado et al 2023. Please, provide more details.

      Appendix 3 - Figure 20 (Appendix 3 - Figure 25 in the revised version) and Figure 3e in Armengol-Collado et al. (2023) cannot be directly compared. Fig 3e shows results of experiments and multiphase field simulations for one parameter setting, and Fig 20 shows results of the active vertex model for various parameter settings. But both are considered using γ<sub>p</sub> and Γ<sub>p</sub>. We have added this computation, see Fig. 13, which indeed reproduces the results from Fig 3e. We refrain from considering corresponding plots to Fig 20 for the multiphase field model, as this first requires computing the vertices and no additional information can be expected.

      Reviewer 2:

      The manuscript lacks statistical information. The following should be addressed: How often have the experiments been performed? How many monolayers have been analyzed? How many time steps have been considered and in what duration? How many cells have been included in the analysis? What are the p-values to determine if q<sub>p</sub>’s (Figure 2, panel a) and γ<sub>p</sub>’s (Appendix 3-Figure 17, panel a) are significantly different? Same figures: How many cells and experiments have been considered here? Figure 11: What is the density of cells for each condition? Please provide the corresponding values. How significant are the differences? How many times has the experiment been repeated? Figure 12: Due to cell proliferation, the cell density changes over time. Does this need to be taken into account?

      We agree, our information was only qualitative. We have added the missing information. In particular, we added statistical information by using the distance correlation and statistical tests, see Supplementary Figures to Fig. 7. Similar results are obtained for the Pearson correlation and corresponding tests (not included). As we excluded the experimental results previously shown in Figure 11 and Figure 12 from the revised version in favor of the experimental data already published in Armengol-Collado et al. (2023), we did not add further statistics regarding this. We added the number of frames and cells in the text.

      The image analysis part of the Method section states that time-series were xy-drift corrected, and cells were tracked. However, the manuscript does not contain results of dynamical data, time-dependent analyses, or discussions of how q<sub>p</sub> changes over time. The authors mention that the fluidity of the tissue was confirmed by the MSD, neighbor number variance, and the self-intermediate scattering function, but none of the results are shown in the manuscript. I would like to ask the authors to provide the results and related content in the Method section.

      We have modified the description and removed all parts related to dynamical data. Because the manuscript already contains a large number of images, we refrain from providing all the results for the phase diagram that distinguishes the solid and fluid phases. These measures have been provided previously for the considered modeling approaches and serve here only as a side remark. Our results do not depend on an exact localization of a solid-fluid phase boundary.

      Additional information is missing in the Image analysis part of the Method section. Could the authors provide the information on the image analysis steps between obtaining the segmented image and inputting the parameters for the Minkowski tensor? This should include how the normal vectors have been determined and whether this has been done for all pixels along the contour.

      We added further details in the section Extraction of the contour in Experimental setup in Methods and Materials and also provide the code to compute q<sub>p</sub> for segmented images.

      The authors have analyzed low-resolution phase contrast images acquired with a 10x objective to experimentally support their introduced Minkowski tensors. This may have decreased the resolution of the cell boundary detection and its curvature. I strongly suggest imaging the tissue with higher magnification (40x or 63x) and/or fluorescent markers to visualize the cell boundaries in high quality. This would allow the authors to distinguish between circles and circle-like shapes (lines 432-434) and to further investigate differences between MDCK wild-type and MDCK E-cad KO cells.

      We agree that higher resolution of the images would be beneficial. However, we are convinced that this will not influence our findings. Instead of performing the experiments with higher magnification or using fluorescent markers, we have considered the experimental data from Armengol-Collado et al. (2023) to support our results.

      The authors have coarse-grained the shape function, Γ<sub>p</sub>, and have chosen the active vertex model (Appendix 3 - Figure 20) for comparison with the Minkowski tensors, Q<sub>p</sub> (Appendix 2 - Figure 13). In both figures, the hexatic-nematic crossover does not occur. Armengol-Collado et al. have previously reported that the Voronoi model failed to achieve the hexatic-nematic crossover and argued that this is due to the artificial enhancement of the polygon’s hexagonality, leading to high hexatic order at the tissue scale. Since the authors have used the Voronoi-tailing method (line 196), I would like to ask the authors to compare the multiphase field models for Γ<sub>p</sub> and Q<sub>p</sub> instead.

      We would like to mention that we do not consider a Voronoi model but an active vertex model. A Voronoi model is only used for initialization. Both models are certainly related but not identical, and claims for a Voronoi model do not need to hold for an active vertex model. The suggested comparison for the multi phasefield model is not an easy task, as it requires computing the vertices from the phase field variables. There are gaps between cells, and a reliable algorithm to identify the vertices is a task on its own. We, therefore, refrain from doing these calculations. Instead, we have used the experimental data from Armengol-Collado et al. (2023), for which the polygonal information is provided; see Figure 11. Especially for p = 6, strong differences can be seen by comparing the PDF obtained by the full shape and the polygonal shape. Indeed, the strong hexatic order at the cellular scale is only a consequence of the approximation by polygons. With this result, analysing the multi phasefield data by γ<sub>p</sub> does not add any new information, as this first requires an approximation by polygons.

      The authors show the q<sub>p</sub> distributions for the experimental systems (Figure 2, Figure 11). For completeness, I would like to ask the authors to also coarse-grain q<sub>p</sub> and γ<sub>p</sub> of the experimental data as shown for the computational models in Appendix 2 - Figure 13 and Appendix 2 - Figure 14. It would be interesting to see if the hexatic-nematic crossover appears. I would recommend that the authors avoid using the Voronoi tailing of the experimental system, as this may fail to obtain the crossover as explained in (5) above. Instead, I suggest using the real vertex positions for γ<sub>p</sub>, which can be obtained from the segmented images.

      It remains open what is meant by “the real vertex positions for γ<sub>p</sub>, which can be obtained from the segmented images”. Segmenting the images leads to smooth contours, partly even with gaps between cells. As the magnitude of γ<sub>p</sub> depends on the number of points used in the calculation, it is not meaningful to use all points of the contour for calculating γ<sub>p</sub>, as this would lead to artificially low values for |γ<sub>p</sub>|. Identifying the vertex positions for an approximating polygon is an issue of its own, and the consequence of this approximation is already mentioned above. For a comparison we therefore added the experimental data from Armengol-Collado et al. (2023): we used the provided vertex positions to compute q<sub>p</sub> and γ<sub>p</sub>, and we segmented the raw data and used the resulting contours to compute q<sub>p</sub>. See Figure 11. These results confirm our findings and show that the proposed nematic-hexatic phase transition is specific to the use of γ<sub>p</sub> to characterize shape.

      In order to show that shape descriptors like the shape function, γ<sub>p</sub>, introduced by Armengol-Collado et al., ’fail to capture the nuance of irregular shapes’ (line 445), the authors have compared γ<sub>p</sub> with the Minkowski tensors, q<sub>p</sub>, using the same dataset (Figure 1 with Appendix 3 - Figure 16, Figure 2 with Appendix 3 - Figure 17, and Figure 4 with Appendix 3 - Figure 15 Appendix 3). I agree that γ<sub>p</sub> and q<sub>p</sub> are different, not showing identical values. However, I see no evidence in these figures that q<sub>p</sub> describes the symmetry of a cell better than γ<sub>p</sub>, since the values are similar and vary quite similarly between different p-atic orders. What is the quantitative difference that shows the failure of the shape function to capture the nuance of irregular shapes?

      The statement already follows from the mathematical properties of robustness and stability, which is illustrated in Fig. 6. The mentioned comparisons for simulation and experimental data only demonstrate that the lack of robustness and stability of γ<sub>p</sub> also leads to different results if applied to averages of cell measures. The differences are twofold: first, the approximation of cells by polygons leads to different results, and second, even for polygons different results follow, as only one approach is continuous and the other is not. This has strong consequences for the proposed nematic-hexatic phase transition if coarse-grained. Our added results for the experimental data from Armengol-Collado et al. (2023) show that this behavior is not a physical feature but only specific to the use of γ<sub>p</sub>.

      The authors claim that the Minkowski tensors provide a ’reliable framework’ and that this framework ’opens new pathways for understanding the role of orientational symmetries in tissue mechanics and development’ (line 78-79). However, the p-atic orders in the experimental systems peak at very low orders of q<sub>p</sub> < 0.3, which may not allow conclusions about (non-)dominant orientational symmetry(ies) of cells. Can this framework be applied to experimental systems? Since the Minkowski tensors display the independence of the hexatic and nematic symmetry, the variations of cell shapes in experimental systems are too strong to provide any additional results (line 437), as stated by the authors, and no crossover was found, while the crossover was reported by Armengol-Collado et al., what new pathways can be opened to study tissues?

      We have added a comparison with experimental data from Armengol-Collado et al. (2023) and demonstrate that the proposed nematic-hexatic transition is only specific to the use of γ<sub>p</sub> for characterizing the shape. So our results first of all essentially close the “pathway for understanding the role of orientational symmetries in tissue mechanics and development” that was proposed on the basis of this nematic-hexatic transition. On the other hand, even if q<sub>p</sub> peaks at relatively low values, the results demonstrate independence of the measures for different p, for two different modeling approaches and two different sets of experimental data. This motivates considering p-atic order for different p simultaneously. Such theories of “multi”-p-atic liquid crystals, as proposed in the conclusions, are the mentioned new pathways.

      In principle, the introduced Minkowski tensors integrate the orientation of the normal vectors (Equation 6) and consider the perimeter of the contour (Equation 1). Do the tensors distinguish between convex and concave curvature since both are present in tissues? Does a square with 4 concave and a square with 4 convex edges (same curvature) have the same q<sub>p</sub> values?

      For the specific situation of a square with 4 concave or 4 convex edges even p would lead to the same orientation and the same value for q<sub>p</sub>, as even p have a 180 degree symmetry. Odd p would result in the same value for q<sub>p</sub> but in a different orientation ϑ<sub>p</sub>. In more general cases, e.g. shapes with concave and convex edges, no general statements can be made. In general the theoretical results on stability of q<sub>p</sub> only hold for convex shapes. However, as discussed in Methods and materials the known counterexamples for concave shapes are not relevant for cell shapes.

      In lines 169-172 and Figure 6, the authors report a jump in γ<sub>p</sub>. Why has the fourth vertex in the last image been removed? The vertices are essential for the calculation of γ<sub>p</sub>. If the fourth vertex is not removed, the following values result: γ<sub>3</sub> = 0.935 and γ<sub>4</sub> = 0.474, which leads to changes of the same order of magnitude as those of q<sub>p</sub>. I think it is therefore not the choice of the center of mass that ’heavily influences the value of γ<sub>p</sub>’, but the removal of the fourth vertex.

      We adjusted the caption to make our point more clear. The last image is a triangle and according to the definition of γ<sub>p</sub> is therefore described by only three vertices. The reviewer is right that the value of γ<sub>p</sub> has a strong dependency of the number of used vertices, this is exactly the point that we are trying to make with this figure. An equilateral triangle should be recognized as an equilateral triangle, no matter if there is an artificial fourth vertex or not. The triangle in our picture and the triangle that the reviewer described (so our triangle with an artificial fourth vertex) both have the shape of an equilateral triangle, yet for one |γ<sub>3</sub>| = 1.0 and for the other one it is |γ<sub>3</sub>| = 0.935. This can be seen even more clearly if even more artificial vertices on the outline of the equilateral triangle are added, which will decrease |γ<sub>3</sub>| even more. Furthermore, we think there was a misunderstanding regarding our statement about the center of mass. The general problem of γ<sub>p</sub> - so the dependence of the values on the number of vertices - is independent of the calculation of the center of mass. The exact values of γ<sub>p</sub> on the other hand depend on the choice of this. We follow Armengol-Collado et al. (2023) and use the mean of all vertex coordinates as center of mass. If the reviewer would use the center of mass of the equilateral triangle and do the same calculations the resulting values for γ<sub>p</sub> would be different. This is what we meant with ’heavily influences the value of γ<sub>p</sub>’.
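The vertex-count dependence discussed in this exchange can be checked numerically. The sketch below is not the authors' code. It assumes the vertex-based form γ<sub>p</sub> = Σ<sub>v</sub> |r<sub>v</sub>|<sup>p</sup> e<sup>ipφ<sub>v</sub></sup> / Σ<sub>v</sub> |r<sub>v</sub>|<sup>p</sup>, with positions measured from the mean of the vertex coordinates, and a polygon version of the normal-based measure, q<sub>p</sub> = |Σ<sub>k</sub> L<sub>k</sub> e<sup>ipθ<sub>k</sub></sup>| / Σ<sub>k</sub> L<sub>k</sub>, with edge lengths L<sub>k</sub> and outward normal angles θ<sub>k</sub>. It also assumes that the artificial fourth vertex sits at the midpoint of one edge. The names `gamma_p`, `q_p`, `triangle`, and `triangle_plus` are hypothetical.

```python
import cmath
import math


def gamma_p(vertices, p):
    """Vertex-based shape function (assumed form, see lead-in).

    Positions are measured from the mean of the vertex coordinates.
    """
    cx = sum(x for x, _ in vertices) / len(vertices)
    cy = sum(y for _, y in vertices) / len(vertices)
    num = 0j
    den = 0.0
    for x, y in vertices:
        r = math.hypot(x - cx, y - cy)
        phi = math.atan2(y - cy, x - cx)
        num += r**p * cmath.exp(1j * p * phi)
        den += r**p
    return num / den


def q_p(vertices, p):
    """Normal-based measure for a polygon (assumed form, see lead-in).

    Vertices must be listed counter-clockwise so that (dy, -dx)
    is the outward normal of each edge.
    """
    num = 0j
    perimeter = 0.0
    n = len(vertices)
    for k in range(n):
        x0, y0 = vertices[k]
        x1, y1 = vertices[(k + 1) % n]
        dx, dy = x1 - x0, y1 - y0
        length = math.hypot(dx, dy)
        theta = math.atan2(-dx, dy)  # angle of the outward normal (dy, -dx)
        num += length * cmath.exp(1j * p * theta)
        perimeter += length
    return abs(num) / perimeter


h = math.sqrt(3) / 2
# Equilateral triangle with unit circumradius, counter-clockwise.
triangle = [(1.0, 0.0), (-0.5, h), (-0.5, -h)]
# Same outline with an artificial vertex at the midpoint of one edge (assumption).
triangle_plus = [(1.0, 0.0), (-0.5, h), (-0.5, 0.0), (-0.5, -h)]

for name, shape in (("triangle", triangle), ("triangle + vertex", triangle_plus)):
    print(name,
          "|gamma_3| =", round(abs(gamma_p(shape, 3)), 3),
          "|gamma_4| =", round(abs(gamma_p(shape, 4)), 3),
          "q_3 =", round(q_p(shape, 3), 3))
```

Under these assumptions the extra vertex lowers |γ<sub>3</sub>| from 1.0 to roughly the 0.935 quoted by the reviewer, while q<sub>3</sub> stays unchanged, because splitting an edge alters neither the edge normals nor the total edge length.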

      In Appendix 3 - Figure 18, the authors show that the shape function, γ<sub>6</sub>, exhibits a non-monotonic trend as a function of activity and deformability. I have no objection to this statement. However, I would like to ask the authors to check the values for γ<sub>6</sub>. In the bottom-left corner, for example, γ<sub>6</sub> = 0.55. This value seems very low to me. In Appendix 3-Figure 20, |Q<sub>6</sub>| for R/Rcell = 2 is already in this range, while |Q<sub>6</sub>| for R/Rcell = 1 (not shown), corresponding to γ<sub>6</sub>, must be even higher. Also, the parameters p<sub>6</sub> = 3.5 and v<sub>0</sub> = 0.1 should result in a nearly hexagonal lattice, which should be captured with high γ<sub>6</sub> values. I would expect γ<sub>6</sub> to be in the same range as q<sub>6</sub>.

      Many thanks for pointing this out. There are two different points addressed in this question. The first is whether |Γ<sub>p</sub>| is too high. We checked the values: |Γ<sub>p</sub>| = 0.5075 for R/R<sub>cell</sub> = 2, so it is lower than 0.58. The second question is why γ<sub>p</sub> and q<sub>p</sub> are not in the same value range. You are right that for a perfectly hexagonal lattice both should give the same value, namely 1.0. However, even at p<sub>6</sub> = 3.5 and v<sub>0</sub> = 0.1 this is not a perfectly hexagonal lattice anymore, and the values of q<sub>6</sub> and |γ<sub>6</sub>| drop at different rates as we move away from a perfect hexagon. As q<sub>p</sub> is stable and only changes slightly for slight changes in the shape, it makes sense that q<sub>p</sub> is still close to 1.0. We included an image, see below, of one time step at these parameter values to show that the cells no longer form a perfect hexagonal lattice.

      Reviewer 3:

      Could the authors show why and how this method could bring new information which were missing so far in the understanding of morphogenesis in vitro and in vivo with the current quantification?

      The introduction provides examples of how orientational order and its topological defects can be linked to morphological changes in tissues. The orientational order emerges from the shape of the cells. Most commonly nematic order has been considered, but more recently also hexatic order and even a nematic-hexatic crossover on larger scales. This suggests a mechanical mechanism for morphogenesis, like a phase transition from hexatic to nematic, which would have consequences on the evolution of shape. We demonstrate that the measures q<sub>2</sub> and q<sub>6</sub> are independent. Furthermore, the proposed nematic-hexatic transition is only specific to the use of γ<sub>p</sub> for characterizing the shape and coarse-graining of the associated order. These measures are not robust and therefore should not be used. Results for the robust measures q<sub>p</sub> suggest considering all p in a coarse-grained theory to model morphological changes in tissues.

      Could authors show quantitative comparisons between available methods with the same sets of data and highlight pros and cons?

      Author response image 1.

      Screenshot from p<sub>6</sub> = 3.5 and v<sub>0</sub> = 0.1

      In addition to what was already done for the simulation data, we have added data from Armengol-Collado et al. (2023) and compared the results for q<sub>p</sub> and Q<sub>p</sub> and γ<sub>p</sub> and Γ<sub>p</sub>. The theoretical results and the illustrating example in Fig. 6 already show that there are no pros for γ<sub>p</sub>. Other methods belong to the class of bond-order methods and measure neighbor relations instead of shape. We already comment that these methods are inappropriate to classify shape; see Methods and materials, last sentence, and Mickel et al. (2013) for a detailed discussion of why these methods are not robust.

      Instead of using phase contrast images, which exhibit curved cell-cell contours, could authors use data with E-cadherin staining instead - as used in many epithelial studies in vitro and in vivo? Could they show both images for wild type and for the E-cadherin KO cell lines with fluorescent readout?

      We are convinced that our results do not depend on the way to visualize the cell contours. Furthermore the images do not provide additional information. To further strengthen the experimental part of the manuscript, we instead analyzed data from Armengol-Collado et al. (2023).

      They confirm our findings.

      The authors acknowledge differences in density between cell lines p. 13 so this calls for new experiments with solid readouts and analysis using comparable experimental conditions.

      Additionally, we analyzed data from Armengol-Collado et al. (2023) which confirm our findings. Our results are now supported by two different modeling approaches and two different experimental settings. Because of redundancy we removed the original experimental data from the revised manuscript.

    1. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


      Referee #1

      Evidence, reproducibility and clarity

      Summary:

      This is a very insightful work showing how to disentangle one of the most complex transcriptional networks in yeast (S. cerevisiae) by combining single-cell dynamics, dynamical-systems modeling, Bayesian-style inference, and genetic perturbations. The authors tackle a problem that has eluded quantitative resolution for over two decades-how yeast regulates its seven primary glucose importer genes (HXT1-HXT7) in response to both steady and temporally changing extracellular [glucose]. Their integrated experimental-theoretical approach delivers the most satisfying mechanistic and quantitative explanation to date, and I enthusiastically recommend this manuscript for publication.

      Yeast relies on seven passive hexose transporters (Hxt1-Hxt7) to import glucose, its preferred sugar; deleting all seven abolishes growth on glucose. The underlying regulatory network is exceptionally intricate, reflecting yeast's evolutionary priority for glucose. Two membrane sensors-Snf3 (high affinity) and Rgt2 (low affinity)-detect extracellular glucose and thereby inactivate two co-repressors, Mth1 and Std1, which modulate the DNA-binding factor Rgt1. Concurrently, intracellular glucose activates the SNF1 kinase, phosphorylating and exporting the repressor Mig1, while Mth1/Std1 also govern the transcription and stability of Mig2, another DNA-binding repressor. Together, Rgt1, Mig1, and Mig2 integrate these inputs to control HXT promoter activity (Fig. 2A). Importantly, Mth1 and Std1 do not directly bind to DNA and this complication - the protein-protein interaction that one cannot get from DNA sequence - is just one source of difficulty that the authors overcame.

      To map the network's behavior, the authors used microfluidic "cages" housing single cells expressing GFP-tagged HXTs, monitoring fluorescence under three constant glucose levels-low (0.01%), medium (0.1%), and high (1%) (Fig. 1B-C). The authors confirm that steady-state Hxt abundances rank by transporter affinity. But the more important and surprising discovery is that when the cells were subjected to gradual glucose up-shifts and down-shifts, they discovered that some transporters transiently spike only when [glucose] rises and others only when [glucose] falls (Fig. 1C and Fig. S1F). This discovery establishes that the HXT network not only "senses" the absolute external [glucose] concentration but also the direction of the temporal change in external [glucose].

      To understand how the regulatory network yields such intricate temporal changes in HXT expression, the authors first focused on the medium-affinity transporter, Hxt4. Targeted knockouts of Mig1/Mig2 versus Mth1/Std1 confirmed that Hxt4 dynamics arise from differential repressor kinetics. To formalize these findings, the authors built an ODE model grounded in literature-based constraints (pg. 13 of the Supplement) with explicit separation of repressor timescales. They rigorously fit the model to wild-type and knockout time series-exploring parameter sensitivity in depth (Fig. S5).

      The authors discovered that their model and experiments converged on a push-pull mechanism: fast-acting Mig1/Mig2 dominate during glucose up-shifts, while slower Mth1/Std1 govern down-shifts, determining whether each HXT gene is repressed or de-repressed (i.e., "who gets there first"). Extending this analysis across all seven HXTs via approximate Bayesian computation revealed the most likely repressor-promoter interactions for each transporter, reducing a vast parameter space to unique or small sets of plausible regulatory schemes. The authors thus revealed what could be happening and which regulations are improbable - a more nuanced and comprehensive view than giving just one outcome for each HXT.
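The "who gets there first" logic summarized above can be illustrated with a toy calculation. This is a minimal sketch and not the authors' ODE model. It assumes one fast repressor that is active in glucose (Mig1/Mig2-like), one slow repressor that is present without glucose (Mth1/Std1-like), first-order relaxation towards glucose-dependent targets, and Hill-type repression of a single promoter. All names and parameter values (`tau_fast`, `tau_slow`, `k`, `n`) are invented for illustration, and glucose changes as a step rather than a ramp.

```python
def hill_repression(level, k=0.3, n=4):
    """Fraction of promoter activity left at a given repressor level."""
    return 1.0 / (1.0 + (level / k) ** n)


def simulate_step(g_before, g_after, tau_fast=0.1, tau_slow=1.0,
                  t_end=5.0, dt=0.001):
    """Promoter activity over time after a glucose step (forward Euler).

    fast: repressor active in glucose, relaxes towards g with tau_fast.
    slow: repressor present without glucose, relaxes towards 1 - g
          with tau_slow.
    Glucose g is scaled to the range 0..1 (hypothetical units).
    """
    fast = g_before          # start at the pre-step steady state
    slow = 1.0 - g_before
    trace = []
    for _ in range(int(t_end / dt)):
        trace.append(hill_repression(fast) * hill_repression(slow))
        fast += dt * (g_after - fast) / tau_fast
        slow += dt * ((1.0 - g_after) - slow) / tau_slow
    return trace


up = simulate_step(0.0, 1.0)     # glucose up-shift
down = simulate_step(1.0, 0.0)   # glucose down-shift
peak_up = max(up)
peak_down = max(down)
print("peak activity, up-shift:  ", round(peak_up, 3))
print("peak activity, down-shift:", round(peak_down, 3))
```

In this toy setting the promoter is repressed at both steady states, but a down-shift opens a transient window: the fast repressor has already left while the slow one has not yet returned. An up-shift opens no such window, because the fast repressor arrives before the slow one has decayed.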

      Overall, this work represents a role model - textbook-worthy - for quantitative systems biology. Beyond the rigor and novelty of its findings, the authors explain complex mathematical concepts with clarity, and the narrative flows logically from experiment to model to inference. This study provides a definitive mechanistic resolution of the HXT network and establishes a broadly applicable framework for dissecting dynamic and complex gene circuits.

      Major points:

      I don't recommend any new experiments or modeling; the major claims are already well supported by the data and models. Below are comments and questions intended to improve clarity and facilitate the reader's understanding. Please feel free to disregard any that you find not sensible or beyond the scope of the current work.

      1. Preconditioning (Fig. 1B-C): What medium were cells in immediately before t = 0? Were they in log-phase or stationary-phase growth just prior to the glucose addition?
      2. Transporter ranking in medium glucose: In the medium [glucose] regime, why is a low-affinity Hxt the second-most highly expressed, rather than the next-highest-affinity transporter? Could co-expression of multiple affinities (e.g., as a bet-hedging strategy) be advantageous? The Discussion section already mentions bet-hedging but I think you could further discuss ideas such as evolutionarily trained "Pavlovian" response or what the 2nd-ranking says about what the yeast anticipates as an upcoming change in the environment.
      3. Defining low/medium/high regimes: Low = 0.01%, Medium = 0.1%, and High = 1%. This is indeed in line with the standard classification of [glucose] in the literature regarding HXTs. But how might your results change at intermediate concentrations, that is, those between these three levels? Using the model, could you comment on whether HXT expression dynamics "sharply" change as a function of either the [glucose]/time or the final concentration of [glucose] after the ramping-up phase?
      4. Rate-affinity trade-off (Lines 18-20): Give a brief explanation of the rate-affinity trade-off. Why does higher affinity necessarily entail a lower maximal transport rate (Vₘₐₓ) for passive transporters? Perhaps you can give an intuitive explanation backed by mass-action kinetics (e.g., to attain a higher affinity, the glucose-binding pocket on Hxt cannot be flipping rapidly back-and-forth between facing cytoplasm and extracellular space -- the binding pocket must allow sufficient time for molecule to find and bind it).
      5. Single-transporter expression (Lines 39-40): It's unclear to me why cells would express only the "optimal" Hxt and suppress all others. For instance, a bet-hedging strategy might favor simultaneous expression of multiple affinities. Consider revising these lines or adding a brief explanation. Related to the above is a subtle point that I think was glossed over: there must be a fitness cost associated with making too many copies of Hxtn. After all, why not make as many transporters as possible? Is the cell operating near the upper limit of Hxt abundance, beyond which there's a fitness cost? Is there a Pareto-optimal-type front in the space of expression level and another axis? I think this could go into the Discussion section.
      6. Hxt5 exception (Fig. 1B): Although Hxt5 follows a distinct regulatory scheme, it is most highly expressed at medium [glucose] (0.1%), consistent with its affinity like the other Hxts. I think you could mention this in lines 51-58.
      7. Glucose-ramp details (Fig. 1C; Lines 66-67): You state that [glucose] rises from 0 to 1 % over 15 min and reaches 1 % at t = 3 h. However, the actual ramp slope ([glucose]/time) and when the [glucose] starts to increase from zero aren't specified. The Hxt5-GFP behavior and differing Hxt6/7 levels at t = 0 vs. t = 20 h suggest the ramp may begin later than t = 0. Please clarify these details in the caption and main text, and consider adding a [glucose] vs. time schematic above the panel in Fig. 1C (like in Fig. 1B).
      8. Pre-t < 0 incubation (Fig. 1C): Related to point 1, how long were the cells incubated in pyruvate (or other medium) before t < 0? The Hxt6-GFP level at t = 20 h does not match that at t = 0; what is the timescale for Hxt6-GFP and Hxt7-GFP decay to steady state after glucose removal?
      9. Hxt-GFP localization: Does the reported Hxt#-GFP level include fluorescence from both the plasma membrane and internal compartments (e.g., vacuole)? Clarifying which pools of fluorescence are quantified would help interpretation, even if the main conclusions are unchanged.
      10. "Predominantly transcriptional" wording (Lines 90-92): The phrase "...the regulation is predominantly transcriptional" should specify that it refers to the induction of HXT4 transcription during glucose down-ramping, rather than the subsequent decrease in Hxt4-GFP. The experiments do not rule out post-translational regulation (e.g., endocytosis) once glucose levels fall below a threshold.
      11. Glucose "protection" of Hxt4 (Lines 121-122): The statement "we allowed glucose to protect Hxt4 from degradation" is unclear. First, Hxt4-GFP likely degrades at a different rate than free GFP-you could estimate its half-life from Fig. S3. Second, please explain precisely what "protection" means in the model or experiment.
      12. Quantifying repressor kinetics (Lines 158-162): The push-pull mechanism is compelling, but it would be helpful to report the quantitative separation of timescales-e.g., how much faster do Mig1/Mig2 respond compared to Mth1/Std1? Including fold-difference would strengthen this explanation.
      13. Mechanism of repressor regulation (Lines 197-213): Be clearer about whether and how changes in extracellular glucose alter the expression levels of Mth1, Std1, Mig1, and Mig2, as opposed to modulating say, how Mth1 and Std1 bind to Rgt2 protein. I think you could be clearer here about which regulatory steps (transcriptional, post-translational, or binding-affinity changes) are assumed in the model and supported by the data.

      Minor points:

      1. Abstract: Original: "...how an HXT for a medium-affinity transporter can be made to respond like the HXTs for the other transporters." Suggestion: "...how the gene-expression regulation of a medium-affinity HXT can be rewired to respond like that of any other HXT." (You might also generalize beyond "medium-affinity" if the converse holds.)
      2. Lines 64-66: Please emphasize that the "synthetic complete medium" used for pre-conditioning contains no glucose.
      3. Line 143: The phrase "low expression of the std1Δ strain in glucose" is ambiguous: low expression of which gene or reporter? Please specify.
      4. Line 240: Change "should weakened" to "should weaken."
      5. Fig. S9 caption (typo) Change "Rtg1 sites are..." to "Rgt1 sites are...."

      Hyun Youk.

      Referee cross-commenting

      I agree with the other reviewers' comments. The other reviewers noticed important points I have missed. But like them, I'm still supportive of the work being published with < 1 month spent on revision. I still don't recommend any further experiments or modeling.

      Significance

      This is a very insightful work showing how to disentangle one of the most complex transcriptional networks in yeast (S. cerevisiae) by combining single-cell dynamics, dynamical-systems modeling, Bayesian-style inference, and genetic perturbations. The authors tackle a problem that has eluded quantitative resolution for over two decades-how yeast regulates its seven primary glucose importer genes (HXT1-HXT7) in response to both steady and temporally changing extracellular [glucose]. Their integrated experimental-theoretical approach delivers the most satisfying mechanistic and quantitative explanation to date, and I enthusiastically recommend this manuscript for publication via Review Commons.

    1. Overall thoughts: This is an interesting history piece regarding peer review and the development of review over time. Given the author’s conflict of interest and association with the Centre developing MetaROR, I think that this paper might be a better fit for an information page or introduction to the journal and rationale for the creation of MetaROR, rather than being billed as an independent article. Alternatively, more thorough information about advantages to pre-publication review or more downsides/challenges to post-publication review might make the article seem less affiliated. I appreciate seeing the history and current efforts to change peer review, though I am not comfortable broadly encouraging use of these new approaches based on this article alone.

      Page 3: It’s hard to get a feel for the timeline given the dates that are described. We have peer review becoming standard after WWII (after 1945), definitively established by the second half of the century, an example of obligatory peer review starting in 1976, and in crisis by the end of the 20th century. I would consider adding examples that better support this timeline – did it become more common in specific journals before 1976? Was the crisis by the end of the 20th century something that happened over time or something that was already intrinsic to the institution? It doesn’t seem like enough time to get established and then enter crisis, but more details/examples could help make the timeline clear. 

      Consider discussing the benefits of the traditional model of peer review.

      Table 1 – Most of these are self-explanatory to me as a reader, but not all. I don’t know what a registered report refers to, and it stands to reason that not all of these innovations are familiar to all readers. You do go through each of these sections, but that’s not clear when I initially look at the table. Consider having a more informative caption. Additionally, the left column is “Course of changes” here but “Directions” in text. I’d pick one and go with it for consistency.

      3.2: Consider mentioning your conflict of interest here where MetaROR is mentioned.

      With some of these methods, there’s the ability to also submit to a regular journal. Going to a regular journal presumably would instigate a whole new round of review, which may or may not contradict the previous round of post-publication review and would increase the length of time to publication by going through both types. If someone has a goal to publish in a journal, what benefit would they get by going through the post-publication review first, given this extra time?

      There’s a section talking about institutional change (page 14). It mentions that openness requires three conditions – people taking responsibility for scientific communication, authors and reviewers, and infrastructure. I would consider adding some discussion of readers and evaluators. Readers have to be willing to accept these papers as reliable, trustworthy, and respectable to read and use the information in them. Evaluators such as tenure committees and potential employers would need to consider papers submitted through these approaches as evidence of scientific scholarship for the effort to be worthwhile for scientists.

      Based on this overview, which seems somewhat skewed towards the merits of these methods (conflict of interest, limited perspective on downsides to new methods/upsides to old methods), I am not quite ready to accept this effort as equivalent of a regular journal and pre-publication peer review process. I look forward to learning more about the approach and seeing this review method in action and as it develops.

    2. Response to the Editors and the Reviewers

      I am sincerely grateful to the editors and peer reviewers at MetaROR for their detailed feedback and valuable comments and suggestions. I have addressed each point below.

      Handling editor

      1. “However, the article’s progression and arguments, along with what it seeks to contribute to the literature need refinement and clarification. The argument for PRC is under-developed due to a lack of clarity about what the article means by scientific communication. Clarity here might make the endorsement of PRC seem like less of a foregone conclusion.”

      The structure of the paper (and discussion) has changed significantly to address the feedback.

      2. “I strongly endorse the main theme of most of the reviews, which is that the progression and underlying justifications for this article’s arguments needs a great deal of work. In my view, this article’s main contribution seems to be the evaluation of the three peer review models against the functions of scientific communication. I say ‘seems to be’ because the article is not very clear on that and I hope you will consider clarifying what your manuscript seeks to add to the existing work in this field. In any case, if that assessment of the three models is your main contribution, that part is somewhat underdeveloped. Moreover, I never got the sense that there is clear agreement in the literature about what the tenets of scientific communication are. Note that scientific communication is a field in its own right.”

      I have implemented a more rigorous approach to argumentation in response. “Scientific communication” was replaced by “scholarly communication.”

      3. “I also agree that paper is too strongly worded at times, with limitations and assumptions in the analysis minimised or not stated. For example, all of the typologies and categories drawn could easily be reorganised and there is a high degree of subjectivity in this entire exercise. Subjective choices should be highlighted and made salient for the reader. Note that greater clarity, rigour, and humility may also help with any alleged or actual bias.”

      I have incorporated the conceptual framework and description of the research methodology. However, the Discussion section reflects my personal perspective in some points, which I have explicitly highlighted to ensure clarity.

      4. “I agree with Reviewer 3 that the ‘we’ perspective is distracting.”

      This has been fixed.

      5. “The paragraph starting with ‘Nevertheless’ on page 2 is very long.”

      The text was restructured.

      6. “There are many points where language could be shortened for readability, for example:

      Page 3: ‘decision on publication’ could be ‘publication decision’.

      Page 5: ‘efficiency of its utilization’ could be ‘its efficiency’.

      Page 7: ‘It should be noted…’ could be ‘Note that…’.”

      I have proofread the text.

      7. “Page 7: ‘It should be noted that..’ – this needs a reference.”

      This statement has been moved to the Discussion section and paraphrased, and a reference has been added:

      “It should be also noted that peer review innovations pull in opposing directions, with some aiming to increase efficiency and reduce costs, while others aim to promote rigor and increase costs (Kaltenbrunner et al., 2022).”

      8. “I’m not sure that registered reports reflect a hypothetico-deductive approach (page 6). For instance, systematic reviews (even non-quantitative ones) are often published as registered reports and Cochrane has required this even before the move towards registered reports in quantitative psychology.”

      I have added this clarification.

      9. “I agree that modular publishing sits uneasily as its own chapter.”

      Modular publishing has been combined with registered reports into the deconstructed publication group of models, now Section 5.1.

      10. “Page 14: ‘The "Publish-Review-Curate" model is universal that we expect to be the future of scientific publishing. The transition will not happen today or tomorrow, but in the next 5-10 years, the number of projects such as eLife, F1000Research, Peer Community in, or MetaROR will rapidly increase’. This seems overly strong (an example of my larger critique and that of the reviewers).”

      This part of the text has been rewritten.

      Reviewer 1

      11. “For example, although Model 3 is less chance to insert bias to the readers, it also weakens the filtering function of the review system. Let’s just think about the dangers of machine-generated articles, paper-mills, p-hacked research reports and so on. Although the editors do some pre-screening for the submissions, in a world with only Model 3 peer review the literature could easily get loaded with even more ‘garbage’ than in a model where additional peers help the screening.”

      I think that generated text is better detected by software tools. At the same time, I have tried to describe the pros and cons of the different models in a more balanced way in the concluding section.

      12. “Compared to registered reports other aspects can come to focus that Model 3 cannot cover. It’s the efficiency of researchers’ work. In the case of registered reports, Stage 1 review can still help researchers to modify or improve their research design or data collection method. Empirical work can be costly and time-consuming and post-publication review can only say that ‘you should have done it differently then it would make sense’.”

      Thank you very much for this valuable contribution. I have added this statement on p. 11.

      13. “Finally, the author puts openness as a strength of Model 3. In my eyes, openness is a separate question. All models can work very openly and transparently in the right circumstances. This dimension is not an inherent part of the models.”

      I think that this model, which provides peer reviews for all submissions, ensures maximum transparency. However, I have made an effort to make the wording more balanced and to distinguish my personal perspective from the literature.

      14. “In conclusion, I would not make verdict over the models, instead emphasize the different functions they can play in scientific communication.”

      This idea is now reflected in the concluding section.

      15. “A minor comment: I found that a number of statements lack references in the Introduction. I would have found them useful for statements such as ‘There is a point of view that peer review is included in the implicit contract of the researcher.’”

      Thank you for your feedback. I have implemented a more rigorous approach to argumentation in response.

      Reviewer 2

      16. “The primary weakness of this article is that it presents itself as an 'analysis' from which they 'conclude' certain results such as their typology, when this appears clearly to be an opinion piece. In my view, this results in a false claim of objectivity which detracts from what would otherwise be an interesting and informative, albeit subjective, discussion, and thus fails to discuss the limitations of this approach.”

      I have incorporated the conceptual framework and description of the research methodology. However, the Discussion section reflects my personal perspective in some points, which I have explicitly highlighted to ensure clarity.

      17. “A secondary weakness is that the discussion is not well structured and there are some imprecisions of expression that have the potential to confuse, at least at first.”

      The structure of the paper (and discussion) has changed significantly.

      18. “The evidence and reasoning for claims made is patchy or absent. One instance of the former is the discussion of bias in peer review. There are a multitude of studies of such bias and indeed quite a few meta-analyses of these studies. A systematic search could have been done here but there is no attempt to discuss the totality of this literature. Instead, only a few specific studies are cited. Why are these ones chosen? We have no idea. To this extent I am not convinced that the references used here are the most appropriate.”

      I have reviewed the existing references and incorporated additional sources. However, the study does not claim to conduct a systematic literature review; rather, it adopts an interpretative approach to literature analysis.

      19. “Instances of the latter are the claim that ‘The most well-known initiatives at the moment are ResearchEquals and Octopus’ for which no evidence is provided, the claim that ‘we believe that journal-independent peer review is a special case of Model 3’ for which no further argument is provided, and the claim that ‘the function of being the "supreme judge" in deciding what is "good" and "bad" science is taken on by peer review’ for which neither is provided.”

      Thank you for your feedback. I have implemented a more rigorous approach to argumentation in response.

      20. “A particular example of this weakness, which is perhaps of marginal importance to the overall paper but of strong interest to this reviewer is the rather odd engagement with history within the paper. It is titled "Evolution of Peer Review" but is really focussed on the contemporary state-of-play. Section 2 starts with a short history of peer review in scientific publishing, but that seems intended only to establish what is described as the 'traditional' model of peer review. Given that that short history had just shown how peer review had been continually changing in character over centuries - and indeed Kochetkov goes on to describe further changes - it is a little difficult to work out what 'traditional' might mean here; what was 'traditional' in 2010 was not the same as what was 'traditional' in 1970. It is not clear how seriously this history is being taken. Kochetkov has earlier written that "as early as the beginning of the 21st century, it was argued that the system of peer review is 'broken'" but of course criticisms - including fundamental criticisms - of peer review are much older than this. Overall, this use of history seems designed to privilege the experience of a particular moment in time, that coincides with the start of the metascience reform movement.”

      While the paper addresses some aspects of peer review history, it does not provide a comprehensive examination of this topic. A clarifying statement to this effect has been included in the methodology section.

      “… this section incorporates elements of historical analysis, it does not fully qualify as such because primary sources were not directly utilized. Instead, it functions as an interpretative literature review, and one that is intentionally concise, as a comprehensive history of peer review falls outside the scope of this research”.

      21. “Section 2 also demonstrates some of the second weakness described, a rather loose structure. Having moved from a discussion of the history of peer review to detail the first model, 'traditional' peer review, it then also goes on to describe the problems of this model. This part of the paper is one of the best - and best - evidenced. Given the importance of it to the main thrust of the discussion it should probably have been given more space as a Section all on its own.”

      This section (now Section 4) has been extended, see also previous comment.

      22. “Another example is Section 4 on Modular Publishing, in which Kochetkov notes "Strictly speaking, modular publishing is primarily an innovative approach for the publishing workflow in general rather than specifically for peer review." Kochetkov says "This is why we have placed this innovation in a separate category" but if it is not an innovation in peer review, the bigger question is 'Why was it included in this article at all?'.”

      Modular publishing has been combined with registered reports into the deconstructed publication group of models, now Section 5.1.

      23. “One example of the imprecisions of language is as follows. The author also shifts between the terms 'scientific communication' and 'science communication' but, at least in many contexts familiar to this reviewer, these are not the same things, the former denoting science-internal dissemination of results through publication (which the author considers), conferences and the like (which the author specifically excludes) while the latter denotes the science-external public dissemination of scientific findings to non-technical audiences, which is entirely out of scope for this article.”

      Thank you for your remark. As a non-native speaker, I initially did not grasp the distinction between the terms. However, I believe the phrase ‘scholarly communication’ is the most universally applicable term. This adjustment has now been incorporated into the text.

      24. “A final note is that Section 3, while an interesting discussion, seems largely derivative from a typology of Waltman, with the addition of a consideration of whether a reform is 'radical' or 'incremental', based on how 'disruptive' the reform is. Given that this is inherently a subjective decision, I wonder if it might not have been more informative to consider 'disruptiveness' on a scale and plot it accordingly. This would allow for some range to be imagined for each reform as well; surely reforms might be more or less disruptive depending on how they are implemented. Given that each reform is considered against each model, it is somewhat surprising that this is not presented in a tabular or graphical form.”

      Ultimately, I excluded this metric due to its current reliance on purely subjective judgment. Measuring 'disruptiveness', e.g., through surveys or interviews, remains a task for future research.

      25. “Reconceptualize this as an opinion piece. Where systematic evidence can be drawn upon to make points, use that, but don't be afraid to just present a discussion from what is clearly a well-informed author.”

      I cannot definitively classify this work as an opinion piece. In fact, this manuscript synthesizes elements of a literature review, research article, and opinion essay. My idea was to integrate the strengths of all three genres.

      26. “Reconsider the focus on history and 'evolution' if the point is about the current state of play and evaluation of reforms (much as I would always want to see more studies on the history and evolution of peer review).”

      I have revised the title to better reflect the study’s scope and explicitly emphasize its focus on contemporary developments in the field.

      “Peer Review at the Crossroads”

      27. “Consider ways in which the typology might be expanded, even if at subordinate level.”

      I have updated the typology and introduced a third tier where applicable (see Fig. 2).

      Reviewer 3

      28. “In my view, the biggest issue with the current peer review system is the low quality of reviews, but the manuscript only mentions this fleetingly. The current system facilitates publication bias, confirmation bias, and is generally very inconsistent. I think this is partly due to reviewers’ lack of accountability in such a closed peer review system, but I would be curious to hear the author’s ideas about this, more elaborately than they provide them as part of issue 2.”

      I have elaborated on this issue in the footnote.

      29. “I’m missing a section in the introduction on what the goals of peer review are or should be. You mention issues with peer review, and these are mostly fair, but their importance is only made salient if you link them to the goals of peer review. The author does mention some functions of peer review later in the paper, but I think it would be good to expand that discussion and move it to a place earlier in the manuscript.”

      The functions of peer review are summarized in the first paragraph of Introduction.

      30. “Table 1 is intuitive but some background on how the author arrived at these categorizations would be welcome. When is something incremental and when is something radical? Why are some innovations included but not others (e.g., collaborative peer review, see https://content.prereview.org/how-collaborative-peer-review-can-transform-scientific-research/)?”

      Collaborative peer review, namely PREreview, was mentioned in the context of Model 3 (Publish-Review-Curate). However, I have extended this part of the paper.

      31. “‘Training of reviewers through seminars and online courses is part of the strategies of many publishers. At the same time, we have not been able to find statistical data or research to assess the effectiveness of such training.’ (p. 5) There is some literature on this, although not recent. See work by Sara Schroter, for example (Schroter et al., 2004; Schroter et al., 2008).”

      Thank you very much, I have added these studies and a few more recent ones.

      32. “‘It should be noted that most initiatives aimed at improving the quality of peer review simultaneously increase the costs.’ (p. 7) This claim needs some support. Please explicate why this typically is the case and how it should impact our evaluations of these initiatives.”

      I have moved this part to the Discussion section.

      33. “I would rephrase “Idea of the study” in Figure 2 since the other models start with a tangible output (the manuscript). This is the same for registered reports where they submit a tangible report including hypotheses, study design, and analysis plan. In the same vein, I think study design in the rest of the figure might also not be the best phrasing. Maybe the author could use the terminology used by COS (Stage 1 manuscript, and Stage 2 manuscript, see Details & Workflow tab of https://www.cos.io/initiatives/registered-reports). Relatedly, “Author submits the first version of the manuscript” in the first box after the ‘Manuscript (report)’ node maybe a confusing phrase because I think many researchers see the first version of the manuscript as the stage 1 report sent out for stage 1 review.”

      Thank you very much. Stage 1 and Stage 2 manuscripts look like a suitable labelling solution.

      34. “One pathway that is not included in Figure 2 is that authors can decide to not conduct the study when improvements are required. Relatedly, in the publish-review-curate model, is revising the manuscripts based on the reviews not optional as well? Especially in the case of 3a, authors can hardly be forced to make changes even though the reviews are posted on the platform.”

      All four models imply a certain level of generalization; thus, I tried to avoid redundant details. However, I have added this choice to the PRC model (now Model 4).

      35. “I think the author should discuss the importance of ‘open identities’ more. This factor is now not explicitly included in any of the models, while it has been found to be one of the main characteristics of peer review systems (Ross-Hellauer, 2017).”

      This part has been extended.

      36. “More generally, I was wondering why the author chose these three models and not others. What were the inclusion criteria for inclusion in the manuscript? Some information on the underlying process would be welcome, especially when claims like ‘However, we believe that journal-independent peer review is a special case of Model 3 (‘Publish-Review-Curate’).’ are made without substantiation.”

      The study included four generalized models of peer review that involved some level of abstraction.

      37. “Maybe it helps to outline the goals of the paper a bit more clearly in the introduction. This helps the reader to know what to expect.”

      The Introduction has been revised to include the goal and objectives.

      38. “The Modular Publishing section is not inherently related to peer review models, as you mention in the first sentence of that paragraph. As such, I think it would be best to omit this section entirely to maintain the flow of the paper. Alternatively, you could shortly discuss it in the discussion section but a separate paragraph seems too much from my point of view.”

      Modular publishing has been combined with registered reports into the fragmented publishing group of models, now in Section 5.

      39. “Labeling model 3 as post-publication review might be confusing to some readers. I believe many researchers see post-publication review as researchers making comments on preprints, or submitting commentaries to journals. Those activities are substantially different from the publish-review-curate model so I think it is important to distinguish between these types.”

      The label was changed to the Publish-Review-Curate model.

      40. “I do not think the conclusions drawn below Table 3 logically follow from the earlier text. For example, why are “all functions of scientific communication implemented most quickly and transparently in Model 3”? It could be that the entire process takes longer in Model 3 (e.g. because reviewers need more time), so that Model 1 and Model 2 lead to outputs quicker. The same holds for the following claim: ‘The additional costs arising from the independent assessment of information based on open reviews are more than compensated by the emerging opportunities for scientific pluralism.’ What is the empirical evidence for this? While I personally do think that Model 3 improves on Model 1, emphatic statements like this require empirical evidence. Maybe the author could provide some suggestions on how we can attain this evidence. Model 2 does have some empirical evidence underpinning its validity (see Scheel, Schijen, Lakens, 2021; Soderberg et al., 2021; Sarafoglou et al. 2022) but more meta-research inquiries into the effectiveness and cost-benefits ratio of registered reports would still be welcome in general.”

      The Discussion section has been substantially revised to address this point. While I acknowledge the current scarcity of empirical studies on innovative peer review models, I have incorporated a critical discussion of this methodological gap. I am grateful for the suggested literature on RRs, which I have now integrated into the relevant subsection.

      41. “What is the underlying source for the claim that openness requires three conditions?”

      I have made an effort to clarify within the text that this reflects my personal stance.

      42. “‘If we do not change our approach, science will either stagnate or transition into other forms of communication.’ (p. 2) I don’t think this claim is supported sufficiently strongly. While I agree there are important problems in peer review, I think would need to be a more in-depth and evidence-based analysis before claims like this can be made.”

      The sentence has been rephrased.

      43. “On some occasions, the author uses ‘we’ while the study is single authored.”

      This has been fixed.

      44. “Figure 1: The top-left arrow from revision to (re-)submission is hidden”

      I have updated Figure 1.

      45. “‘The low level of peer review also contributes to the crisis of reproducibility in scientific research (Stoddart, 2016).’ (p. 4) I assume the author means the low quality of peer review.”

      This has been fixed.

      46. “‘Although this crisis is due to a multitude of factors, the peer review system bears a significant responsibility for it.’ (p. 4) This is also a big claim that is not substantiated”

      I have paraphrased this sentence as “While multiple factors drive this crisis, deficiencies in the peer review process remain a significant contributor.” and added a footnote.

      47. “‘Software for automatic evaluation of scientific papers based on artificial intelligence (AI) has emerged relatively recently” (p. 5) The author could add RegCheck (https://regcheck.app/) here, even though it is still in development. This tool is especially salient in light of the finding that preregistration-paper checks are rarely done as part of reviews (see Syed, 2023)”

      Thank you very much, I have added this information.

      48. “There is a typo in last box of Figure 1 (‘decicion’ instead of ‘decision’). I also found typos in the second box of Figure 2, where ‘screns’ should be ‘screens’, and the author decision box where ‘desicion’ should be ‘decision’”

      This has been fixed.

      49. “Maybe it would be good to mention results blinded review in the first paragraph of 3.2. This is a form of peer review where the study is already carried out but reviewers are blinded to the results. See work by Locascio (2017), Grand et al. (2018), and Woznyj et al. (2018).”

      Thanks, I have added this (now Section 5.2).

      50. “Is ‘Not considered for peer review’ in figure 3b not the same as rejected? I feel that it is rejected in the sense that neither the manuscript not the reviews will be posted on the platform.”

      Changed into “Rejected”

      51. “‘In addition to the projects mentioned, there are other platforms, for example, PREreview12, which departs even more radically from the traditional review format due to the decentralized structure of work.’ (p. 11) For completeness, I think it would be helpful to add some more information here, for example why exactly decentralization is a radical departure from the traditional model.”

      I have extended this passage.

      52. “‘However, anonymity is very conditional - there are still many “keys” left in the manuscript, by which one can determine, if not the identity of the author, then his country, research group, or affiliated organization.’ (p.11) I would opt for the neutral ‘their’ here instead of ‘his’, especially given that this is a paragraph about equity and inclusion.”

      This has been fixed.

      53. “‘Thus, “closeness” is not a good way to address biases.’ (p. 11) This might be a straw man argument because I don’t believe researchers have argued that it is a good method to combat biases. If they did, it would be good to cite them here. Alternatively, the sentence could be omitted entirely.”

      I have omitted the sentence.

      54. “I would start the Modular Publishing section with the definition as that allows readers to interpret the other statements better.”

      Modular publishing has been combined with registered reports into the deconstructed publication group of models, now in Section 5, general definition added.

      55. “It would be helpful if the Models were labeled (instead of using Model 1, Model 2, and Model 3) so that readers don’t have to think back what each model involved.”

      All the models represent a kind of generalization, which is why non-detailed labels are used. The text labels may vary depending on the context.

      56. “Table 2: ‘Decision making’ for the editor’s role is quite broad, I recommend to specify and include what kind of decisions need to be made.”

      Changed into “Making accept/reject decisions”

      57. “Table 2: ‘Aim of review’ – I believe the aim of peer review differs also within these models (see the ‘schools of thought’ the author mentions earlier), so maybe a statement on what the review entails would be a better way to phrase this.”

      Changed into “What does peer review entail?”

      58. “Table 2: One could argue that the ‘object of the review’ in Registered Reports is also the manuscript as a whole, just in different stages. As such, I would phrase this differently.”

      The current wording fits your remark: “Manuscript in terms of study design and execution”

      Reviewer 4

      59. “Page 3: It’s hard to get a feel for the timeline given the dates that are described. We have peer review becoming standard after WWII (after 1945), definitively established by the second half of the century, an example of obligatory peer review starting in 1976, and in crisis by the end of the 20th century. I would consider adding examples that better support this timeline – did it become more common in specific journals before 1976? Was the crisis by the end of the 20th century something that happened over time or something that was already intrinsic to the institution? It doesn’t seem like enough time to get established and then enter crisis, but more details/examples could help make the timeline clear. Consider discussing the benefits of the traditional model of peer review.”

      This section has been extended.

      60. “Table 1 – Most of these are self-explanatory to me as a reader, but not all. I don’t know what a registered report refers to, and it stands to reason that not all of these innovations are familiar to all readers. You do go through each of these sections, but that’s not clear when I initially look at the table. Consider having a more informative caption. Additionally, the left column is “Course of changes” here but “Directions” in text. I’d pick one and go with it for consistency.”

      Table 1 has been replaced by Figure 2. I have also extended the text descriptions and added definitions.

      61. “With some of these methods, there’s the ability to also submit to a regular journal. Going to a regular journal presumably would instigate a whole new round of review, which may or may not contradict the previous round of post-publication review and would increase the length of time to publication by going through both types. If someone has a goal to publish in a journal, what benefit would they get by going through the post-publication review first, given this extra time?”

      Some of these platforms, e.g., F1000 and Lifecycle Journal, replace conventional journal publishing. Modular publishing allows for step-by-step feedback from peers. An important advantage of RRs over other peer review models lies in their capacity to enhance research efficiency. By conducting peer review at Stage 1, researchers gain the opportunity to refine their study design or data collection protocols before empirical work begins. Other models of review can offer critiques such as "the study should have been conducted differently" without an actionable opportunity for improvement. The key motivation for having my paper reviewed in MetaROR is the quality of peer review – I have never received so many comments, frankly! Moreover, platforms such as MetaROR usually have partnering journals.

      62. “There’s a section talking about institutional change (page 14). It mentions that openness requires three conditions – people taking responsibility for scientific communication, authors and reviewers, and infrastructure. I would consider adding some discussion of readers and evaluators. Readers have to be willing to accept these papers as reliable, trustworthy, and respectable to read and use the information in them. Evaluators such as tenure committees and potential employers would need to consider papers submitted through these approaches as evidence of scientific scholarship for the effort to be worthwhile for scientists.”

      I have omitted these conditions and employed Moore’s Technology Adoption Life Cycle. Thank you very much for your comment!

      63. “Based on this overview, which seems somewhat skewed towards the merits of these methods (conflict of interest, limited perspective on downsides to new methods/upsides to old methods), I am not quite ready to accept this effort as equivalent of a regular journal and pre-publication peer review process. I look forward to learning more about the approach and seeing this review method in action and as it develops.”

      The Discussion section has been substantially revised to address this point. While I acknowledge the current scarcity of empirical studies on innovative peer review models, I have incorporated a critical discussion of this methodological gap.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      Manuscript number: RC-2025-03083

      Corresponding author(s): David Fay

      General Statements

      We greatly appreciate the input of the four reviewers, all of whom carried out a careful reading of our manuscript, provided useful suggestions for improvements, and were enthusiastic about the study including its thoroughness and utility to the field. Because the reviewers required no additional experiments, we were able to address their comments in writing.

      However, in response to a comment from reviewer #4 we decided to add an additional new biological finding to our study given that our functional validation of proximity labeling targets was not extensive. Namely, we now show that a missense mutation affecting BCC-1, one of the top NEKL-MLT interactors identified by our proximity labeling screen, is a causative mutation (together with catp-1) in a strain isolated through a forward genetic screen for suppressors of nekl molting defects (new Fig 9C). This finding, combined with our genetic enhancer tests, further strengthens the functional relevance of proteins identified through our proximity labeling approach and highlights the synergy of proteomics combined with classical genetics.

      Positive statements from reviewers include: Reviewer #1: Overall, this is an outstanding study that will be of great interest to those interested in using proximity labeling to identify interactors of their favorite protein. The experiments are well executed and the data presented in a mostly clear manner.

      Reviewer #2: The key conclusions are convincing, and the work is rigorous. The work provides a clear roadmap to reproducing the data. The experiments are adequately replicated, and statistical analysis is adequate... In many papers, TurboID seems very trivial but this paper clearly highlights the limitations and will be an invaluable resource for labs that want to get proximity labeling established in their labs.

      Reviewer #3: Overall, the claims are solid and conclusions supported. The data and methods are substantial to enable reproducibility in other labs. The experiments have been repeated multiple times with particular attention to statistical analysis. ...This manuscript represents a methodological advance that will likely become an oft-cited reference for members of the C. elegans community and a springboard for other basic biomedical scientists wanting to adapt rigorous proximity labeling techniques to their system.

      Reviewer #4: Fay et al. present a solid, clear and comprehensive BioID-based proteomics study that takes into account and discusses decisive aspects for the (re)production and analysis of high-quality TurboID-based mass spectrometry data. Claims and conclusions are generally well and sufficiently supported by the presented data and illustrated with figures (throughout the text as well as with plenty of supplementary data)... Basic consideration and thoughts for the experimental design and MS data analysis are given in detail and can serve as another guideline for future studies.

      Based on these reviews and comments, we believe that our manuscript is suitable for publication in a high-impact journal.

      1. Point-by-point description of the revisions

      This section is mandatory. Please insert a point-by-point reply describing the revisions that were already carried out and included in the transferred manuscript.

      *Reviewer #1 (Evidence, reproducibility and clarity (Required)): *

      *Proximity labeling has become a powerful tool for defining protein interaction networks and has been utilized in a growing number of multicellular model systems. However, while such an approach can efficiently generate a list of potential interactors, knowledge of the most appropriate controls and standardized metrics to judge the quality of the data are lacking. The study by Fay systematically investigates these questions using the C. elegans NIMA kinase family members NEKL-2 and NEKL-3 and their known binding partners MLT-2, MLT-3 and MLT-4. The authors perform eight TurboID experiments each with multiple NEKL and MLT proteins and explore general metrics for assessing experimental outcomes as well as how each of the individual metrics correlates with one another. They also compare technical and biological replicates, explore strategies for identifying false positives and investigate a number of variations in the experimental approach, such as the use of N- versus C-terminal tags, depletion of endogenous biotinylated proteins, combining auxin-inducible degradation, and the use of gene ontology analysis to identify physiological interactors. Finally, the authors validate their findings by demonstrating that a number of the candidates identified functionally interact with NEKL-2 or components of the WASH complex. *

      Overall this is an outstanding study that will be of great interest to those interested in using proximity labeling to identify interactors of their favorite protein. The experiments are well executed and the data presented in a mostly clear manner. I really like this study (particularly because I plan to do a proximity labeling study of my own), but I did come away less than impressed with some of the analysis. This is a data-dense manuscript, and it appears to me that the authors tried to cover so much ground that in some cases very little insight was provided. For instance, the authors promote the use of data independent acquisition (DIA) as compared to the more commonly used data dependent acquisition (DDA). However the authors do not provide any analysis to indicate one approach is better than the other. Likewise the combined use of auxin-induced degradation and proximity labeling is explored but there is very little to take away from these experiments. Despite these issues, I am very enthusiastic about the study as a whole. Below I list major and minor concerns.

      Major concerns * 1. My biggest issue with the manuscript is that a lot is made of the use of data independent acquisition (DIA) as compared to the more commonly used data dependent acquisition (DDA). The authors perform experiments using DIA and DDA approaches but do not directly compare the outcomes. As a result there is really no way to know if one approach is better than the other. I would suggest the authors either perform the necessary analysis to compare the two approaches or tone down their promotion of DIA.* We agree and have scaled back any statements comparing DDA to DIA as our manuscript did not address this directly. We also now point out this caveat in our closing thoughts section, while referencing other studies that compared the two (lines 926-929). Our main point was to convey that DIA worked well for our proximity labeling studies but has seen little use by the model organism field. Surprisingly (to us), DIA was also considerably less expensive than DDA options.

      2. Line 75, The authors promote the use of data-independent acquisition (DIA) without defining what this approach is and how it differs from the more conventional data-dependent acquisition. As a non-mass spectroscopist, I found myself with lots of questions concerning DIA, what it is and how it differs from DDA. I think it would really be helpful to expand the description of DIA and its comparison with DDA in the introduction. As non-mass-spectroscopists ourselves, we understand the reviewer's point. Because the paper is quite long, we were trying to avoid non-essential information. We have now added some information to explain some of the key differences between DDA and DIA. We have also included references for readers who may want to learn more. (lines 77-80)

      Minor concerns: * Line 92 typo. I believe the authors meant to say NEKL-2-MLT-2-MLT-4. * Corrected. (line 95)

      Line 169. Is exogenous the correct word to use here? It suggests that you are talking about non-worm proteins, but I know you are not. Corrected. Changed to "Moreover, the detection of biotinylated proteins may be difficult if the bait-TurboID fusion is expressed at low levels..." (line 181).

      Line 177 typo (D) should be (C). Corrected. (line 1122)

      Figure 1C: Lucky Charms may sue you for infringement of their trademarked marshmallow treats. Thank you for picking up on this. The authors accept full responsibility for any resulting lawsuits.

      Figure 1D. The NEKL-2::TurboID band is indicated with a green triangle in the figure but the figure legend states that green triangles indicate mNG::TurboID control. I know this triangle is a shade off the triangle that indicates mNG::TurboID but it's really hard to see the difference. All of the differently colored triangles in panel F are unnecessary. I would either just pick one color for all non-control bait proteins or better yet, only use a triangle to point to bands that are not obvious. For instance I don't need the triangles that point to NEKL-2 -3 and -4 fusion proteins. These are just distracting. We understand the reviewer's point. We colored the triangles to match the colors used for the proteins in the figures. We have now added "bright green triangles with white outlines" (Fig 1 legend) to indicate the Pdpy-7::mNG::TurboID control and changed the triangles in the corresponding figures. Although we would be fine with removing or changing the triangles, we think that they may aid somewhat with clarity.

      Line 316: Conceivably, another factor that could contribute to the counterintuitive upregulation of some proteins in the N2 samples is related to the fusion proteins that are being expressed in the TurboID lines. A partially functional bait protein (one with a level of activity similar to nekl-2(fd81) that may not result in an obvious phenotype) could directly or indirectly affect gene expression leading to lower levels of a subset of proteins in the TurboID samples. The same could be said for fusion proteins with a gain-of-function effect. This is an interesting idea, and we tested this possibility by looking for consistent overlap between N2-up proteins between biological replicates of individual bait proteins. We now include a representative Venn diagram in S3C Fig to highlight this comparison. In summary, although we cannot rule out this possibility, our analysis did not support the widespread occurrence of this effect in our study. We also made certain that our statement regarding N2 up proteins was not too definitive. (lines 285-288)

      *Fig 3 B-E. I am a little confused how the data in these graphs is normalized. For instance, I would have expected that for NEKL-3 in panel B, that the normalized (log2) intensity value in N2 be set at 0 as it is for NEKL-2. Maybe I just don't have enough information on how these plots were generated. * The difference is that in the N2 sample, NEKL-3 was detected but NEKL-2 was not. The numbers themselves are assigned by the Spectronaut software used to quantify the DIA results but are not meaningful beyond indicating relative amounts (intensity values) of a given protein within an individual biological experiment. We've added some lines to the figure legend to make this clearer. (lines 1165-1169)

      *Figure 6C legend is not correct. * Corrected. (line 1214)

      Line 575: Figure reference should be Fig. S5G. The authors should check to make sure all references to supplemental figures include correct panel information. Corrected. (line 464) In addition, we have now gone through the manuscript and added panel numbers references where applicable. Note that the addition of a new supplemental file has shifted the numbering.

      Line 576. The authors reference a study by Artan and colleagues and report a weak correlation between their study and that of Artan. They reference figure S4 but it should be Fig S5H. Apologies and many thanks to the reviewer for catching these errors. (line 464)

      Line 652. The authors note that numerous proteins were present at substantially reduced levels in the mNG::TurboID samples and suggest that sticky proteins may have been outcompeted or otherwise excluded from beads incubated with the mNG::TurboID lysates. Why would sticky proteins only be a problem in these samples? The reasoning is not clear to me. The idea was that in the sample with very high levels of biotinylated proteins (mNG::TurboID), the surface of the beads might become saturated with high-affinity biotinylated proteins. This could prevent or outcompete the binding of random proteins that are not biotinylated but nevertheless have some affinity to the beads ("sticky" proteins). We have reworded this section to make this clearer. (lines 546-550)

      Line 745: The term "bait overlaps" is a bit vague. Ultimately, I figured out what it meant but it was not immediately obvious. We have changed this to "overlap between baits" and made this section clearer. (lines 624-628)

      *S7B Fig. Why is actin missing from the eluate? * In S7B we refer to the purified eluate as the "eluate", which may have caused some confusion. In other sections of the manuscript, we refer to the bead-bound proteins as the "purified eluate" (Figs 1 and 5). For the purified eluate a portion of the streptavidin beads are boiled in sample buffer to elute the bound proteins before running a western. Actin would not be expected in these samples because it's (presumably) not biotinylated in our samples and doesn't detectably bind the beads. This result was seen in all relevant westerns in S1 Data. For consistency, however, we've gone through all our files to make sure we consistently use the term "purified eluate" versus "eluate", which is less specific.

      *Line 873: The authors state the extent of overlap in GO terms between the various experiments and provide percentages. I tried to extract this information from Figure 8C and came up with different values. For instance, in the case of Molecular Function, they state that they observed a 54% overlap between NEKL-2 and NEKL-3 but in the Venn diagram in Figure 8C I see that the NEKL-2 and NEKL-3 experiments had 71 (25+46) GO terms in common. Out of 98 GO terms for NEKL-2 or 104 for NEKL-3 the percentage I got is closer to 72. Am I analyzing this correctly? * Thanks for checking this. We believe our method for calculating the percent overlap is correct. In the case of NEKL-2/NEKL-3 overlap for Molecular Function, there are 131 total unique terms, of which 71 overlap, giving a 54% overlap. In the case of NEKL-2/NEKL-3 overlap for Biological Process, however, we made an error in arithmetic (415 unique, 239 overlap), such that the correct percentage is 58%, which we have corrected in the text.
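
      The difference between the reviewer's figure and the authors' figure appears to come down to the choice of denominator: shared terms divided by one bait's own term count, versus shared terms divided by all unique terms across both baits. The short Python sketch below illustrates this using only the counts quoted in the exchange above. The function names are hypothetical and chosen for illustration; this is not the authors' analysis code.

```python
def percent_of_single_set(n_shared: int, n_in_set: int) -> float:
    """Shared terms as a percentage of one bait's own term count."""
    return 100.0 * n_shared / n_in_set


def percent_of_union(n_shared: int, n_union: int) -> float:
    """Shared terms as a percentage of all unique terms across both baits."""
    return 100.0 * n_shared / n_union


def union_size(n_a: int, n_b: int, n_shared: int) -> int:
    """Unique terms across two sets (inclusion-exclusion)."""
    return n_a + n_b - n_shared


# Molecular Function counts quoted in the exchange:
# 98 terms for NEKL-2, 104 for NEKL-3, 71 shared.
mf_union = union_size(98, 104, 71)                   # 131 unique terms
mf_reviewer = percent_of_single_set(71, 98)          # about 72%
mf_authors = percent_of_union(71, mf_union)          # about 54%

# Biological Process counts quoted in the response:
# 415 unique terms, 239 shared.
bp_authors = percent_of_union(239, 415)              # about 58%

print(f"MF, per-set denominator: {mf_reviewer:.0f}%")
print(f"MF, union denominator:   {mf_authors:.0f}%")
print(f"BP, union denominator:   {bp_authors:.0f}%")
```

      Dividing by the union gives a single symmetric value for each pair of baits, whereas dividing by a single set's size gives a different value depending on which bait is chosen as the reference.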

      *Reviewer #1 (Significance (Required)): *

      *Overall this is an outstanding study that will be of great interest to those interested in using proximity labeling to identify interactors of their favorite protein. The experiments are well executed and the data presented in a mostly clear manner. I really like this study (particularly because I plan to do a proximity labeling study of my own), but I did come away less than impressed with some of the analysis. This is a data-dense manuscript, and it appears to me that the authors tried to cover so much ground that in some cases very little insight was provided. For instance, the authors promote the use of data independent acquisition (DIA) as compared to the more commonly used data dependent acquisition (DDA). However the authors do not provide any analysis to indicate one approach is better than the other. Likewise the combined use of auxin-induced degradation and proximity labeling is explored but there is very little to take away from these experiments. Despite these issues, I am very enthusiastic about the study as a whole. *

      *Reviewer #2 (Evidence, reproducibility and clarity (Required)): *

      *This study expanded the use of data-independent acquisition-mass spectrometry (DIA-MS) in TurboID proximity-labeling proteomics to identify novel interactors of NEKL-2, NEKL-3, MLT-2, MLT-3, and MLT-4 complexes in C. elegans. The authors described several useful metrics to evaluate the quality of TurboID experiments, such as using the percentage of upregulated genes, the percentage of proteins present only in bait-TurboID experiments as compared to N2 controls, and the percentage of endogenously biotinylated carboxylases as internal controls. Further, the authors introduced methodological variability across 23 TurboID experiments and evaluated any improvement to the resulting data, such as N-terminally tagging bait proteins with TurboID, depleting endogenous carboxylases, and auxin-inducible degradation of known complex members. Finally, this study identified the kinase folding chaperone CDC-37 and the WASH complex component DDL-2 as novel interactors with the NEKL-MLT complexes through an RNAi-based enhancer approach following their identification by TurboID. *

      Major comments: * The key conclusions are convincing, and the work is rigorous. The work provides a clear roadmap to reproducing the data. The experiments are adequately replicated, and statistical analysis is adequate. We only have minor comments.*

      Minor comments: * •In the western blot in Fig 1 why does the mNG::Turbo have two bands? * Thank you for pointing this out. To our knowledge this is a breakdown product that was especially prevalent in replicate 3 (also see S1 Data), which we chose to show because all the NEKL-MLTs were clearly visible in this western. The expected size of the mNeonGreen::TurboID (including linker and tags) is ~68 kDa and our blots are roughly consistent with those of Artan et al., (2001). This lower band was not evident in Exp 8. We have now included a statement in the figure legend to indicate that the upper band is the full-length protein whereas the lower band is likely to be a breakdown product (lines 1141-1142).

      •Fig 2B is difficult to parse as a reader. Columns labeled "Upreg," "Downreg," "TurboID only," "N2 only," "Filter-1," "Filter-2," and "Epi %" could be moved to Supplemental. Fold change vs N2 could be represented as a bar chart, allowing for trends between fold change and the metrics Upreg %, Turbo %, and Carboxylase % to be seen more clearly. Further, rows headed "Carboxylase depletion," "DDA," and "Auxin treated" could be presented as separate panels to better match the distinct points made in the text. After serious consideration we have made several changes including the addition of S2 Fig, which may provide readers with a better visual representation of the bait and prey fold changes observed in all our experiments. However, we feel that the detailed data embedded in Fig 2 is the most concise and accurate means by which to convey our full results and is key to our methodological conclusions. As such we did not want to relegate this information to a supplemental table. We note that this figure was not found to be problematic by other reviewers, although we do understand the points made by this reviewer.

      •Line 179: in vivo should be italicized Because journals differ in their stylistic practices, we are currently waiting before doing our final formatting. We did keep our use of Latin phrases consistently non-italicized in the draft.

      •Lines 215-217: The comparison between Western blot expression levels and prior fluorescent reporter levels is unclear. Could be reformatted to make it clearer that relative expression of the different NEKL-MLTs in this study is consistent with prior data. We reformatted this sentence to improve clarity. (lines 205-207)

      *•Lines 267-268: The final line of the passage is unclear and can be removed. * This sentence has been removed.

      •Lines 311-313: This study is able to use the recovery of bait and known interactor proteins as internal controls to determine the quality of each experiment, but this may not always be the case for other users' experiments. The authors should comment on how Upreg %, a value influenced by many factors, can actually be used as a quality check when a bait protein has no known interactors. We have added language to highlight this point. (lines 344-348)

      *•Line 702: There is a [new REF] that should be removed * As described above, we have now included this finding on bcc-1 as part of this manuscript (Fig 9C).

      •The approach used mixed stage animals, but some genes oscillate or are transiently expressed. Please discuss cost-benefit of mixed stage vs syncing. This is an important point. We have added a discussion on the benefits and drawbacks of using mixed stages to the discussion. (lines 901-911)

      *•Authors were working on hypodermally expressed proteins. It would be valuable to discuss what tissues are amenable to TurboID, i.e., are there cases where there are few cells (anchor cell, glial sockets, etc.) in which it will be extremely challenging to perform this technique? * We agree that certain tissues/proteins will not be amenable to proximity labeling. We believe that we have addressed this point together with the above comment throughout the manuscript and now on lines 936-940.

      •Authors mention approaches such as nanobodies, split Turbo. Based on their experiences it would be valuable to add Discussion on strengths and weaknesses of these approaches to guide folks considering TurboID and DIA-MS experiments in C. elegans Because we have not tested these methods, we feel that we cannot provide a great deal of insight into these alternate approaches. We mention and reference these methods in the introduction so that readers are aware of them.

      *Reviewer #2 (Significance (Required)): *

      •Advance in technique: This study expands the use cases of data-independent acquisition MS method (DIA-MS) in C. elegans, which fragments all ions independent of the initial MS1 data. The benefits of this approach include better reproducibility across technical replicates and better recovery of low abundance peptides, which are critical for advancing our ability to capture weak and transient interactions.

      •The use of DIA-MS in this study has improved our understanding of the partners of these NEKL-MLTs in membrane trafficking, molting, and cell adhesion within the epidermis.

      •In many papers, TurboID seems very trivial but this paper clearly highlights the limitations and will be an invaluable resource for labs that want to get proximity labeling established in their labs.

      *Reviewer #3 (Evidence, reproducibility and clarity (Required)): *

      *Summary: *

      Fay and colleagues perform a series of proximity labeling experiments in C. elegans followed by thorough and rational analysis of the resulting biotinylated proteins identified by LC-MS/MS. The overall goals of the study are to evaluate different techniques and provide practical guidance on how to achieve success. The major takeaways are that integration of data-independent acquisition (DIA) along with comparison of endogenously tagged TurboID alleles to soluble TurboID expressed in the same tissue results in improved detection of bona-fide interactors and reduced numbers of false-positives.

      *Major comments: *

      Overall the claims are solid and conclusions supported. The data and methods are substantial to enable reproducibility in other labs. The experiments have been repeated multiple times with particular attention to statistical analysis. I have no major concerns with the manuscript and focus primarily on improving the accessibility of this important contribution to the scientific community. As such, I suggest that the authors:

      1) Provide more explanation of and rationale for using DIA. This is not yet a standard technique and most basic biomedical scientists will be unaware of the jargon. As I expect many labs in the C. elegans community and beyond will be interested in the guidance provided in this manuscript, the introduction offers a great opportunity to bring the reader up to speed, as opposed to sending them to the complicated proteomics analysis literature. We have added some additional context (lines 77-80) as well as new references. We note that getting into the technical differences between DIA and DDA, beyond what we briefly mention, would take a substantial amount of space, may not be of interest to many readers, and can be found through standard internet and (sigh) AI-based searches.

      *2) Provide a better overview of the various protocols tested (Experiments 1-8). Maybe at the beginning of the results, and maybe with an accompanying schematic. As currently written, it is difficult to figure out details regarding how the experiments vary and why. * We have now added a short paragraph to better inform the reader at the front end regarding the major experiments. (lines 139-146).

      3) As to be expected, expression of TurboID tags at endogenous levels via low abundance proteins in a complex multicellular system results in somewhat weak signals that flirt with the limit of detection. Perhaps by combining tagged alleles within the same complex (NEKL-3/MLT-3 or NEKL-2/MLT-2/MLT-4) the signals could be boosted? Tandem tags, either on one end or multiple ends of proteins might help as well. As the authors point out, a benefit of tagging the two NEKL-MLT complexes is that there are strong loss-of-function phenotypes (lethal molting defects) to help evaluate whether a tagging strategy results in a non-functional complex. THESE EXPERIMENTS ARE OPTIONAL and might simply be discussed at the authors' discretion. These are interesting ideas that we have now incorporated into our discussion. (lines 936-940)

      *Minor Comments: *

      *1) Figure 3A is cropped on the right. * Thank you for catching this. Corrected.

      *2) Better define [new REF] on line 702. * We have added new results (Fig 9C), obviating the need for this reference.

      **Referee cross-comments**

      Overall, I am in agreement with, and supportive of, the other reviewers' comments.

      *Reviewer #3 (Significance (Required)): *

      *Significance: *

      Proximity labeling is often proposed as a technique to determine interaction networks of proteins in vivo, but in practice it remains challenging for most labs to execute a successful experiment, especially within the context of multicellular model organisms. Fay and colleagues provide a much needed roadmap for how to best approach proximity labeling experiments in C. elegans that will likely apply to other model systems.

      They establish a rigorous approach by choosing to endogenously tag components of two essential NEKL-MLT complexes required for C. elegans molting. These complexes are relatively low abundance as they are only expressed in a single cell type, the hyp7 epidermal syncytium. In addition, as inactivation of any member of the complexes results in molting defects, they have a powerful selection for functional tags. Thus, they have set a high bar for themselves in order to discern whether a given variation on the experimental approach results in improved detection of interactors and fewer false positives.

      *Potential areas for improvement include lowering the expression level of the skin-specific soluble TurboID used to determine non-specific biotinylation events. This control results in much higher levels of biotinylation compared to the TurboID-tagged NEKL-MLT alleles and likely affects their analysis, which they openly admit. In addition, to reduce the high level of background biotinylation signals generated by endogenous carboxylases, they adopt a depletion strategy pioneered by other researchers but this does not offer major improvements in detection of specific signals. The source of these conflicting results remains to be determined. It is also curious that auxin-inducible degradation of components of the NEKL-MLT complexes did not robustly alter the resulting biotinylating capacity of other members. This approach should be evaluated in subsequent studies. Finally, as mentioned in Major Comment #3 (above), it would be interesting to see if combining TurboID tags within the same complex might improve signal-to-background ratios. *

      This manuscript represents a methodological advance that will likely become an oft-cited reference for members of the C. elegans community and a springboard for other basic biomedical scientists wanting to adapt rigorous proximity labeling techniques to their system. I am a cell biologist that uses a variety of genetic, molecular and biochemical approaches, mostly centered around C. elegans. I have used LC/MS-MS in our studies but have relatively little expertise in evaluating all aspects of proteomic pipelines.

      *Reviewer #4 (Evidence, reproducibility and clarity (Required)): *

      *Fay et al. describe an extensive proximity labeling BioID study in C. elegans with TurboID and DIA-LCMS analysis. They chose the NEKL-2/3 kinases and their known interactors MLT-2/3/4 as TurboID-fused bait proteins (C- and partially N-terminal fusions encoded from CRISPR-mediated genome edited genes). With eight biological replicates (and three to four technical replicates each) and with the unmodified wildtype or mNeonGreen-TurboID expressing worms as controls, a comprehensive dataset was generated. Although starting from quite different abundances of the bait-fusions within the cell lysates, all bait proteins and known complex-binding partners were convincingly enriched with capturing streptavidin beads after only one hour of incubation with the lysate. This confirms the general applicability of the TurboID-BioID approach in C. elegans. The BioID method typically gives rise to large proteomics datasets (up to more than a thousand proteins identified after biotin capture) with several tens to hundreds of enriched proteins (against negative control strains) as potential proteins that localize proximal to the bait-TurboID protein. However, substantial variations of candidates between biological replicates are frequently observed in BioID experiments. The authors scrutinized their dataset towards indicative metrics, filters and cutoffs in order to separate high-confidence from low-confidence candidates. With the workflow applied, the authors narrow down the number of candidates to 15 proteins that were grouped in four functional groups reasonably associated with NEKL-MLT function. *

      Successful BioID experiments depend on reliable enrichment quantification with mass spectrometry using control cell lines that require a carefully bait-tailored design. Those must adequately express TurboID controls matching the abundance of the bait-TurboID fusion protein and its biotinylation activity. After affinity capture, sample preparation and LCMS data acquisition there is no silver bullet towards the identification of true bait neighbors. Fay et al. elaborately describe their considerations and workflow towards high-confidence candidates. The workflow considered (i) data analysis with Volcano plots to account for statistical reproducibility of biological replicates against negative controls, (ii) fraction of proteins only detected in the positive or negative controls thus evading the fold-enrichment quantification approach, (iii) evaluation of variations in carboxylase enrichment as a measure for variations in the general biotin capture quality between experiments, (iv) an assessment of technical reproducibility with scatter plots and Venn diagrams, (v) exclusion of potentially false positives, e.g. promiscuously biotinylated non-proximal proteins, through comparisons with control worms expressing a non-localized mNeonGreen-TurboID fusion protein, (vi) batch effects, (vii) the impact of endogenous biotinylated carboxylases through depletion, (viii) gene ontology analysis of enriched proteins, (ix) weighing data according to the quality of individual experiments according to the aforementioned metrics, and finally (x) genetic interaction studies to functionally associate high-confidence candidates with the bait.

      *Major comments: *

      Fay et al. present a solid, clear and comprehensive BioID-based proteomics study that takes into account and discusses decisive aspects for the (re)production and analysis of high-quality TurboID-based mass spectrometry data. Claims and conclusions are generally well and sufficiently supported by the presented data and illustrated with figures (throughout the text as well as with plenty of supplementary data). However, although the authors claim to seek substrates of the kinase complex, they drew no further attention to the phosphorylation status of the captured proteins. Haven't the MS data been analyzed in this respect? Information regarding this issue would enhance the manuscript. Data generation and method description appear reproducible for readers. Also, the statistical analyses appear adequate. The authors should also consider depositing their MS raw and analysis data in a public repository (e.g. PRIDE) for future reviewing processes and as reference data for readers and followers. Our raw MS data have been deposited by the Arkansas Proteomics Facility. I have followed up to ensure that they are publicly available.

      *Minor comments: *

      The authors should combine supplementary data files to reduce the number of single files readers have to deal with.

      We have combined these files as suggested.

      The authors should avoid the term "upregulation" or "increased biotinylation" when capture enrichment is meant.

      We agree with the reviewer's point. We now use the terms enriched versus reduced or up versus down, depending on the context, and clearly define these terms. These changes have been incorporated throughout the manuscript.

      *Reviewer #4 (Significance (Required)): *

      The manuscript presents a robust BioID proteomics screening for co-localizing proteins of NEKL-2/3 kinases and their known interactors MLT-2/3/4. The validation of their functional interactions, and whether the protein candidates are phosphorylation substrates or not, remains open and is announced for upcoming manuscripts. The knowledge gain in terms of molecular mechanisms with NEKL-2/3 MLT-2/3/4 involvement in C. elegans is therefore limited to a table of - promising - interacting candidates that have to be studied further. Information about the phosphorylation status of the captured proteins from the MS data is not given. However, knowing the protein candidates will be of interest for groups working with these complexes (or the identified potentially interacting proteins) either in C. elegans or any other organism. Also, in-depth proteomics screenings with novel approaches such as BioID have to be established for individual organisms. For C. elegans there is only one prior BioID publication (Holzer et al. 2022). Many of the aspects discussed here have also been addressed earlier for BioIDs in other organisms and are not principally new. However, the presented study can be of conceptual interest for labs delving into or entangled with the BioID method in C. elegans or other organisms. The study addresses especially proteomics groups working on protein-protein interactions using proximity labeling/MS approaches. Basic considerations and thoughts for the experimental design and MS data analysis are given in detail and can serve as another guideline for future studies.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This work starts with the observation that embryo polarization is asynchronous starting at the early 8-cell stage, with early polarizing cells being biased towards producing the trophectoderm (TE) lineage. They further found that reduced CARM1 activity and upregulation of its substrate BAF155 promote early polarization and TE specification. This piece of evidence connects to the previous finding that Carm1 heterogeneity at the 4-cell stage guides later cell lineages: the higher Carm1-expressing blastomeres are biased towards the ICM lineage. Thus, this work provides a link between asymmetries at the 4-cell stage and polarization at the 8-cell stage, providing a cohesive explanation regarding the first lineage allocation in mouse embryos.

      Strengths:

      In addition to what has been put in the summary, the advanced 3D image-based analysis has found that early polarization is associated with a change in cell geometry in blastomeres, regarding the ratio of the long axis to the short axis. This is considered a new observation that has not been identified previously.

      Weaknesses:

      Although the microinjection-based method for overexpression/deletion of proteins has been shown to be effective in early embryo settings and has been widely used, it may not fully represent the in vivo situation in some cases, compared to other strategies such as the use of knock-in mice. This is a minor weakness; it would be good to include some sentences in the discussion on the potential caveats.

      We thank the reviewer for their insightful summary of our work, and their adjudication on the novelty of our research. We agree with the reviewer that microinjection-based methods, whilst being the standard and widely used in the field, have their weaknesses. In this study, we have primarily used microinjection of previously tested and known constructs which may help mitigate these concerns, and have referenced numerous studies in which these constructs have been used and tested. Nevertheless, the authors are aware of this drawback and have tried to address this previously in other research using novel artificial intelligence techniques (Shen and Lamba et al., 2022 – cited in the manuscript) and this continues to be an active area of investigation for us.

      Reviewer #2 (Public review):

      Summary:

      In this study, Lamba and colleagues suggest a molecular mechanism to explain cell heterogeneity in cell specification during pre-implantation development. They show that embryo polarization is asynchronous. They propose that reduced CARM1 activity and upregulation of its substrate BAF155 promote early polarization and trophectoderm specification.

      Strengths:

      The authors use appropriate and validated methodology to address their scientific questions. They also report excellent live imaging. Most of the data are accompanied by careful quantifications.

      Weaknesses:

      I think this manuscript requires some more quantification, a larger number of embryos in the evaluations, and a clear statement of the number of embryos evaluated per experiment.

      We thank the reviewer for these thoughtful comments on our work, their kind assessment of the strength of our research, and their notes on the weaknesses. We have replied to their points raised below.

      Here are some points:

      (1) It should be clearly stated in all figure legends and in the text how many cells from how many embryos were analyzed.

      We appreciate this comment and have provided detailed quantification for every experiment in the paper, stating the numbers of embryos (if a whole embryo level experiment) or blastomeres used for statistical tests and displayed in the graph.

      (2) I think that the number of embryos is sometimes too low. Mouse embryos are easily accessible and the methods used are well established in this lab, so the authors should make an effort to have at least 10/15 embryos per experiment. For example "In agreement with this, hybridization chain reaction (HCR) RNA fluorescence in situ hybridization of early 8-cell stage embryos revealed that the number of CDX2 mRNA puncta was higher in polarized blastomeres with a PARD6-positive apical domain than in unpolarized blastomeres, for 5 out of 6 embryos with EP cells (Figure 3A, B)", or the data for Figure 4, where we know how many cells but not how many embryos.

      We appreciate the reviewer’s comment regarding the number of embryos used in the hybridization chain reaction (HCR) experiment. We agree that increasing the number of embryos could, in principle, further add statistical power. However, both first authors have since left the lab to begin postdoctoral training or join a company, and it is not feasible for us to generate additional embryos at this stage.

      Importantly, we believe the number of embryos included in the current manuscript is sufficient to support our conclusions, especially when considered in the context of the broader experimental design, the timing of the study, and our ethical commitment to minimizing animal use.

      Notably, the initial HCR experiment targeting Cdx2 mRNA served as a key indication that prompted further investigation of CDX2 at the protein level. These follow-up experiments were conducted with increased numbers of embryos and/or cells and are presented in Figure 3 and the associated supplementary figures (we now have 124 cells (including 23 EP cells) from 16 embryos), thereby strengthening and confirming the conclusion suggested by the HCR data.

      (3) It would be useful to see in Figure 4 an example of asymmetric cell division as done for symmetric cell division in panel 4B. This could really help the reader to understand how the authors assessed this.

      We used live imaging to track cell division patterns. Cells expressing RFP-tagged polarity proteins were observed during division to identify the resulting daughter cells. Immediately after cytokinesis, we assessed the polarity status of each daughter cell. If both daughter cells were polarized, the division was classified as symmetric; if only one was polarized, it was classified as asymmetric.

      Author response image 1.

      8-cell stage embryos expressing Ezrin-RFP (fire colour) were imaged during the 8-16 cell stage division. Top panel arrows indicate a symmetric cell division in which the polarity domain became partitioned into both daughter cells; the bottom panel indicates an asymmetric division in which the polarity domain is inherited by only one of the two daughter cells.

      (4) Figure 5C there is a big disproportion of the number of EP and LP identified. Could the authors increase the number of embryos quantified and see if they can increase EP numbers?

      We thank the reviewer for this comment and want to clarify an important detail: EP cells are a phenomenon with an average cellular frequency of less than 10% as compared to LP cells (the other 90%). Therefore, when investigating natural embryo development without bias or exclusion, there will likely be an imbalance in the number of EP and LP cells, as is the case for Figure 5C. In this case, morphological differences and clear statistical significance were seen between the shape of EP and LP cells within the cells quantified, and therefore we decided not to expend further mice for this particular experiment. We agree, however, that in most cases additional embryos would help strengthen our conclusions further.

      (5) Could the authors give more details about how they mount the embryos for live imaging? With agarose or another technique? In which dishes? Overlaid with how much medium and oil? This could help other labs that want to replicate the live imaging in their labs. Also, was it a z-stack analysis? If yes, how many um per stack? Ideally, if they also know the laser power used (at least a range) it would be extremely useful.

      We thank the reviewer for this comment and have provided additional detail here and in the Methods section. For live imaging of our embryos, we used glass-bottom 35 mm dishes. We fixed a small cut square of nylon mesh (5 mm to 1 cm width and height) onto the centre of the dish using silicon; the mesh served as a grid (diameter of approximately 150 micrometres) for deposition of embryos. After drying of the silicon (overnight) and washing with water, the grid was overlaid with a drop of 100 microlitres of KSOM and then covered with mineral oil until this KSOM drop was submerged. After incubation under conditions for live imaging, single embryos were deposited in each ‘well’ of the grid before being placed in the microscope, which was equilibrated at the correct temperature and CO2.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review)

      As this code was developed for use with a 4096 electrode array, it is important to be aware of double-counting neurons across the many electrodes. I understand that there are ways within the code to ensure that this does not happen, but care must be taken in two key areas. Firstly, action potentials traveling down axons will exhibit a triphasic waveform that is different from the biphasic waveform that appears near the cell body, but these two signals will still be from the same neuron (for example, see Litke et al., 2004 "What does the eye tell the brain: Development of a System for the Large-Scale Recording of Retinal Output Activity"; figure 14). I did not see anything that would directly address this situation, so it might be something for you to consider in updated versions of the code.

      Thank you for this comment. We have added a routine to SpikeMAP to remove highly correlated spikes detected within a given spatial radius of each other. The following was added to the main text (line 149):

      “As an additional verification step, SpikeMAP allows the computation of spike-count correlations between putative neurons located within a user-defined radius. Signals that exceed a defined threshold of correlation can be rejected as they likely reflect the same underlying cell.”
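
      The verification step quoted above can be illustrated with a minimal sketch. It is not taken from the SpikeMAP code base: the function name, the argument names, and the rule of keeping whichever unit of a pair has more spikes are all assumptions made for illustration.

```python
import numpy as np

def flag_duplicate_units(spike_counts, positions, radius, max_corr):
    """Return indices of units to reject as likely duplicates (hypothetical helper).

    spike_counts: (n_units, n_bins) array of binned spike counts.
    positions:    (n_units, 2) array of putative unit locations.
    radius:       only pairs closer than this distance are compared.
    max_corr:     pairs whose Pearson correlation exceeds this are flagged.
    """
    spike_counts = np.asarray(spike_counts, dtype=float)
    positions = np.asarray(positions, dtype=float)
    corr = np.corrcoef(spike_counts)  # pairwise Pearson correlation of count vectors
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    totals = spike_counts.sum(axis=1)

    rejected = set()
    n_units = len(spike_counts)
    for i in range(n_units):
        for j in range(i + 1, n_units):
            if i in rejected or j in rejected:
                continue
            if dist[i, j] <= radius and corr[i, j] > max_corr:
                # Assumed tie-break: keep the unit with more spikes, drop the other.
                rejected.add(j if totals[i] >= totals[j] else i)
    return sorted(rejected)
```

      Both the radius and the correlation threshold are left to the caller here, mirroring the "user-defined radius" and "defined threshold" in the quoted text; suitable values depend on the electrode pitch and the bin width.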

      Secondly, spike shapes are known to change when firing rates are high, like in bursting neurons (Harris, K.D., Hirase, H., Leinekugel, X., Henze, D.A. & Buzsáki, G. Temporal interaction between single spikes and complex spike bursts in hippocampal pyramidal cells. Neuron 32, 141-149 (2001)). I did not see this addressed in the present version of the manuscript.

      We have added a routine to SpikeMAP that computes population spike rates to verify stationarity over time. We have also added a routine to identify putative bursting neurons through a Hartigan statistical dip test applied to the inter-spike distribution of individual cells.

      We added the following (line 204):

      “Further, SpikeMAP contains a routine to perform a Hartigan statistical dip test on the inter-spike distribution of individual cells to detect putative bursting neurons.”

      Another area for possible improvement would be to build on the excellent validation experiments you have already conducted with parvalbumin interneurons. Although it would take more work, similar experiments could be conducted for somatostatin and vasoactive intestinal peptide neurons against a background of excitatory neurons. These may have different spike profiles, but your success in distinguishing them can only be known if you validate against ground truth, like you did for the PV interneurons.

      We have added the following (line 326):

      “future work could include different inhibitory interneurons such as somatostatin (SOM) and vasoactive intestinal polypeptide (VIP) neurons to improve the classification of inhibitory cell types. Another avenue could involve applying SpikeMAP on artificially generated spike data (Buccino & Einevoll 2021; Laquitaine et al., 2024).”

      Reviewer #2 (Public review)

      Summary:

      While I find that the paper is nicely written and easy to follow, I find that the algorithmic part of the paper is not really new and should have been more carefully compared to existing solutions. While the GT recordings to assess the possibilities of a spike sorting tool to distinguish properly between excitatory and inhibitory neurons are interesting, spikeMAP does not seem to bring anything new to state-of-the-art solutions, and/or, at least, it would deserve to be properly benchmarked. I would suggest that the authors perform a more intensive comparison with existing spike sorters.

      Thank you for your insightful comment. A full comparison between SpikeMAP and related methods is provided in Table 1. As can be seen, SpikeMAP is the only method listed that performs E/I sorting on large-scale multielectrodes. Nonetheless, several aspects of SpikeMAP included in the spike sorting pipeline do overlap with existing methods, as these constitute necessary steps prior to performing E/I identification. These steps are not novel to the current work, nor do they constitute rigid options that cannot be substituted by the user. Rather, we aim to offer SpikeMAP users the option to combine E/I identification with preliminary steps performed either through our software or through another package of their choosing. For instance, preliminary spike sorting could be done through Kilosort before importing the spike data into SpikeMAP for E/I identification. To allow greater flexibility, we have now modularized our suite so that E/I identification can be performed as a stand-alone module. We have clarified the text accordingly (line 317):

      “While SpikeMAP is the only known method to enable the identification of putative excitatory and inhibitory neurons on high-density multielectrode arrays (Table 1), several aspects of SpikeMAP included in the spike sorting pipeline (Figure 1) overlap with existing methods, as these constitute required steps prior to performing E/I identification. To give users the ability to integrate SpikeMAP with existing toolboxes, we provide a modularized suite of protocols so that E/I identification can be performed separately from preliminary spike sorting steps. In this way, a user could carry out spike sorting through Kilosort or another package before importing their data to SpikeMAP for E/I identification.”

      Weaknesses:

      (1) The global workflow of spikeMAP, described in Figure 1, seems to be very similar to that of Hilgen et al. 2020 (10.1016/j.celrep.2017.02.038). Therefore, the first question is what is the rationale of reinventing the wheel, and not using tools that are doing something very similar (as mentioned by the authors themselves). I have a hard time, in general, believing that spikeMAP has something particularly special, given its Methods, compared to state-of-the-art spike sorters.

      The paper by Hilgen et al. is reported in Table 1. As seen, while this paper employs optogenetics, it does not target inhibitory (e.g., PV) cells. We have added the following clarification (line 82):

      “Despite evidence showing differences in action potential kinetics for distinct cell-types as well as the use of optogenetics (Hilgen et al., 2017), there exist no large-scale validation efforts, to our knowledge, showing that extracellular waveforms can be used to reliably distinguish cell-types.”

      This is why, at the very least, the title of the paper is misleading, because it lets the reader think that the core of the paper will be about a new spike sorting pipeline. If this is the main message the authors want to convey, then I think that numerous validations/benchmarks are missing to assess first how good spikeMAP is, with reference to spike sorting in general, before deciding if this is indeed the right tool to discriminate excitatory vs inhibitory cells. The GT validation, while interesting, is not enough to entirely validate the paper. The details are a bit too scarce for me, or would deserve to be better explained (see other comments after).

      We thank the reviewer for this comment, and have amended the title as follows:

      “SpikeMAP: An unsupervised pipeline for the identification of cortical excitatory and inhibitory neurons in high-density multielectrode arrays with ground-truth validation”

      (2) Regarding the putative location of the spikes, it has been shown that the center of mass, while easy to compute, is not the most accurate solution [Scopin et al, 2024, 10.1016/j.jneumeth.2024.110297]. For example, it has an intrinsic bias for finding positions within the boundaries of the electrodes, while some other methods, such as monopolar triangulation or grid-based convolution, might have better performance. Can the authors comment on the choice of the Center of Mass as a unique way to triangulate the sources?

      We agree with the reviewer that the center-of-mass algorithm carries limitations that are addressed by other methods. To address this issue, we have included two additional protocols in SpikeMAP to perform monopolar triangulation and grid-based convolution, offering additional options for users of the package. The text has been clarified as follows (line 429):

      “In addition to center-of-mass triangulation, SpikeMAP includes protocols to perform monopolar triangulation and grid-based convolution, offering additional options to estimate putative soma locations based on waveform amplitudes.”
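
      For readers unfamiliar with the center-of-mass estimate discussed in this exchange, a minimal sketch follows. The function and argument names are hypothetical and the code is not drawn from SpikeMAP; it only shows the amplitude-weighted average of electrode coordinates that the term usually denotes.

```python
import numpy as np

def center_of_mass(electrode_xy, amplitudes):
    """Amplitude-weighted mean of electrode positions (illustrative helper).

    electrode_xy: (n_electrodes, 2) coordinates of the electrodes that
                  recorded the unit.
    amplitudes:   (n_electrodes,) peak waveform amplitude on each electrode;
                  absolute values are used as weights.
    """
    electrode_xy = np.asarray(electrode_xy, dtype=float)
    weights = np.abs(np.asarray(amplitudes, dtype=float))
    return (weights[:, None] * electrode_xy).sum(axis=0) / weights.sum()
```

      Because the weights are non-negative, the estimate is a convex combination of the electrode coordinates and therefore cannot fall outside the area spanned by the contributing electrodes, which is the bias the reviewer describes.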

      (3) Still in Figure 1, I am not sure I really see the point of Spline Interpolation. I see the point of such a smoothing, but the authors should demonstrate that it has a key impact on the distinction of Excitatory vs. Inhibitory cells. What is special about the value of 90kHz for a signal recorded at 18kHz? What is the gain with spline enhancement compared to without? Does such a value depend on the sampling rate, or is it a global optimum found by the authors?

      We clarified the text as follows (line 183):

      “While we found that a resolution of 90 kHz provided a reasonable estimate of spike waveforms, this value can be adjusted as a parameter in SpikeMAP.”
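
      As an illustration of what upsampling a waveform from 18 kHz to 90 kHz by spline interpolation involves, here is a minimal sketch using SciPy's `CubicSpline`. The function name and defaults are assumptions, and the spline type used by SpikeMAP is not stated in this response.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def upsample_waveform(waveform, fs_in=18_000.0, fs_out=90_000.0):
    """Resample a spike waveform onto a finer time grid with a cubic spline.

    Returns the new time axis (seconds) and the interpolated voltages.
    The output grid spans exactly the same time window as the input.
    """
    waveform = np.asarray(waveform, dtype=float)
    t_in = np.arange(waveform.size) / fs_in
    n_out = int(round((waveform.size - 1) * fs_out / fs_in)) + 1
    t_out = np.arange(n_out) / fs_out
    return t_out, CubicSpline(t_in, waveform)(t_out)
```

      With these defaults every fifth output sample coincides with an original sample, so the interpolation adds points between measurements without altering the measured values. Whether the finer grid improves E/I separation is the empirical question the reviewer raises and is not settled by the sketch.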

      (4) Figure 2 is not really clear, especially panel B. The choice of the time scale for the B panel might not be the most appropriate, and the legend filtered/unfiltered with a dot is not clear to me in Bii.

      We apologize for the rendering issues in the Figures that occurred during conversion into PDF format. We have now ensured that all figures are properly displayed.

      In panel E, the authors are making two clusters with PCA projections on single waveforms. Does this mean that the PCA is only applied to the main waveforms, i.e. the ones obtained where the amplitudes are peaking the most? This is not really clear from the methods, but if this is the case, then this approach is a bit simplistic and does not really match state-of-the-art solutions. Spike waveforms are quite often, especially with such high-density arrays, covering multiple channels at once, and thus the extracellular patterns triggered by the single units on the MEA are spatio-temporal motifs occurring on several channels. This is why, in modern spike sorters, the information in a local neighbourhood is often kept to be projected, via PCA, on the lower-dimensional space before clustering. Information on a single channel only might not be informative enough to disambiguate sources. Can the authors comment on that, and what is the exact spatial resolution of the 3Brain device? The way the authors are performing the SVD should be clarified in the methods section. Is it on a single channel, and/or on multiple channels in a local neighbourhood?

      We agree with the reviewer that it would be useful to have the option of performing PCA on several channels at once, since spikes can occur at several channels at the same time. We have now added a routine to SpikeMAP that allows users to define a radius around individual channels prior to performing PCA. The text was clarified as follows (line 131):

      “The SpikeMAP suite also offers a routine to select a radius around individual channels in order to enter groups of adjacent channels in PCA.”

      (5) About the isolation of the single units, here again, I think the manuscript lacks some technical details. The authors are saying that they are using a k-means cluster analysis with k=2. This means that the authors are explicitly looking for 2 clusters per electrode? If so, this is a really strong assumption that should not be held in the context of spike sorting, because, since it is a blind source separation technique, one can not pre-determine in advance how many sources are present in the vicinity of a given electrode. While the illustration in Figure 2E is ok, there is no guarantee that one can not find more clusters, so why this choice of k=2? Again, this is why most modern spike sorting pipelines do not rely on k-means, to avoid any hard-coded number of clusters. Can the authors comment on that?

      We clarified the text as follows (line 135):

      “In SpikeMAP, the optimal number of k-means clusters can be chosen by a Calinski-Harabasz criterion (Calinski and Harabasz, 1974) or pre-selected by the user.”
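
      A minimal sketch of selecting the number of clusters with the Calinski-Harabasz criterion is given below. It is illustrative only: the plain k-means loop, the deterministic farthest-point seeding, and all function names are assumptions, not the SpikeMAP implementation. The criterion itself is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom, and the candidate k with the highest score is chosen.

```python
import numpy as np

def farthest_point_init(points, k):
    """Deterministic seeding (assumed): start far from the mean, then spread out."""
    first = np.linalg.norm(points - points.mean(axis=0), axis=1).argmax()
    centers = [points[first]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    return np.array(centers)

def kmeans(points, k, n_iter=100):
    """Plain Lloyd iterations; returns one integer label per point."""
    centers = farthest_point_init(points, k)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def calinski_harabasz(points, labels):
    """(between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))."""
    ids = np.unique(labels)
    n, k = len(points), len(ids)
    overall = points.mean(axis=0)
    between = within = 0.0
    for c in ids:
        members = points[labels == c]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall) ** 2)
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))

def choose_k(points, candidates=range(2, 6)):
    """Score each candidate k and return the best one with all scores."""
    scores = {k: calinski_harabasz(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores
```

      In this sketch `points` would be the PCA projections of the waveforms recorded on one channel or neighbourhood. The criterion is undefined for a single cluster, so it cannot by itself indicate that an electrode sees only one unit.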

      (6) I'm surprised by the linear decay of the maximal amplitude as a function of the distance from the soma, as shown in Figure 2H. Is it really what should be expected? Based on the properties of the extracellular media, shouldn't we expect a power law for the decay of the amplitude? This is strange that up to 100um away from the soma, the max amplitude only dropped from 260 to 240 uV. Can the authors comment on that? It would be interesting to plot that for all neurons recorded, in a normed manner V/max(V) as function of distances, to see what the curve looks like.

      We added Supplemental Figure 1 showing the drop in voltage over all putative somas (N=1,950) of one recording, after excluding somas with an increase in voltage away from the electrode peak and computing normed values V/max(V). We see a distribution of slopes as well as intercepts across somas, showing some variability across recording sites. As the reviewer suggests, it is possible that a power-law describes these data better than a linear function, and this would need to be investigated further by quantitatively comparing the fit of these functions.
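
      One simple way to carry out the comparison mentioned here is sketched below, under stated assumptions: the function name is hypothetical, distances and normalized voltages must be strictly positive (so the peak electrode at distance zero is excluded), and the power law is fitted as a straight line in log-log space, which minimizes log-scale rather than raw residuals. Both models have two free parameters, so their residual sums of squares can be compared directly.

```python
import numpy as np

def compare_decay_fits(distance, v_norm):
    """Fit V = a + b*d and V = c * d**slope; report raw-scale residuals."""
    distance = np.asarray(distance, dtype=float)
    v_norm = np.asarray(v_norm, dtype=float)

    # Linear model, ordinary least squares.
    b, a = np.polyfit(distance, v_norm, 1)
    rss_linear = np.sum((v_norm - (a + b * distance)) ** 2)

    # Power law, fitted as a line in log-log coordinates.
    slope, log_c = np.polyfit(np.log(distance), np.log(v_norm), 1)
    rss_power = np.sum((v_norm - np.exp(log_c) * distance ** slope) ** 2)

    return {"rss_linear": rss_linear, "rss_power_law": rss_power, "exponent": -slope}
```

      Applied per soma, the two residuals indicate which functional form tracks the measured decay more closely; over a narrow distance range the two forms can be hard to distinguish.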

      (7) In Figure 3A, it seems that the total number of cells is rather low for such a large number of electrodes. What are the quality criteria that are used to keep these cells? Did the authors exclude some cells from the analysis, and if yes, what are the quality criteria that are used to keep cells? If no criteria are used (because none are mentioned in the Methods), then how come so few cells are detected, and can the authors convince us that these neurons are indeed "clean" units (RPVs, SNRs, ...)?

      The reviewer is correct to point out that a number of stringent criteria were employed to exclude some putative cells. We now outline these criteria directly in the text (line 161):

      “At different steps in the process, conditions for rejecting spikes can be tailored by applying: (1) a stringent threshold to filtered voltages; (2) a minimal cut-off on the signal-to-noise ratio of voltages (see Supplemental Figure 2); (3) an LDA for cluster separability; (4) a minimal spike rate to putative neurons; (5) a Hartigan statistical dip test to detect spike bursting; (6) a decrease in voltage away from putative somas; and (7) a maximum spike-count correlation for nearby channels. Together, these criteria give SpikeMAP users the ability to precisely control parameters relevant to automated spike sorting.”

      Further, we provide SNRs of individual channels (Supplemental Figure 2), and added to the SpikeMAP software the ability to apply a minimal criterion based on SNR.

      (8) Still in Figure 3A, it looks like there is a bias to find inhibitory cells at the borders, since they do not appear to be uniformly distributed over the MEA. Can the authors comment on that? What would be the explanation for such a behaviour? It would be interesting to see some macroscopic quantities on Excitatory/Inhibitory cells, such as mean firing rates, averaged SNRs... Because again, in Figure 3C, it is not clear to me that the firing rates of inhibitory cells are higher than Excitatory ones, whilst they should be in theory.

      We have added figures showing the distribution of E and I firing rates across a population of N=1,950 putative cells (Supplemental Figure 3). Firing rates of inhibitory neurons are marginally higher than excitatory neurons, and both E and I follow an approximately exponential distribution of rates.

      The reviewer may be right that there are more I neurons at borders in Fig. 3B because injections were done in medial prefrontal cortex, so this may reflect an experimental artefact related to a high probability of activating I neurons in locations where the opsin was activated. We added a sentence to the text to clarify this point (line 201):

      “It is possible that the spatial location of putative I cells reflects the site of injection of the opsin in medial prefrontal cortex.”

      (9) For Figure 3 in general, I would have performed an exhaustive comparison of putative cells found by spikeMAP and other sorters. More precisely, I think that to prove the point that spikeMAP is indeed bringing something new to the field of spike sorting, the authors should have compared the performances of various spike sorters to discriminate Exc vs Inh cells based on their ground truth recordings. For example, either using Kilosort [Pachitariu et al, 2024, 10.1038/s41592-024-02232-7], or some other sorters that might be working with such large high-density data [Yger et al, 2018, 10.7554/eLife.34518].

      The reviewer is correct to point out that the spike-sorting portion of our pipeline shares similarities with related approaches. Other aspects, however, are unique to SpikeMAP. We have clarified the text accordingly:

      “In sum, SpikeMAP provides an end-to-end pipeline to perform spike-sorting on high-density multielectrode arrays. Some elements of this pipeline are similar to related approaches (Table 1), including the use of voltage filtering, PCA, and k-means clustering. Other elements are novel, including the use of spline interpolation, LDA, and the ability to identify putative excitatory and inhibitory cells.”

      (10) Figure 4 has a big issue, and I guess the panels A and B should be redrawn. I don't understand what the red rectangle is displaying.

      Again, we apologize for the rendering issues in the Figures that occurred during conversion into PDF format. We have now ensured that all figures are properly displayed.

      (11) I understand that Figure 4 is only one example, but I have a hard time understanding from the manuscript how many slices/mice were used to obtain the GT data. I guess the manuscript could be enhanced by turning the data into an open-access dataset, but then some clarification is needed. How many flashes/animals/slices are we talking about? Maybe this should be illustrated in Figure 4, if this figure is devoted to the introduction of the GT data.

      Details of the open access data are now provided in Supplemental Table 1. We also clarified Figure 5B:

      “Quantification of change in firing rate following optogenetic stimulation. Average firing rates are taken over four recordings obtained from 3 mice.”

      (12) While there is no doubt that GT data as the ones recorded here by the authors are the most interesting data from a validation point of view, the pretty low yield of such experiments should not discourage the use of artificially generated recordings such as the ones made in [Buccino et al, 2020, 10.1007/s12021-020-09467-7] or even recently in [Laquitaine et al, 2024, 10.1101/2024.12.04.626805v1]. In these papers, the authors have putative waveforms/firing rate patterns for excitatory and inhibitory cells, and thus, the authors could test how good they are in discriminating the two subtypes.

      We agree with the reviewer that it would be worthwhile for future work to apply SpikeMAP to artificially generated spike trains, and have added the following (line 328):

      “Another avenue could involve applying SpikeMAP on artificially generated spike data (Buccino & Einevoll 2021; Laquitaine et al., 2024).”

      Reviewer #1 (Recommendations for the authors):

      (1) Line 154 seems to include a parenthetical expression left over from editing: "sensitive to noise (contamination? Better than noise?) generated by the signal of proximal units." See also line 186: "use (reliance?) of light-sensitive" and line 245: "In the absence of synaptic blockers (right?)," and line 270: "the size of the data prevents manual intervention (curation?)." Check carefully for all parentheses like that, which should be removed.

      Thank you for pointing this out. We have revised the text and removed parenthetical expressions left over from editing.

      (2) In lines 285-286, you state that: "k-mean clustering of spike waveform properties best differentiated the two principal classes of cells..." But I could not find where you compared k-means clustering to other methods. I think you just argued that k-means seemed to work well, but not better than, another method. If that is so, then you should probably rephrase those lines.

      The reviewer is correct that direct comparisons are not performed here, hence we removed this sentence.

      (3) Methods section, E/I classification, lines 396-405: You give us figures on what fraction was E and I (PV subtype) (94.75% and 5.25%), but there is more that you could have said. First of all, what is the expected fraction of parvalbumin-sensitive interneurons in the cortex - is it near 5%?

      We clarified the text as follows (line 444): “This number is close to the expected percentage of PV interneurons in cortex (4-6%) (Markram et al. 2004).”

      Second, how would these percentages change if you altered the threshold from 3 s.d. to something lower, like 2 s.d.? Giving us some idea of how the threshold affects the fraction of PV interneurons could give us an idea of whether this method agrees with our expectations or not.

      While SpikeMAP offers the flexibility to set the voltage threshold manually, we opted for a stringent threshold to demonstrate the capabilities of the software. As seen in Figure 2D, at 2 and 3 s.d., the signal is largely accounted for by Gaussian noise, while deviation from noise arises around 4 s.d. We clarified the text as follows (line 120):

      “At a threshold of -3 s.d., the signal could be largely accounted for by Gaussian noise, while a separation between signal and noise began around a threshold of -4 s.d.”

      Third, did the inhibitory neurons identified by this optogenetic method also have narrow spike widths at half amplitude? Could you do a scatterplot of all the spike widths and inter-peak distances that had color-coded dots for E and I based on your optogenetic method?

      We have added a scatterplot (Supplemental Figure 5).

      (4) Can you compare your methods with others now widely in use, like, for example, Spiking Circus or Kilosort? You do that in Table 1 in terms of features, but not in terms of performance. For example, you could have applied Kilosort4 to your data from the 4096 electrode array and seen how often it sorted the same neurons that SpikeMAP did. I realize this could not give you a comparison of how many were E/I, but it could tell you how close your numbers of neurons agreed with their numbers. Were your numbers within 5% of each other? This would be helpful for groups who are already using Kilosort4.

      As mentioned earlier, packages listed in Table 1 do not provide an identification of putative E/I neurons on high-density electrode arrays. To facilitate the integration of SpikeMAP with other spike sorting packages, our suite now provides a stand-alone module to perform E/I identification. This is now mentioned in the text (see earlier comment).

      Reviewer #2 (Recommendations for the authors):

      I would encourage the authors to decide what the paper is about: is it about a new sorting method (and if yes, more tests/benchmarks are needed to explain the pros and the cons of the pipelines, and the Methods need to be expanded). Or is it about the new data for Ground Truth validation, and again, if yes, then maybe explain more what they are, how many slices/mice/cells, ... Maybe also consider making the data available online as an open dataset.

      We agree with the reviewer that the paper is best slated toward ground truth validation of E/I identification. We now specify how many slices/mice/cells etc. (see Supplemental Table 1) and make the data available online as open source.

    1. Author response:

      (1) Explore the temporal component of neural responses (instead of collapsing responses to a single number, i.e., the average response over 4s), and determine which of the three models can recapitulate the observed dynamics.

      (2) Expand the polar plot visualization to show all three slopes (changes in responses across all three successive concentrations) instead of only two slopes.

      (3) Attempt to collect and analyze, from published papers, data of: (a) first-order neuron responses to odors to determine the role of first-order inhibition towards generating non-monotonic responses, and (b) PN responses in Drosophila to properly compare with corresponding first-order neuron responses.

      (4) Further discuss: (a) why the brain may need to encode absolute concentration, (b) the distinction between non-monotonic responses and cross-over responses, and (c) potential limitations of the primacy model.

      (5) Expand the divisive normalization model by evaluating different values of k and R, and study the effects of divisive normalization on tufted cells.

      (6) Add discussion of other potential inhibitory mechanisms that could contribute towards the observed effects.

      Reviewer #1:

      The article starts from the premise that animals need to know the absolute concentration of an odor over many log units, but the need for this isn't obvious. The introduction cites an analogy to vision and audition. These are cases where we know for a fact that the absolute intensity of the stimulus is not relevant. Instead, sensory perception relies on processing small differences in intensity across space or time. And to maintain that sensitivity to small differences, the system discards the stimulus baseline. Humans are notoriously bad at judging the absolute light level. That information gets discarded even before light reaches the retina, namely through contraction of the pupil. Similarly, it seems plausible that a behavior like olfactory tracking relies on sensing small gradients across time (when weaving back and forth across the track) or space (across nostrils). It is important that the system function over many log units of concentration (e.g., far and close to a source) but not that it accurately represents what that current concentration is [see e.g., Wachowiak et al, 2025 Recalibrating Olfactory Neuroscience..].

      We thank the Reviewer for the insightful input and agree that gradients across time and space are important for various olfactory behaviors, such as tracking. At the same time, we think that absolute concentration is also needed for two reasons. First, in order to extract changes in concentration, the absolute concentration needs to be normalized out; i.e., change needs to be encoded with respect to some baseline, which is what divisive normalization computes. Second, while it is true that representing the exact number of odor molecules present is not important, this number directly relates to distance from the odor source, which does provide ethological value (e.g., is the tiger 100m or 1000m away?). Indeed, our decoding experiments focused on discriminating relative, and not on absolute, concentrations by classifying between each pair of concentrations (i.e., relative distances), which is effectively an assessment of the gradient. In our revision, we will make all of these points clearer.

      Still, many experiments in olfactory research have delivered square pulses of odor at concentrations spanning many log units, rather than the sorts of stimuli an animal might encounter during tracking. Even within that framework, though, it doesn't seem mysterious anymore how odor identity and odor concentration are represented differently. For example, Stopfer et al 2003 showed that the population response of locust PNs traces a dynamic trajectory. Trajectories for a given odor form a manifold, within which trajectories for different concentrations are distinct by their excursions on the manifold. To see this, one must recognize that the PN responds to an odor pulse with a time-varying firing rate, that different PNs have different dynamics, and that the dynamics can change with concentration. This is also well recognized in the mammalian systems. Much has been written about the topic of dynamic coding of identity and intensity - see the reviews of Laurent (2002) and Uchida (2014).

      Given the above comments on the dynamics of odor responses in first- and second-order neurons, it seems insufficient to capture the response of a neuron with a single number. Even if one somehow had to use a single number, the mean firing rate during the odor pulse may not be the best choice. For example, the rodent mitral cells fire in rhythm with the animal's sniffing cycle, and certain odors will just shift the phase of the rhythm without changing the total number of spikes (see e.g., Fantana et al, 2008). During olfactory search or tracking, the sub-second movements of the animal in the odor landscape get superposed on the sniffing cycle. Given all this, it seems unlikely that the total number of spikes from a neuron in a 4-second period is going to be a relevant variable for neural processing downstream.

      To our knowledge, it is not well understood how downstream brain regions read out mitral cell responses to guide olfactory behavior. The olfactory bulb projects to more than a dozen brain regions, and different regions could decode signals in different ways. We focused on the mean response because it is a simple, natural construct.

      The datasets we analyzed may not include all relevant timing information; for example, the mouse data is from calcium imaging studies that did not track sniff timing. Nonetheless, we plan to address this comment within our framework by binning time into smaller-sized windows (e.g., 0-0.2s, 0.2-0.4s, etc.) and repeating our analysis for each of these windows. Specifically, we will determine how each normalization method fares in recapitulating statistics of the population responses of each window, beyond simply assessing the population mean.
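
      As an illustration of the windowed analysis proposed above, here is a minimal sketch that splits a uniformly sampled response trace into consecutive windows and averages each one. The function name, the toy trace, and the assumed 10 Hz sampling rate (so that two samples span 0.2 s) are hypothetical and are not taken from the manuscript.

```python
def windowed_means(trace, samples_per_window):
    """Average a uniformly sampled trace over consecutive, non-overlapping windows.

    Trailing samples that do not fill a complete window are dropped.
    """
    if samples_per_window <= 0:
        raise ValueError("samples_per_window must be positive")
    means = []
    last_start = len(trace) - samples_per_window
    for start in range(0, last_start + 1, samples_per_window):
        window = trace[start:start + samples_per_window]
        means.append(sum(window) / samples_per_window)
    return means


# Toy trace: 1 s of data at an assumed 10 Hz sampling rate.
toy_trace = [0, 0, 1, 3, 4, 4, 2, 2, 1, 1]
per_window = windowed_means(toy_trace, 2)      # one value per 0.2 s window
overall_mean = sum(toy_trace) / len(toy_trace)  # the single-number summary
print(per_window, overall_mean)
```

      In this toy case the single overall mean hides a rise and fall that the per-window means retain, which is the information a temporal analysis is meant to recover.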

      Much of the analysis focuses on the mean activity of the entire population. Why is this an interesting quantity? Apparently, the mean stays similar because some neurons increase and others decrease their firing rate. It would be more revealing, perhaps, to show the distribution of firing rates at different concentrations and see how that distribution is predicted by different models of normalization. This could provide a stronger test than just the mean.

      We agree that mean activity is only one measure to summarize a rich data set and will perform the suggested analysis.

      The question "if concentration information is discarded in second-order neurons, which exclusively transmit odor information to the rest of the brain, how does the brain support olfactory behaviors, such as tracking and navigation?" is really not an open question anymore. For example, reference 23 reports in the abstract that "Odorant concentration had no systematic effect on spike counts, indicating that rate cannot encode intensity. Instead, odor intensity can be encoded by temporal features of the population response. We found a subpopulation of rapid, largely concentration-invariant responses was followed by another population of responses whose latencies systematically decreased at higher concentrations."

      Primacy coding does provide one plausible mechanism to decode concentration. Our manuscript demonstrated how such a code could emerge in second-order neurons with the help of divisive normalization, though it does require maintaining at least partial rank invariance across concentrations, which may not be robust. We also showed how concentration could be decoded via spike rates, even if average rates are constant, which provides an alternative hypothesis to that of ref 23.

      Further, ref 23 only considers the piriform cortex, which, as mentioned above, is one of many targets of the olfactory bulb, and it remains unclear what the decoding mechanisms are of each of these targets. In addition, work from the same authors of ref 23 found multiple potential decoding strategies in the piriform cortex itself, including changes in firing rate (see Fig. 2E of ref. 23 - Bolding & Franks, 2017; as well as Fig. 4 in Roland et al., 2017).

      It would be useful to state early in the manuscript what kinds of stimuli are being considered and how the response of a neuron is summarized by one number. There are many alternative ways to treat both stimuli and responses.

      We will add this explanation to the manuscript.

      "The change in response across consecutive concentration levels may not be robust due to experimental noise and the somewhat limited range of concentrations sampled": Yes, a number of the curves just look like "no response". It would help the reader to show some examples of raw data, e.g. the time course of one neuron's firing rate to 4 concentrations, and for the authors to illustrate how they compress those responses into single numbers.

      We agree and will add this information to the manuscript.

      "We then calculated the angle between these two slopes for each neuron and plotted a polar histogram of these angles." The methods suggest that this angle is the arctan of the ratio of the two slopes in the response curve. A ratio of 2 would result from a slope change from 0.0001 to 0.0002 (i.e., virtually no change in slope) or from 1 to 2 (a huge change). Those are completely different response curves. Is it reasonable to lump them into the same bin of the polar plot? This seems an unusual way to illustrate the diversity of response curve shapes.

      We agree that the two changes in the reviewer’s example will be categorized in the same quadrant in our analysis. We did not focus on the absolute changes because our analysis covers many log ratios of concentrations. Instead, we focused on the relative shapes of the concentration response curves, and more specifically, the direction of the change (i.e., the sign of the slope). We will better motivate this style of analysis in the revision. Moreover, in response to comments by Reviewer 2, we will compare response shapes between all three successive levels of concentration changes, as opposed to only two levels.
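
      The Reviewer's example can be checked numerically. The sketch below assumes the angle is computed as the two-argument arctangent of the pair of slopes, which keeps the sign of each slope; the exact formula in the Methods may differ, and the function name is hypothetical.

```python
import math


def slope_angle_deg(first_slope, second_slope):
    """Angle of the point (first_slope, second_slope), in degrees within [0, 360)."""
    return math.degrees(math.atan2(second_slope, first_slope)) % 360.0


# Two response curves with the same slope ratio but very different magnitudes.
tiny = slope_angle_deg(0.0001, 0.0002)
large = slope_angle_deg(1.0, 2.0)
print(tiny, large)
```

      Both pairs map to the same angle because the angle depends only on the ratio and signs of the slopes, not on their magnitudes. This is the property the Reviewer questions and the authors defend as a deliberate focus on response shape.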

      The Drosophila OSN data are passed through normalization models and then compared to locust PN data. This seems dangerous, as flies and locusts are separated by about 300 M years of evolution, and we don't know that fly PNs act like locust PNs. Their antennal lobe anatomy differs in many ways, as does the olfactory physiology. To draw any conclusions about a change in neural representation, it would be preferable to have OSN and PN data from the same species.

      We are in the process of requesting PN response data in Drosophila from groups that have collected such data and will repeat the analysis once we get access to the data.

      One conclusion is that divisive normalization could account for some of the change in responses from receptors to 2nd order neurons. This seems to be well appreciated already [e.g., Olsen 2010, Papadopoulou 2011, minireview in Hong & Wilson 2013].

      While we agree that these manuscripts do study the effects of divisive normalization in insects and fish, here we show that this computation also generalizes to rodents. In addition, these previous studies do not focus on divisive normalization’s role towards concentration encoding/decoding, which is our focus. We will clarify this difference in the revision.

      Another claim is that subtractive normalization cannot perform that function. What model was used for subtractive normalization is unclear (there is an error in the Methods). It would be interesting if there were a categorical difference between divisive and subtractive normalization.

      We apologize for the mistake in the subtractive normalization equation and will correct it. Thank you for catching it.

      Looking closer at the divisive normalization model, it really has two components: (a) the "lateral inhibition" by which a neuron gets suppressed if other neurons fire (here scaled by the parameter k), and (b) a nonlinear sigmoid transformation (determined by the parameters n and sigma). Both lateral inhibition and nonlinearity are known to contribute to decorrelation in a neural population (e.g., Pitkow 2012). The "intraglomerular gain control" contains only the nonlinearity. The "subtractive normalization" we don't know. But if one wanted to put divisive and subtractive inhibition on the same footing, one should add a sigmoid nonlinearity in both cases.

      Our intent was not to place all the methods on the “same footing” but rather to isolate the two primary components of normalization methods – non-linearity and lateral inhibition – and determine which of these, and in which combination, could generate the desired effects. Divisive normalization incorporates both components, whereas intraglomerular gain control and subtractive normalization only incorporate one of these components. We will clarify this reasoning in the revision.

      The response models could be made more realistic in other ways. For example, in both locusts and fish, the 2nd order neurons get inputs from multiple receptor types; presumably, that will affect their response functions. Also, lateral inhibition can take quite different forms. In locusts, the inhibitory neurons seem to collect from many glomeruli. But in rats, the inhibition by short axon cells may originate from just a few sparse glomeruli, and those might be different for every mitral cell (Fantana 2008).

      We thank the Reviewer for the input. Instead of fixing k for all second-order neurons, we will apply different k values for different neurons. We will also systematically vary the percentage of neurons used for the divisive normalization calculation in the denominator, and determine the regime under which the effects experimentally observed are reproducible. This approach takes into account the scenario that inter-glomerular inhibitory interactions are sparse.
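
      A minimal sketch of the proposed variation follows. It assumes a Hill-type form of divisive normalization in which each neuron's input is divided by a saturation term plus a pooled term scaled by k; the manuscript's actual equation is not reproduced in this response and may differ. The function names, default parameter values, and the random choice of pool members are all hypothetical.

```python
import random


def divisive_normalization(inputs, k, pools, R=1.0, n=1.5, sigma=1.0):
    """Assumed form: r_i = R * x_i**n / (sigma**n + x_i**n + (k_i * sum_{j in pool_i} x_j)**n).

    inputs: non-negative first-order responses x_i.
    k:      one inhibition strength per neuron (k_i).
    pools:  for each neuron, the indices of neurons contributing to its denominator.
    """
    outputs = []
    for i, x in enumerate(inputs):
        pooled = sum(inputs[j] for j in pools[i])
        denominator = sigma ** n + x ** n + (k[i] * pooled) ** n
        outputs.append(R * x ** n / denominator)
    return outputs


def sparse_pools(num_neurons, fraction, rng):
    """Give every neuron its own random pool containing `fraction` of the population."""
    size = max(1, round(fraction * num_neurons))
    return [rng.sample(range(num_neurons), size) for _ in range(num_neurons)]


rng = random.Random(0)
x = [1.0, 2.0, 4.0, 0.5, 3.0]
k_per_neuron = [rng.uniform(0.05, 0.2) for _ in x]  # heterogeneous k values
pools = sparse_pools(len(x), fraction=0.4, rng=rng)  # sparse lateral inhibition
print(divisive_normalization(x, k_per_neuron, pools))
```

      Sweeping `fraction` from small values up to 1.0 moves between sparse and global lateral inhibition, which is one way to look for the regime in which the experimentally observed effects are reproduced.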

      There are questions raised by the following statements: "traded-off energy for faster and finer concentration discrimination" and "an additional type of second-order neuron (tufted cells) that has evolved in land vertebrates and that outperforms mitral cells in concentration encoding" and later "These results suggest a trade-off between concentration decoding and normalization processes, which prevent saturation and reduce energy consumption.". Are the tufted cells inferior to the mitral cells in any respect? Do they suffer from saturation at high concentration? And do they then fail in their postulated role for odor tracking? If not, then what was the evolutionary driver for normalization in the mitral cell pathway? Certainly not lower energy consumption (50,000 mitral cells = 1% of rod photoreceptors, each of which consumes way more energy than a mitral cell).

      The question of what mitral cells are “good for”, compared to tufted cells, remains unclear in our view. We speculate that mitral cells provide superior context-dependent processing and are better for determining stimuli-reward contingencies, but this remains far from settled experimentally.

      We believe the mitral cell pathway evolved earlier than tufted cells, since the former appear akin to projection neurons in insects. Nonetheless, we agree that differences in energy consumption are unlikely to be the primary distinguishing factor, and in the revision, we will drop this argument.

      Reviewer #2:

      The main premise that divisive normalization generates this diversity of dose-response curves in the second-order neurons is a little problematic. … The analysis in [Figure 3] indicates that divisive normalization does what it is supposed to do, i.e., compresses concentration information and not alter the rank-order of neurons or the combinatorial patterns. Changes in the combinations of neurons activated with intensity arise directly from the fact that the first-order neurons did not have monotonic responses with odor intensity (i.e., crossovers). This was the necessary condition, and not the divisive normalization for changes in the combinatorial code. There seems to be a confusion/urge to attribute all coding properties found in the second-order neurons to 'divisive normalization.' If the input from sensory neurons is monotonic (i.e., no crossovers), then divisive normalization did not change the rank order, and the same combinations of neurons are activated in a similar fashion (same vector direction or combinatorial profile) to encode for different odor intensities. Concentration invariance is achieved, and concentration information is lost. However, when the first-order neurons are non-monotonic (i.e., with crossovers), that causes the second-order neurons to have different rank orders with different concentrations. Divisive normalization compresses information about concentrations, and rank-order differences preserve information about the odor concentration. Does this not mean that the non-monotonicity of sensory neuron response is vital for robustly maintaining information about odor concentration? Naturally, the question that arises is whether many of the important features of the second-order neuron's response simply seem to follow the input. Or is my understanding of the figures and the write-up flawed, and are there more ways in which divisive normalization contributes to reshaping the second-order neural response? This must be clarified. 
      Lastly, the tufted cells in the mouse OB are also driven by this sensory input with crossovers. How does the OB circuit convert the input with crossovers into one that is monotonic with concentration? I think that is an important question that this computational effort could clarify.

      It appears that there is confusion about the definitions of “non-monotonicity” and “crossovers”. These are two independent concepts – one does not necessarily lead to the other. Non-monotonicity concerns the response of a single neuron to different concentration levels. A neuron’s response is considered non-monotonic if it goes up then down, or down then up, across increasing concentrations. A “crossover” is defined based on the responses of multiple neurons. A crossover occurs when the response of one neuron is lower than that of another neuron at one concentration, but higher at a different concentration. For example, the responses of both neurons could increase monotonically with increasing concentration, but one neuron might start lower and grow faster, hence creating a crossover. We will clarify this in the manuscript, which we believe will resolve the questions raised above.
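
      The two definitions above can be written down directly. The sketch below is illustrative only; the function names and toy response values are hypothetical.

```python
def is_monotonic(responses):
    """True if a single neuron's responses never reverse direction across concentrations."""
    steps = [later - earlier for earlier, later in zip(responses, responses[1:])]
    return all(s >= 0 for s in steps) or all(s <= 0 for s in steps)


def has_crossover(neuron_a, neuron_b):
    """True if neuron_a is above neuron_b at one concentration and below it at another."""
    a_above = any(a > b for a, b in zip(neuron_a, neuron_b))
    a_below = any(a < b for a, b in zip(neuron_a, neuron_b))
    return a_above and a_below


# Both neurons increase monotonically, but the second starts lower and grows faster.
steady = [1.0, 2.0, 3.0, 4.0]
fast_grower = [0.5, 1.5, 3.5, 6.0]
print(is_monotonic(steady), is_monotonic(fast_grower), has_crossover(steady, fast_grower))

# A non-monotonic neuron that stays above `steady` at every concentration.
wobbly = [5.0, 7.0, 6.0, 8.0]
print(is_monotonic(wobbly), has_crossover(steady, wobbly))
```

      The first pair shows a crossover between two monotonic neurons, and the second shows a non-monotonic neuron with no crossover, so neither property implies the other.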

      The way the decoding results and analysis are presented does not add a lot of information to what has already been presented. For example, based on the differences in rank-order with concentration, I would expect the combinatorial code to be different. Hence, a very simple classifier based on cosine or correlation distance would work well. However, since divisive normalization (DN) is applied, I would expect a simple classification scheme that uses the Euclidean distance metric to work equally as well after DN. Is this the case?

      Yes, we used a simple classification scheme, logistic regression with a linear kernel, which is essentially a Euclidean distance-based classification. This scheme works better for tufted cells because they are more monotonic; i.e., if neuron A and B both increase their responsiveness with concentration, then Euclidean distance would be fine. But if neuron A’s response amplitude goes up and neuron B’s response goes down – as often happens for mitral cells – then Euclidean distance does not work as well. We will add intuition about this in the manuscript.

      Leave-one-trial/sample-out seems too conservative. How robust are the combinatorial patterns across trials? Would just one or two training trials suffice for creating templates for robust classification? Based on my prior experience (https://elifesciences.org/reviewed-preprints/89330), I do expect that the combinatorial patterns would be more robust to adaptation and hence also allow robust recognition of odor intensity across repeated encounters.

      As suggested, we will compute the correlation coefficient of the similarity of neural responses for each odor (across trials). We will repeat this analysis for both mitral and tufted cells. To determine the effect of adaptation, we will compute correlation coefficients of responses between the 1st and 2nd trials vs the 1st and final trial.

      Lastly, in the simulated data, since the affinity of the first-order sensory neurons to odorants is expected to be constant across concentration, and "Jaccard similarity between the sets of highest-affinity neurons for each pair of concentration levels was > 0.96," why would the rank-order change across concentration? DN should not alter the rank order.

      We agree that divisive normalization should not alter the rank order, but the rank order may change in first-order neurons, which carries through to second-order neurons. This confusion may be related to the one mentioned above re: cross-overs vs non-monotonicity. Moreover, in the simulated data (Fig. 4D-H), the Jaccard similarity was calculated based on only the 50 neurons with the highest affinity, not the entire population of neurons. As shown in Fig. 4H, most of the rank-order change happens in the remaining 150 neurons.

      Note that in response to a comment by Reviewer 3, we will change the presentation of Fig. 4H in the revision.
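
      The point about the Jaccard similarity being computed on only the highest-affinity neurons can be illustrated with a small sketch. The helper names and the toy values are hypothetical; the toy uses the top 3 of 6 neurons rather than the top 50 of the simulated population.

```python
def top_k(values, k):
    """Indices of the k largest values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])


def jaccard(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def ranking(values):
    """Neuron indices ordered from largest to smallest value."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


# Toy values for six neurons at two concentration levels.
low_conc = [9.0, 8.0, 7.0, 1.0, 2.0, 3.0]
high_conc = [9.0, 8.0, 7.0, 3.0, 1.0, 2.0]

print(jaccard(top_k(low_conc, 3), top_k(high_conc, 3)))  # top set is unchanged
print(ranking(low_conc), ranking(high_conc))              # full ranking differs
```

      A high Jaccard similarity among the top neurons is therefore compatible with rank-order changes in the rest of the population.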

      If the set of early responders does change, how will the decoder need to change, and what precise predictions can be made that can be tested experimentally? The lack of exploration of this aspect of the results seems like a missed opportunity.

      In the Discussion, we wrote about how downstream circuits will need to learn which set of neurons are to be associated with each distinct concentration level. We will expand upon this point and include experimentally testable predictions.

      Based on the methods, for Figures 1 and 2, it appears the responses across time, trials, and odorants were averaged to get a single data point per neuron for each concentration. Would this averaging not severely dilute trends in the data? The one that particularly concerns me is the averaging across different odorants. If you do odor-by-odor analysis, is the flattening of second-order neural responses still observable? Because some odorants activate more globally and some locally, I would expect a wide variety of dose-response relationships that vary with odor identity (more compressed in second-order neurons, of course). It would be good to show some representative neural responses and show how the extracted values for each neuron are a faithful/good representation of its response variation across intensities.

      It appears there is some confusion here; we will clarify in the text and figure captions that we did not average across different odors in our analysis. We will also add figure panels showing some representative neural responses as suggested by the Reviewer.

      A lot of neurons seem to have responses that flat line closer to zero (both firing rate and dF/F in Figure 1). Are these responsive neurons? The mean dF/F also seems to hover not significantly above zero. Hence, I was wondering if the number of neurons is reducing the trend in the data significantly.

      Yes, if a neuron responds to at least one concentration level in at least 50% of the trials, it is considered responsive. So it is possible that some neurons respond to one concentration level and otherwise flatline near zero. We will highlight a few example neurons to visualize this scenario.
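
      The responsiveness criterion stated above can be sketched as follows. How a response is detected on a single trial is not specified here, so the sketch takes per-trial detection flags as given; the function name and data layout are hypothetical.

```python
def is_responsive(flags_by_concentration, min_fraction=0.5):
    """True if, for at least one concentration, the neuron responded on at least
    `min_fraction` of trials.

    flags_by_concentration[c][t] is True when a response was detected on trial t
    at concentration c (the detection rule itself is assumed, not modeled).
    """
    for flags in flags_by_concentration:
        if flags and sum(flags) / len(flags) >= min_fraction:
            return True
    return False


# Responds on half the trials at the second concentration only.
print(is_responsive([[False, False, False, False], [True, True, False, False]]))
```

      Under this rule a neuron can count as responsive while its averaged response stays near zero at every other concentration.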

      I did not fully understand the need to show the increase in the odor response across concentrations as a polar plot. I see potential issues with the same. For example, the following dose-response trend at four intensities (C4 being the highest concentration and C1 the lowest): response at C3 > response at C1 and response at C4 > response at C2. But response at C3 < response at C2. Hence, it will be in the top right segment of the polar plot. However, the responses are not monotonic with concentrations. So, I am not convinced that the polar plot is the right way to characterize the dose-response curves. Just my 2 cents.

      Your 2 cents are valuable! Thank you for raising this point. Instead of computing two slopes (C1-C3 and C2-C4), we will expand our analysis to include all three slopes (C1-C2, C2-C3, C3-C4). Consequently, there are 2^3 = 8 different response shapes, and we will list them and quantify the fraction of the responses that fall into each shape category.
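
      The eight shape categories follow from the sign of each of the three successive slopes. The sketch below enumerates them and tallies a toy population; the names are hypothetical, and counting a zero slope as "-" is an assumption made only to keep the example at exactly eight categories.

```python
from collections import Counter
from itertools import product

# All sign patterns for the three slopes C1-C2, C2-C3, C3-C4.
ALL_SHAPES = ["".join(signs) for signs in product("+-", repeat=3)]


def shape_category(responses):
    """Sign pattern of successive slopes for responses at C1..C4 (ties count as '-')."""
    return "".join("+" if later > earlier else "-"
                   for earlier, later in zip(responses, responses[1:]))


def shape_fractions(population):
    """Fraction of neurons falling into each of the eight shape categories."""
    counts = Counter(shape_category(r) for r in population)
    return {shape: counts[shape] / len(population) for shape in ALL_SHAPES}


# The Reviewer's example: C3 > C1 and C4 > C2, but C3 < C2.
print(shape_category([1.0, 3.0, 2.0, 4.0]))
print(shape_fractions([[1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0]]))
```

      With three slopes, the Reviewer's non-monotonic example gets its own category instead of sharing a quadrant with monotonically increasing responses.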

      In many analyses, simulated data were used (Figures 3 and 4). However, there is no comparison of how well the simulated data fit the experimental data. For example, the Simulated 1st order neuron in Figure 3D does not show a change in rank-order for the first-order neuron. In Figure 3E, temporal response patterns in second-order neurons look unrealistic. Some objective comparison of simulated and experimental data would help bolster confidence in these results.

      We believe the Reviewer is referring to Figs. 4D and 4E, since Fig. 3D does not show a first-order neuron simulation, and there is no Fig 3E. In Fig. 4D there is no change of rank order because the simulation is for a single odor and single concentration level, and the change of rank-order (i.e., cross-overs) as we define occurs between concentration levels. We will clarify this in the manuscript.

      Reviewer #3:

      While the authors focus on concentration-dependent increases in first-order neuron activity, reflecting the majority of observed responses, recent work from the Imai group shows that odorants can also lead to direct first-order neuron inhibition (i.e., reduction in spontaneous activity), and within this subset, increasing odorant concentration tends to increase the degree of inhibition. Some discussion of these findings and how they may complement divisive normalization to contribute to the diverse second-order neuron concentration-dependence would be of interest and help expand the context of the current results.

      We thank the Reviewer for the suggestion. We will request datasets of first-order neuron responses from the groups who acquired them. We will analyze this data to determine the role of inhibition or antagonistic binding and quantify what percentage of first-order neurons respond less strongly with larger concentrations.

      Related to the above point, odorant-evoked inhibition of second-order neurons is widespread in mammalian mitral cells and significantly contributes to the flattened concentration-dependence of mitral cells at the population level. Such responses are clearly seen in Figure 1D. Some discussion of how odorant-evoked mitral cell inhibition may complement divisive normalization, and likewise relate to comparatively lower levels of odorant-evoked inhibition among tufted cells, would further expand the context of the current results. Toward this end, replication of analyses in Figures 1D and E following exclusion of mitral cell inhibitory responses would provide insight into the contribution of such inhibition to the flattening of the mitral cell population concentration dependence.

      We will perform the suggested analysis. Specifically, we will set the negative mitral cell responses to 0 and assess whether the population mean remains flat.
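
      A minimal sketch of this control, using hypothetical names and toy numbers rather than the recorded data: compute the population mean per concentration, then recompute it after clipping negative responses to zero.

```python
def population_mean(responses):
    """Mean across neurons at each concentration; responses[i][c] is neuron i at level c."""
    num_neurons = len(responses)
    num_levels = len(responses[0])
    return [sum(row[c] for row in responses) / num_neurons for c in range(num_levels)]


def rectify(responses):
    """Set negative (inhibitory) responses to zero, leaving the rest unchanged."""
    return [[max(value, 0.0) for value in row] for row in responses]


# Toy population: one neuron is increasingly excited, the other increasingly inhibited.
toy = [
    [0.1, 0.4, 0.9],
    [0.2, -0.1, -0.6],
]
print(population_mean(toy))           # roughly flat across concentrations
print(population_mean(rectify(toy)))  # rises once inhibition is removed
```

      If the real mitral cell data behave like this toy case, the flat population mean would be attributable largely to inhibitory responses; if the mean stays flat after rectification, another mechanism would be needed.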

      The idea of concentration-dependent crossover responses across the first-order population being required for divisive normalization to generate individually diverse concentration response functions across the second-order population is notable. The intuition of the crossover responses is that first-order neurons that respond most sensitively to any particular odorant (i.e., at the lowest concentration) respond with overall lower activity at higher concentrations than other first-order neurons less sensitively tuned to the odorant. Whether this is a consistent, generalizable property of odorant binding and first-order neuron responsiveness is not addressed by the authors, however. Biologically, one mechanism that may support such crossover events is intraglomerular presynaptic/feedback inhibition, which would be expected to increase with increasing first-order neuron activation such that the most-sensitively responding first-order neurons would also recruit the strongest inhibition as concentration increases, enabling other first-order neurons to begin to respond more strongly. Discussion of this and/or other biological mechanisms (e.g., first-order neuron depolarization block) supporting such crossover responses would strengthen these results.

      We thank the reviewer for providing additional mechanisms to consider. As suggested, we will add discussion of these alternatives to divisive normalization.

      It is unclear to what degree the latency analysis considered in Figures 4D-H works with the overall framework of divisive normalization, which in Figure 3 we see depends on first-order neuron crossover in concentration response functions. Figure 4D suggests that all first-order neurons respond with the same response amplitude (R in eq. 3), even though this is supposed to be pulled from a distribution. It's possible that Figure 4D is plotting normalized response functions to highlight the difference in latency, but this is not clear from the plot or caption. If response amplitudes are all the same, and the response curves are, as plotted in Figure 4D, identical except for their time to half-max, then it seems somewhat trivial that the resulting second-order neuron activation will follow the same latency ranking, regardless of whether divisive normalization exists or not. However, there is some small jitter in these rankings across concentrations (Figure 4G), suggesting there is some randomness to the simulations. It would be helpful if this were clarified (e.g., by showing a non-normalized Figure 4D, with different response amplitudes), and more broadly, it would be extremely helpful in evaluating the latency coding within the broader framework proposed if the authors clarified whether the simulated first-order neuron response timecourses, when factoring in potentially different amplitudes (R) and averaging across the entire response window, reproduces the concentration response crossovers observed experimentally. In summary, in the present manuscript, it remains unclear if concentration crossovers are captured in the latency simulations, and if not, the authors do not clearly address what impact such variation in response amplitudes across concentrations may have on the latency results. 
      It is further unclear to what degree divisive normalization is necessary for the second-order neurons to establish and maintain their latency ranks across concentrations, or to exhibit concentration-dependent changes in latency.

      As suggested by the Reviewer, we will add another simulation scenario where the response amplitudes (R) are different for different neurons. For each concentration, we will then average each neuron’s response across the entire response window and determine if the simulation reproduces the cross-overs as observed experimentally.
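
      The proposed check can be sketched as follows, assuming each simulated neuron yields one time course per concentration. The time courses, neuron labels, and function names below are hypothetical placeholders, not output of the authors' simulation.

```python
def window_mean(timecourse):
    """Average a response time course over the entire response window."""
    return sum(timecourse) / len(timecourse)


def curves_cross(curve_a, curve_b):
    """True if the sign of (a - b) flips somewhere across concentrations."""
    diffs = [a - b for a, b in zip(curve_a, curve_b)]
    return any(d > 0 for d in diffs) and any(d < 0 for d in diffs)


# Hypothetical simulated time courses: one list per concentration
# (low, medium, high) for each of two first-order neurons.
sim = {
    "sensitive": [[0.2, 0.6, 0.4], [0.3, 0.7, 0.5], [0.3, 0.6, 0.6]],
    "insensitive": [[0.0, 0.1, 0.2], [0.2, 0.5, 0.5], [0.6, 1.0, 0.8]],
}

# Concentration response function: window-averaged response per concentration.
crf = {name: [window_mean(tc) for tc in tcs] for name, tcs in sim.items()}
crossed = curves_cross(crf["sensitive"], crf["insensitive"])
```

      In this toy example the more sensitive neuron responds more strongly at low concentration and less strongly at high concentration, so `crossed` is true.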

      How the authors get from Figure 4G to 4H is not clear. Figure 4G shows second-order neuron response latencies across all latencies, with ordering based on their sorted latency to low concentration. This shows that very few neurons appear to change latency ranks going from low to high concentration, with a change in rank appearing as any deviation in a monotonically increasing trend. Focusing on the high concentration points, there appear to be 2 latency ranks switched in the first 10 responding neurons (reflecting the 1 downward dip in the points around neuron 8), rather than the 7 stated in the text. Across the first 50 responding neurons, I see only ~14 potential switches (reflecting the ~7 downward dips in the points around neurons 8, 20, 32, 33, 41, 44, 50), rather than the 32 stated in the text. It is possible that the unaccounted rank changes reflect fairly minute differences in latencies that are not visible in the plot in Figure 4G. This may be clarified by plotting each neuron's latency at low concentration vs. high concentration (i.e., similar to Figure 4H, but plotting absolute latency, not latency rank) to allow assessment of the absolute changes. If such minute differences are not driving latency rank changes in Fig. 4G, then a trend much closer to the unity line would be expected in Figure 4H. Instead, however, there are many massive deviations from unity, even within the first 50 responding neurons plotted in Figure 4G. These deviations include a jump in latency rank from 2 at low concentration to ~48 at high concentration. Such a jump is simply not seen in Figure 4G.

      We apologize that Fig. 4H was a poor choice for visualization. What is plotted in Fig. 4H is the sorted identity of neurons under low and high concentrations, and points on the y=x line indicate that the two corresponding neurons have the same rank under the two concentrations. We will replace this panel with a more intuitive visualization in which the x and y axes are the ranks of each neuron at the two concentrations, so that deviation from the y=x line indicates how much a neuron's rank differs between them.
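
      One way to compute the per-neuron ranks for such a plot is sketched below. The latencies are invented for illustration, and breaking ties by neuron index is an assumption of this sketch.

```python
def latency_ranks(latencies):
    """Rank of each neuron by response latency (0 = earliest).

    Ties are broken by neuron index because Python's sort is stable.
    """
    order = sorted(range(len(latencies)), key=lambda i: latencies[i])
    ranks = [0] * len(latencies)
    for rank, neuron in enumerate(order):
        ranks[neuron] = rank
    return ranks


# Hypothetical latencies (ms) for five neurons at two concentrations.
low = [10.0, 12.0, 15.0, 20.0, 30.0]
high = [5.0, 8.0, 7.0, 12.0, 18.0]

rank_low = latency_ranks(low)    # x-axis of the proposed panel
rank_high = latency_ranks(high)  # y-axis of the proposed panel

# Signed deviation from the y = x line, and number of neurons off the line.
rank_shift = [h - l for l, h in zip(rank_low, rank_high)]
n_changed = sum(1 for d in rank_shift if d != 0)
```

      Here neurons 1 and 2 swap places at high concentration, so two points fall off the y=x line and `n_changed` is 2.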

      In the text, the authors state that "Odor identity can be encoded by the set of highest-affinity neurons (which remains invariant across concentrations)." Presumably, this is a restatement of the primacy model and refers to invariance in latency rank (since the authors have not shown that the highest-affinity neurons have invariant response amplitudes across concentration). To what degree this statement holds given the results in Figure 4H, however, which appear to show that some neurons with the earliest latency rank at low concentration jump to much later latency ranks at high concentration, remains unclear. Such changes in latency rank for only a few of the first responding neurons may be negligible for classifying odor identity among a small handful of odorants, but not among 1-2 orders of magnitude more odors, which may feasibly occur in a natural setting. Collectively, these issues with the execution and presentation of the latency analysis make it unclear how robust the latency results are.

      The original primacy model states that the latency of a neuron decreases with increasing concentration, while the ranks of neurons remain unaltered. Our results, on the other hand, suggest that the ranks do at least partially change across concentrations. This leads to two possible decoding mechanisms. First, if the top K responding neurons remain invariant across concentrations (even if their individual ranks change within the top K), then the brain could learn to associate a population of K neurons with a response latency; lower response latency means higher concentration. Second, if the top K responding neurons do not remain invariant across concentrations, then the brain would need to learn to associate a different set of neurons with each concentration level. The latter imposes additional constraints on the robustness of the primacy model and the corresponding read-out mechanism. We will include more discussion of these possibilities in the revision.
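
      The distinction between the two scenarios can be made concrete by measuring how much of the top-K set is shared between concentrations. The sketch below uses invented latencies and hypothetical function names; it is not the authors' analysis.

```python
def top_k_set(latencies, k):
    """Indices of the k earliest-responding neurons."""
    order = sorted(range(len(latencies)), key=lambda i: latencies[i])
    return set(order[:k])


def top_k_overlap(lat_a, lat_b, k):
    """Fraction of the top-k set shared between two conditions.

    1.0 means the set is invariant, even if ranks inside it are reshuffled.
    """
    return len(top_k_set(lat_a, k) & top_k_set(lat_b, k)) / k


# Hypothetical latencies (ms) for six neurons at two concentrations.
low = [10.0, 12.0, 15.0, 20.0, 30.0, 35.0]
high = [6.0, 5.0, 9.0, 25.0, 8.0, 22.0]

overlap = top_k_overlap(low, high, 3)
```

      An overlap of 1.0 corresponds to the first scenario (a fixed set of K neurons), while values below 1.0, as in this toy example, correspond to the second.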

      Analysis in Figures 4A-C shows that concentration can be decoded from first-order neurons, second-order neurons, or first-order neurons with divisive normalization imposed (i.e., simulating second-order responses). This does not say that divisive normalization is necessary to encode concentration, however. Therefore, for the authors to say that divisive normalization is "a potential mechanism for generating odor-specific subsets of second-order neurons whose combinatorial activity or whose response latencies represent concentration information" seems too strong a conclusion. Divisive normalization is not generating the concentration information, since that can be decoded just as well from the first-order neurons. Rather, divisive normalization can account for the different population patterns in concentration response functions between first- and second-order neurons without discarding concentration-dependent information.

      We agree that the word “generating” is faulty. We thank the reviewer for their more precise wording, which we will adopt.

      Performing the same polar histogram analysis of tufted vs. mitral cell concentration response functions (Figure 5B) provides a compelling new visualization of how these two cell types differ in their concentration variance. The projected importance of tufted cells to navigation, emerging directly through the inverse relationship between average concentration and distance (Figure 5C), is not surprising, and is largely a conceptual analysis rather than new quantitative analysis per se, but nevertheless, this is an important point to make. Another important consideration absent from this section, however, is whether and how divisive normalization may impact tufted cell activity. Previous work from the authors, as well as from Schoppa, Shipley, and Westbrook labs, has compellingly demonstrated that a major circuit mediating divisive normalization of mitral cells (GABA/DAergic short-axon cells) directly targets external tufted cells, and is thus very likely to also influence projection tufted cells. Such analysis would additionally provide substantially more justification for the Discussion statement "we analyzed an additional type of second-order neuron (tufted cells)", which at present instead reflects fairly minimal analysis.

      We agree that tufted cells are subject to divisive normalization as well, albeit probably to a lesser degree than mitral cells. To determine the effect of this, we will alter the strength (and degree of sparseness of interglomerular interactions) of divisive normalization and determine if there is a regime where response features of tufted cells match those observed experimentally.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      The chosen classification scheme for aGPCRs may require reassessment and amendment by the authors in order to prevent confusion with previously issued classification attempts of this family. (…) Can the authors suggest another scheme (mind to avoid the subfamily IIX or the alternative ADGRA-G,L,V subfamily schemes of metazoan aGPCRs), and adapt their numbering throughout the text and all figures/supplementary figures/supplementary files?

      We appreciate the reviewer's comment and agree that a different nomenclature should be used for choanoflagellate aGPCRs to avoid possible confusion. We have now re-labeled the choanoflagellate aGPCR subfamilies, previously numbered from I to XIX, using alphabetical enumeration (from A to S). Changes have been made throughout the main text, in Figure 5, and in Supplementary Figures S6 and S7.

      line 10: The abbreviation 'GPCR-TKL/Ks' is not explained.

      Thank you for pointing this out. We have now revised the text to explain the abbreviation:

      “Adhesion GPCRs and a class of GPCRs fused to kinases (the GPCR-TKL/Ks) are the most abundant GPCRs in choanoflagellates.”

      line 30: "7TM domain is diagnostic for GPCRs": strange wording. Use an alternative expression.

      We changed the wording to: 

      “A conserved seven transmembrane (7TM) domain is a hallmark of GPCRs, while the wide spectrum of extracellular and intracellular domains in some GPCRs reflects the diversification of the gene family and its functions (Schiöth and Lagerström 2008).”

      line 33: In the case of rhodopsins, not the GPCR (i.e., the apoprotein) responds directly to photons, but the retinal, which isomerises upon illumination.

      We thank the reviewer for bringing this to our attention, and we have now removed mention of photons from the list of cues detected by GPCRs.

      “For example, the extracellular N-terminus and the three extracellular loops of the 7TM domain respond to a wide range of cues, including odorant molecules, peptides, amines, lipids, nucleotides, and other molecules (Yang et al. 2021).”

      line 111: What are "genome-enabled choanoflagellates"? Explain the term. As it stands, it doesn't make sense to me.

      We meant only to highlight that these two species have sequenced genomes. We have deleted the phrase “genome enabled.”

      “To assess the predictive power of our protein-detection pipeline, we then compared the new GPCR and cytosolic signaling component datasets from two choanoflagellates – Salpingoeca rosetta and Monosiga brevicollis – with previously published GPCR and downstream GPCR signaling component counts for these two species (Nordström et al. 2009a; Krishnan et al. 2012; De Mendoza et al. 2014; Krishnan et al. 2015; Lokits et al. 2018).”

      line 145: Please give a reasoning for the naming of each of the new families (e.g., RemiSens, Hidden Gold, GPCR-TLK/K, etc.) or at least the explanations of the acronyms/names early in the manuscript, even if they are discussed later in more detail.

      Thank you for identifying this as an area of confusion. While we feel that going into the rationale behind each of the names here would interrupt the flow of the manuscript, we have added a phrase encouraging readers to “hold that thought” with the hope that they can wait for the sections that specifically focus on each of these new GPCR families.

      “This left twelve new GPCR families that had not, to our knowledge, been previously detected in choanoflagellates: Rhodopsin, TMEM145, GPR180, TMEM87, GPR155, GPR157, and six additional GPCR families that appear to fall outside all previously characterized GPCR families in eukaryotes. For reasons that will be discussed further below, we have named these six new GPCR families “Rémi-Sans-Famille” (RSF), “Hidden Gold” (Hi-GOLD), GPCR-TKL/K, GPRch1, GPRch2, and GPRch3. (Fig. 1B; Table 1).”

      lines 297/298 and 2049: Rename tethered agonist "peptide" to "element". Synthetic peptides resembling the TA were used in experiments to test for the sufficiency of the TA for receptor activation, but because the naturally occurring TAs are part of the receptor protein, they are not peptides.

      Thank you for pointing this out. We have revised the text as suggested.

      line 2026: I think the letters in the acronym "CMR" are mixed up and were intended to read "CRM".

      Good catch! We have corrected the text.

      line 2048: "diagnostic" again. Change to "tell-tale", "hallmark", or another similar descriptor.

      We have corrected the text accordingly.

      2058: Strike "motif" in order to avoid confusion with the now obsolete term "GPS motif", which entailed the five most C-terminal β-strands of GAIN subdomain B (thus neither the full GAIN domain nor the GPS).

      Thank you for pointing this out. We have corrected the text.

      Figure 5: Did the authors also find homologs placed in the aGPCR family based on their 7TM domain sequence but lacking a GAIN domain similar to vertebrate ADGRA/GPR123, the only aGPCR known to lack a GAIN domain (10.1016/j.tips.2013.06.002)? Irrespective of the authors' findings or non-finding on that matter, please insert a note on this in the results text.

      We thank the reviewer for bringing this interesting point to our attention. We have now added a new panel A to Fig. S9 to address the reviewer's comment. We also modified the legend of Fig. S9 to take this change into account and uploaded a new supplementary data file 20 to support Fig. S9A. Finally, we revised the main text under the section “Adhesion GPCRs” as requested:

      Lines 328-331: “While the GAIN and aGPCR 7TM domains evolved before the origin of opisthokonts (Araç et al. 2012; Krishnan et al. 2012; De Mendoza et al. 2014), we detected the fusion of these two domains into a single module (GAIN/7TM) in most, but not all, holozoan aGPCRs (Fig. 5D, Fig. S7B and S9A; Supplementary file 20; Prömel et al. 2013; Krishnan et al. 2014).”

      Reviewer #2:

      While the study contributes several interesting observations, it does not radically revise the evolutionary history of the GPCR family. However, in an era increasingly concerned with the reproducibility of scientific findings, this is arguably a strength rather than a weakness. It is encouraging to see that previously established patterns largely hold, and that with expanded sampling and improved methods, new insights can be gained, especially at the level of specific GPCR subfamilies. Then, no functional follow-ups are provided in the model system Salpingoeca rosetta, but I am sure functional work on GPCRs in choanoflagellates is set to reveal very interesting molecular adaptations in the future.

      We agree with the reviewer and anticipate that this work will provide a useful resource to motivate the future functional characterization of GPCRs in choanoflagellates, other CRMs, as well as in metazoans.

      The GPCR-TKL fusion is a particularly interesting finding, especially given the presence of such sequences in sponges. This could potentially represent a synapomorphy shared between sponges and choanoflagellates, later lost in other animals. The authors mention that BLASTP searches using the kinase domain recover the sponge GPCR-TKLs, suggesting the fusion may be ancestral. It would be useful to include phylogenetic trees of both the GPCR and TKL domains to assess this possibility. The authors might also consider examining sponge genomes released by the DTOL project to increase representation from this group.

      We agree and thank the reviewer for this suggestion. We have now added the requested phylogenetic analyses to the new Figure S17, revised the supplementary files and Methods accordingly, and commented on these results in the main text under the section “GPCR-TKL/K and GPCR-TKs“.  

      Lines 579 – 589: “While no metazoan homologs were found when using the 7TM domain of choanoflagellate GPCR-TKs as queries, using the conserved tyrosine kinase domains as queries recovered GPCR-TKs in sponges but not in other metazoan lineages or other holozoans (Fig. S17E). To test whether GPCR-TKs in sponges and choanoflagellates are homologous, we performed phylogenetic analyses of their TK and 7TM domains (Fig. S17F and G). While the TK domains of GPCR-TKs from sponges and choanoflagellates formed a well-supported clade, their 7TM domains did not. These results point to a heterogeneous evolutionary history that may include domain swapping (i.e. ancestral GPCR-TKs in which the 7TM domain was replaced in either the sponge or choanoflagellate lineages) or convergent evolution, in which homologous TK domains fused with unrelated 7TM domains in the sponge and choanoflagellate lineages.”

      Added to the Method section “Sequence alignment and phylogenetic analyses”:

      Lines 913 – 933: “Phylogenetic analyses of holozoan aGPCRs, Glutamate Receptors, and Gα subunits, and the 7TM and Kinase domains from GPCR TK/TKL/Ks were performed in this study. (…) To construct the phylogenies of the Kinase domain and 7TM domain from the GPCR TK/TKL/Ks, we first built a dataset including all the GPCR TK/TKL/Ks sequences identified in choanoflagellates and in sponges, as well as the GPCR TKL/Ks previously published in oomycetes and amoebozoans (Van Den Hoogen et al. 2018). We extracted the 7TM domain and Kinase domain from each sequence by combining the transmembrane domain prediction tool TMHMM-2.0 and the protein domain prediction tool InterProScan with the alignment tool MAFFT (E-INS-I algorithm) on Geneious Prime v2024.07 (Supplementary Files 30 and 32). We then aligned the aGPCR, Glutamate Receptor, and GPCR TK/TKL/K 7TMs, the GPCR TK/TKL/K Kinase domains, or the full-length Gα sequences using MAFFT with the E-INS-I algorithm. The resulting alignments were then used for Maximum-likelihood and/or Bayesian inference of phylogenies (Fig. 3B, Fig. 5A, Fig. S3D, Fig. S6A, and Fig. S17F and G; Supplementary Files 5, 9, 16, 18, 31, and 33).”

      Rhodopsin-like receptors are proposed in the discussion to be potential cases of lateral gene transfer (LGT) between eukaryotes. To support or refute this hypothesis, it would be valuable to place the choanoflagellate and ichthyosporean Rhodopsins within a broader phylogeny of this family, including (a few) representatives from animals and other eukaryotes. Even if deep branching relationships remain unresolved, signs such as unusually short branches could point toward recent LGT events.

      Thank you for your suggestion. While we originally considered testing these alternative hypotheses in this manuscript by building a phylogeny, the rapid sequence evolution of the Rhodopsin family has stymied similar efforts in the past and instead motivated others to use clustering approaches like those used in our study (Hu et al. 2017; Thiel et al. 2023). Unfortunately, these types of analyses cannot be used to readily identify instances of LGT.

      Therefore, following the suggestion of the reviewer, we bit the bullet and performed phylogenetic analyses on the sequences in question. Unfortunately, these analyses were completely inconclusive, and we feel they do not warrant inclusion in the manuscript. The topologies of the sequence trees recovered were poorly supported and sensitive to most of the variables we tested – the set of rhodopsin sequences included, the multiple alignment algorithms used, and the probabilistic methods employed to infer the phylogenies. 

      Instead, we have revised the manuscript to highlight the challenge of differentiating between the different hypotheses that are consistent with the phylogenetic distribution of Rhodopsins:

      Lines 670 – 678: “Thus, while it is formally possible that Rhodopsins existed in stem choanoflagellates and were lost in most modern choanoflagellate lineages, either horizontal gene transfer or convergent evolution in the shared ancestor of S. macrocollata and S. punica are similarly plausible explanations for their presence in these species. Differentiating between these alternative evolutionary scenarios is challenging because of the rapid rate of sequence evolution within the family and the resultant loss of phylogenetic signal. Our own preliminary investigations of Rhodopsin evolution in non-metazoans were inconclusive. Therefore, ambiguities about the provenance and function of CRM Rhodopsins currently obscure the ancestry of metazoan Rhodopsins and opsins.”

      While the study surveys most available holozoan genomes, it appears that the genomes of Amoebidium spp.-which are cited in the manuscript- were not included. It may not be necessary to repeat all analyses with these two species (A. appalachense and A. parasiticum), but a preliminary search indicates the presence of four candidate 7tm_1 (Rhodopsin-like) proteins in their proteomes. These may warrant closer inspection (e.g., via BLASTP against animal databases) to confirm whether they are genuine GPCRs or false positives.

      Author response image 1.

      We thank the reviewer for bringing these sequences to our attention. To be clear, we did not analyze the Amoebidium spp. genome and we can find no reference to it in our manuscript. If the reviewer had the impression that the genome was analyzed, we would be grateful to know the source of the confusion so that it can be corrected. (We did not intentionally exclude the genome; it simply was not available on the Multicell Genome database from which we retrieved the ichthyosporean genomes and transcriptomes used in this study.)

      Nevertheless, out of curiosity, we have now analyzed the sequences provided by the reviewer and summarize our findings here for the interest of the reviewer. Although the sequences were annotated as 7tm_1 (Rhodopsin-like) proteins in the original genome study, none of these sequences group with metazoan or choanoflagellate Rhodopsins in our clustering analysis; instead, we found that these putative GPCRs form a distinct cluster that only weakly resembles cAMP receptors, both on the basis of their sequence and predicted structures. 

      It is not surprising to find new GPCR clusters as new taxa are folded into the study, and these Amoebidium sequences do not add to our understanding of Rhodopsin evolution. Therefore, we have not added their analysis to the manuscript, but we hope the reviewer finds our quick analysis of interest.

      Author response image 2.

      In Figure 2, perhaps expanding the other holozoan clades would have been nice, as there are not too many species, but I understand if that's beyond the point of the manuscript, focused on choanoflagellates.

      Thank you for this comment. However, given the focus of this study, we feel that an expansion of the other holozoan clades would reduce the clarity of the figure.

      line 87 - "To this end, the 671 validated choanoflagellate GPCRs were sorted by sequence similarity, resulting in 18 clusters." Some details in the results section would be nice, or at least clear references to where this is explained in more detail. How were the extra choanoflagellate GPCRs added if they failed to be identified with quite sensitive HMM profiles?

      We apologize for the possible confusion and thank the reviewer for the suggestion; we have now added specific references to the related sections from the material and methods for interested readers.

      We believe that the "extra choanoflagellate GPCRs" mentioned by the reviewer refer to the choanoflagellate GPCRs that failed to be detected when the choanoflagellate genomes and transcriptomes were searched with the predominantly metazoan-derived GPCRHMM and HMMs from the GPCR_A Pfam clan (CL0192). We were able to recover these extra choanoflagellate GPCRs by using custom choanoflagellate-specific GPCR HMMs and by blasting the choanoflagellate GPCRs previously identified as queries against the 23 choanoflagellate proteomes. We hope that the referencing of the Methods section "Recovering additional choanoflagellate GPCRs using choanoflagellate GPCR BLAST queries and custom choanoflagellate GPCR HMMs", in lines 91 and 93, will help clarify this point.

      line 108 - Well, from the figure it seems that most eukaryotes have an 'animal-like' G protein signalling, so that's perhaps more of an eukaryotic signature than something that puts choanoflagellates and animals together.

      Excellent point! We have revised the text.

      line 132 - It is unclear what the criteria are to include these taxa as helpers for choanoflagellate classification, and not adding the other unicellular holozoans. Just some text justification could help.

      Thank you for pointing this out. We have added an explanation of the rationale to the methods — section “Clustering of the 918 validated choanoflagellate GPCRs” — and referred to it in the main text.

      New text added to methods:

      “The non-choanoflagellate sequences added to the dataset were either top blast hits recovered after searching the entire Eukprot v3 dataset (993 species) with choanoflagellate GPCRs as queries, or previously published and well-documented GPCR sequences from metazoans.”

      line 145 - These families are listed, but perhaps it would be nice to explicitly mention that they will be covered in more detail later on in the manuscript. I found myself wondering about those exotic names, until I reached the sections in the manuscript where they are explained.

      Thank you for this suggestion. We have now modified our sentence to refer to the related sections.

      “For reasons that will be discussed further below, we have named these six new GPCR families “Rémi-Sans-Famille” (RSF), “Hidden Gold” (Hi-GOLD), GPCR-TKL/K, GPRch1, GPRch2, and GPRch3. (Fig. 1B; Table 1).”

      line 199 - perhaps would be nice to explain domain architecture of validated Dictyostelium GABA-like receptors (ANF domain?).

      Thank you for your suggestion. We have now modified the sentence to mention the protein domain composition of the validated GABA-like receptor, GrlE, in Dictyostelium.

      “The Glutamate Receptors from the amoebozoan Dictyostelium discoideum, of which at least one, GrlE, binds both GABA and Glutamate presumably through its conserved ANF domain (Anjard and Loomis 2006; Taniura et al. 2006; Wu and Janetopoulos 2013), grouped separately from metazoan and CRM GPCRs in our analysis.”

      Figure S4 - Perhaps a stacked bar chart would be easier to browse than a bunch of pie charts, notoriously difficult to quantify.

      Thank you for this comment. Opinions differ on whether pie charts or bar charts are more effective in this context (including between the authors of this manuscript). However, we think the point of Figure S4 is a minor one, only to be appreciated by a tiny number of readers, and therefore have left the data presentation as it was in the original submission.

    1. Author response:

      The following is the authors’ response to the original reviews

      We thank the reviewers for the constructive comments, which have improved the manuscript. In response to these comments, we have made the following major changes to the main text and reviewer response:

      (1) Added experimental and computational evidence to support the use of Cut&Tag to determine speckle location.

      (2) Performed new Transmission Electron Microscopy (TEM) experiments to visualize interchromatin granule clusters +/- speckle degradation.

      (3) Altered the text of the manuscript to remove qualitative statements and clarify effect sizes.

      (4) Performed new analyses of published whole genome bisulfite data from LIMe-Hi-C following DNMT1 inhibition to demonstrate that CpG methylation is lost at DNMT1i-specific gained CTCF sites.

      (5) Included citations for relevant literature throughout the text.

      These revisions in addition to others are described in the point-by-point response below.

      Reviewer #1 (Public review):

      Summary

      Roseman et al. use a new inhibitor of the maintenance DNA methyltransferase DNMT1 to probe the role of methylation on binding of the CTCF protein, which is known to be involved in chromatin loop formation. As previously reported, and as expected based on our knowledge that CTCF binding is methylation-sensitive, the authors find that loss of methylation leads to additional CTCF binding sites and increased loop formation. By comparing novel loops with the binding of the pre-mRNA splicing factor SON, which localizes to the nuclear speckle compartment, they propose that these reactivated loops localize near speckles. This behavior is dependent on CTCF whereas degradation of two speckle proteins does not affect CTCF binding or loop formation. The authors propose a model in which DNA methylation controls the association of genome regions with speckles via CTCF-mediated insulation.

      Strengths

      The strengths of the study are 1) the use of a new, specific DNMT1 inhibitor and 2) the observation, in line with the authors' general model, that genes whose expression is sensitive to DNMT1 inhibition and dependent on CTCF (cluster 2) show higher association with SON than genes that are sensitive to DNMT1 inhibition but CTCF-insensitive.

      Weaknesses

      There are a number of significant weaknesses that as a whole undermine many of the key conclusions, including the overall mechanistic model of a direct regulatory role of DNA methylation on CTCF-mediated speckle association of chromatin loops.

      We appreciate the reviewer’s constructive comments and address them point-by-point below.

      (1) The authors frequently make quasi-quantitative statements but do not actually provide the quantitative data, which they actually all have in hand. To give a few examples: "reactivated CTCF sites were largely methylated" (p. 4/5), "many CTCF binding motifs enriched..." (p.5), "a large subset of reactivated peaks..." (p.5), "increase in strength upon DNMT1 inhibition" (p.5); "a greater total number....." (p.7). These statements are all made based on actual numbers and the authors should mention the numbers in the text to give an impression of the extent of these changes (see below) and to clarify what the qualitative terms like "largely", "many", "large", and "increase" mean. This is an issue throughout the manuscript and not limited to the above examples.

      Related to this issue, many of the comparisons which the authors interpret to show differences in behavior seem quite minor. For example, visual inspection suggests that the difference in loop strength shown in figure 1E is something like from 0 to 0.1 for K562 cells and a little less for HCT116 cells. What is a positive control here to give a sense of whether these minor changes are relevant? Another example is on p. 7, where the authors claim that CTCF partners of reactivated peaks tend to engage in a "greater number" of looping partners, but inspection of Figure 2A shows a very minor difference from maybe 7 to 7.5 partners. While a Mann-Whitney test may call this difference significant and give a significant P value, likely due to high sample number, it is questionable that this is a biologically relevant difference.

      We have amended the text to include actual values, instead of just qualitative statements. We have also moderated our claims in the text to note where effect sizes are more modest.

      The following literature examples can serve as positive controls for the effect sizes that we might expect when perturbing CTCF. Our observed effect sizes are largely in line with these expected magnitudes.

      https://pmc.ncbi.nlm.nih.gov/articles/PMC8386078/ Fig. 2E

      https://www.cell.com/cell-reports/pdf/S2211-1247(23)01674-1.pdf Fig. 3J,K

      https://academic.oup.com/nar/article/52/18/10934/7740592 Fig. S5D (CTCF binding only).

      (2) The data to support the central claim of localization of reactivated loops to speckles is not overly convincing. The overlap with SON Cut&Tag (figure 2F) is partial at best and although it is better with the publicly available TSA-seq data, the latter is less sensitive than Cut&Tag and more difficult to interpret. It would be helpful to validate these data with FISH experiments to directly demonstrate and measure the association of loops with speckles (see below).

      A recent publication we co-authored validated the use of speckle (SON) Cut&Run using FISH (Yu et al., NSMB 2025, doi: 10.1038/s41594-024-01465-6). This paper also supports a role of CTCF in positioning DNA near speckles. Unfortunately, the resolution of these FISH probes is in the realm of hundreds of kilobases. This was not an issue for Yu et al., as they were looking at large-scale effects of CTCF degradation on positioning near speckles. However, FISH does not provide the resolution we need to look at more localized changes over methylation-specific peak sites.

      Instead, we use Cut&Tag to look at these high-resolution changes. In Figure 3C, we show that SON localizes to DNMT1i-specific peaks only upon DNMT1 inhibition. We further demonstrate that this interaction is dependent on CTCF. In response to reviewer comments, we have now also performed spike-in normalized Cut&Tag upon acute (6 hr) SON degradation to validate that our signal is also directly dependent on SON and not merely due to a bias toward open chromatin.

      Author response image 1.

      TSA-seq has been validated with FISH (Chen et al., doi: 10.1083/jcb.201807108; Alexander et al., doi: 10.1016/j.molcel.2021.03.006, Fig 6). We include TSA-seq data where possible in our manuscript to support our claims.

      We also note that Fig 2F shows all CTCF peaks and loops, not just methylation-sensitive peaks and loops, to give a sense of the data. We apologize for any confusion and have clarified this in the figure legend.

      (3) It is not clear that the authors have indeed disrupted speckles from cells by degrading SON and SRRM2. Speckles contain a large number of proteins and considering their phase separated nature stronger evidence for their complete removal is needed. Note that the data published in ref 58 suffers from the same caveat.

      Based upon the reviewers’ feedback, we generated transmission electron microscopy (TEM) data to visualize nuclear speckles +/- degradation of SON and SRRM2 (DMSO and dTAG). We were able to detect interchromatin granule clusters (IGCs) that are representative of nuclear speckles in the DMSO condition. However, even at baseline, we observed a large degree of cell-to-cell variability in these structures. In addition, we observed potential structural changes in the distribution of heterochromatin upon speckle degradation. Consequently, we hesitate to make quantitative conclusions regarding loss of these nuclear bodies. In the interest of transparency, we have included representative raw images from both conditions for the reviewers’ consideration.

      We also note that in Ref 58 (Ilik et al., https://doi.org/10.7554/eLife.60579), the authors show diffusion of speckle client proteins RBM25, SRRM1, and PNN upon SON and SRRM2 depletion, further supporting speckle dissociation in these conditions.

      Author response image 2.

      Author response image 3.

      (4) The authors ascribe a direct regulatory role to DNA methylation in controlling the association of some CTCF-mediated loops to speckles (p. 20). However, an active regulatory role of speckle association has not been demonstrated and the observed data are equally explainable by a more parsimonious model in which DNA methylation regulates gene expression via looping and that the association with speckles is merely an indirect bystander effect of the activated genes because we know that active genes are generally associated with speckles. The proposed mechanism of a regulatory role of DNA methylation in controlling speckle association is not convincingly demonstrated by the data. As a consequence, the title of the paper is also misleading.

      While it is difficult to completely rule out indirect effects, we do not believe that the relationship between methylation-sensitive CTCF sites and speckles relies only on gene activity.

      We can partially decouple SON Cut&Tag signal from gene activation if we break down Figure 4D to look only at methylation-sensitive CTCF peaks on genes whose expression is unchanged upon DNMT1 inhibition (using thresholds from manuscript, P-adj > 0.05 and/or |log2(fold-change)| < 0.5). This analysis shows that many methylation-sensitive CTCF peaks on genes with unchanged expression still change speckle association upon DNMT1 inhibition. This result refutes the necessity of transcriptional activation to recruit speckles to CTCF.
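
      To make the "unchanged expression" criterion above concrete, the following is a minimal illustrative sketch and not the authors' pipeline. It assumes the "and/or" in the thresholds means a gene counts as unchanged if it fails at least one of the two differential-expression criteria; the `GeneResult` record, the `is_unchanged` helper, and the example genes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GeneResult:
    gene: str
    padj: float              # adjusted P value from differential expression
    log2_fold_change: float  # DNMT1i vs. DMSO


def is_unchanged(result, padj_cutoff=0.05, lfc_cutoff=0.5):
    """Assumed reading of 'and/or': the gene fails at least one
    criterion for being called differentially expressed."""
    return result.padj > padj_cutoff or abs(result.log2_fold_change) < lfc_cutoff


# Hypothetical example genes, for illustration only.
results = [
    GeneResult("geneA", 0.60, 0.10),   # not significant, small change
    GeneResult("geneB", 0.001, 0.20),  # significant, but below fold-change cutoff
    GeneResult("geneC", 0.001, 1.80),  # significant and large: changed
]
unchanged = [r.gene for r in results if is_unchanged(r)]
```

      Under this reading, only genes that are both significant and above the fold-change cutoff are excluded from the unchanged set.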

      Author response image 4.

      We note the comparator upregulated gene set here is small (~20 genes with our stringent threshold for methylation-sensitive CTCF after 1 day DNMT1i treatment).

      However, we acknowledge that these effects cannot be completely disentangled. We previously included the statement “other features enriched near speckles, such as open chromatin, high GC content, and active gene expression, could instead contribute to increased CTCF binding and looping near speckles” in the discussion. In response to the reviewer’s comment, we have further tempered our statements on page 20/21 and also added a statement noting that DNA demethylation and gene activation cannot be fully disentangled. While we are also open to a title change, we are unsure which part of the title is problematic. 

      (5) As a minor point, the authors imply on p. 15 that ablation of speckles leads to misregulation of genes by altering transcription. This is not shown as the authors only measure RNA abundance, which may be affected by depletion of constitutive splicing factors, but not transcription. The authors would need to show direct effects on transcription.

      We agree, and we have changed this wording to say RNA abundance.

      Reviewer #2 (Public review):

      Summary:

      CTCF is one of the most well-characterized regulators of chromatin architecture in mammals. Given that CTCF is an essential protein, understanding how its binding is regulated is a very active area of research. It has been known for decades that CTCF is sensitive to 5-cytosine DNA methylation (5meC) in certain contexts. Moreover, at genomic imprints and in certain oncogenes, 5meC-mediated CTCF antagonism has very important gene regulatory implications. A number of labs (eg, Schubeler and Stamatoyannopoulos) have assessed the impact of DNA methylation on CTCF binding, but it is important to also interrogate the effect on chromatin organization (ie, looping). Here, Roseman and colleagues used a DNMT1 inhibitor in two established human cancer lines (HCT116 [colon] and K562 [leukemia]), and performed CTCF ChIPseq and HiChIP. They showed that "reactivated" CTCF sites (that is, sites bound in the absence of 5meC) are enriched in gene bodies, participate in many looping events, and intriguingly, appear associated with nuclear speckles. This last aspect suggests that these reactivated loops might play an important role in increased gene transcription. They showed that a number of genes that are upregulated in the DNA hypomethylated state actually require CTCF binding, which is an important result.

      Strengths:

      Overall, I found the paper to be succinctly written and the data presented clearly. The relationship between CTCF binding in gene bodies and association with nuclear speckles is an interesting result. Another strong point of the paper was combining DNMT1 inhibition with CTCF degradation.

      Weaknesses:

      The most problematic aspect of this paper, in my view, is the insufficient evidence for the association of "reactivated" CTCF binding sites with nuclear speckles, which needs to be more diligently demonstrated (see Major Comment). One unfortunate aspect was that this paper neglected to discuss findings from our recent paper, wherein we also performed CTCF HiChIP in a DNA methylation mutant (Monteagudo-Sanchez et al., 2024 PMID: 39180406). It is true that this is a relatively recent publication, although the BioRxiv version has been available since fall 2023. I do not wish to accuse the authors of actively disregarding our study, but I do insist that they refer to it in a revised version. Moreover, there are a number of differences between the studies such that I find them complementary rather than overlapping. To wit, the species (mouse vs human), the cell type (pluripotent vs human cancer), the use of a CTCF degron, and the conclusions of the paper (we did not make a link with nuclear speckles). Furthermore, we used a constitutive DNMT knockout, which is not viable in most cell types (HCT116 cells being an exception), and in the discussion mentioned the advantage of using degron technology:

      "With high-resolution techniques, such as HiChIP or Micro-C (119-121), a degron system can be coupled with an assessment of the cis-regulatory interactome (118). Such techniques could be adapted for DNA methylation degrons (eg, DNMT1) in differentiated cell types in order to gauge the impact of 5meC on the 3D genome."

      The authors here used a DNMT1 inhibitor, which for all intents and purposes is akin to a DNMT1 degron; thus, I was happy to see a study employ such a technique. A comparison between the findings from the two studies would strengthen the current manuscript, in addition to being more ethically responsible.

      We thank the reviewer for the helpful comments, which we address in the point-by-point response below. We sincerely apologize for this oversight in our references. We have included references to your paper in our revised manuscript. It is exciting to see these complementary results! We now include discussion of this work to contextualize the importance of methylation-sensitive CTCF sites and motivate our study.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      To address the above points, the authors should:

      (1) Provide quantitative information in the text on all comparisons and justify that the small differences observed, albeit statistically significant, are biologically relevant. Inclusion of positive controls to give an indication of what types of changes can be expected would be helpful.

      We have added quantitative information to the text, as discussed in the response to public comments above.  We also provide literature evidence of expected effect sizes in that response.

      (2) Provide FISH data to a) validate the analysis of comparing looping patterns with SON Cut&Tag data as an indicator of physical association of loops with speckles and b) demonstrate by FISH increased association of some of the CTCF-dependent loops/genes (cluster 2) with speckles upon DNMT1 inhibition.

      Please see response to Reviewer 1 comment #2 above. Unfortunately, FISH will not provide the resolution we need for point a). We have confidence in our use of TSA-seq and Cut&Tag to study SON association with CTCF sites on a genome-wide scale, which would not be possible with individual FISH probes. Specifically, since the submission of our manuscript several other researchers (Yu et al, Nat. Struct. and Mol. Biol. 2025, Gholamalamdari et al eLife 2025) have leveraged CUT&RUN/CUT&TAG and TSA-seq to map speckle associated chromatin and have validated these methods with orthogonal imaging based approaches.

      (3) Demonstrate loss of speckles upon SON or SRRM2 degradation by probing for other speckle components and, ideally, by electron microscopy analysis, which should show loss of interchromatin granules.

      We have performed TEM in K562 cells +/- SON/SRRM2 degradation. Please see response to Reviewer 1 comment #3. Specifically, interchromatin granule clusters are visible in the TEM images of the DMSO sample (see highlighted example above), however, given the heterogeneity of these structures and potential global alterations in heterochromatin that may be occurring following speckle loss, we refrained from making quantitative conclusions from this data. We instead include the raw images above.

      (4) The authors should perform experiments to clearly show whether loop association is transcription dependent or whether association is merely a consequence of gene activation. Alternatively, they should tone down their model ascribing a direct regulatory role of methylation in control of loop association with speckles and also discuss other models. Unless the model is more clearly demonstrated, the title of the paper should be changed to reflect the uncertainty of the central conclusion.

      Please see response to Reviewer 1 comment #4 above.

      (5) The authors should either probe directly for the effect of speckle ablation on transcription or change their wording.

      We have changed our wording to RNA abundance.

      Reviewer #2 (Recommendations for the authors):

      Major:

      ⁃ There was no DNA methylation analysis after inhibitor treatment. Ideally, genome bisulfite sequencing should be performed to show that the DNMT1i-specific CTCF binding sites are indeed unmethylated. But at the very least, a quantitative method should be employed to show the extent to which 5meC levels decrease in the presence of the DNMT1 inhibitor

      Response: We have now included analysis of genome-wide bisulfite information from LIMe-Hi-C (bisulfite Hi-C) in K562 following DNMT1 inhibition. Specifically, we leverage the CpG methylation readout and find that DNMT1i-specific CTCF sites are more methylated than non-responsive CTCF peaks at baseline. In addition, these sites show the greatest decrease in CpG methylation upon 3 days of DNMT1 inhibition. We include a figure detailing these analyses in the supplement (Fig S1E). In addition, we have added CpG methylation genome browser tracks (Fig S1D). In terms of global change, we have found that 3 days of DNMT1 inhibitor treatment reduces methylation to about 1/4 of the baseline level.

      I am not convinced that CUT&Tag is the proper technique to assess SON binding. CUT&Tag only works under stringent conditions (high salt), and can be a problematic assay for non-histone proteins, which bind less well to chromatin. In our experience, even strong binders such as CTCF exhibit a depleted binding profile when compared to ChIP seq data. I would need to be strongly convinced that the analysis presented in figures 2F-J and S2 D-I simply do not represent ATAC signal (ie, default Tn5 activity). For example, SON ChIP Seq, CUT&Tag in the SON degron and/or ATAC seq could be performed. What worries me is that increased chromatin accessibility would also be associated with increased looping, so they have generated artifactual results that are consistent with their model.

      As the reviewer suggested, we have now performed spike-in normalized SON Cut&Tag with DNMT1 inhibition and 6 hours of SON/SRRM2 degradation in our speckle dTAG knockin cell line. These experiments confirm that the SON Cut&Tag signal we see is SON-dependent. If the signal were due to artifactual binding, gained peaks would be open irrespective of speckle binding. However, we see a clear speckle dependence: the signal is much lower when SON is degraded.

      Author response image 5.

      Moreover, in our original Cut&Tag experiments, we did not enrich detectable DNA without using the SON antibody (see the last 4 samples, which are IgG controls). This further suggests that our signal is SON-dependent.

      Author response image 6.

      Finally, we see good agreement between Cut&Tag and TSA-seq (Spearman R=0.82).  The agreement is particularly strong in the top quadrant, which is most relevant since this is where the non-zero signal is.
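
      For readers unfamiliar with the statistic quoted above, a Spearman correlation is the Pearson correlation of the ranks of the two signals. The sketch below is purely illustrative: the per-bin values are made up, the function names are hypothetical, and it does not reproduce the authors' analysis or their reported value.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up per-bin signals (not real data), with ties at zero in one assay.
cut_and_tag = [0.0, 0.0, 0.1, 0.4, 0.9, 1.5]
tsa_seq = [-0.2, -0.5, 0.0, 0.3, 0.2, 1.1]
rho = spearman(cut_and_tag, tsa_seq)
```

      Because the statistic depends only on ranks, many bins tied at zero signal contribute little information, which is consistent with the authors' emphasis on the quadrant where the signal is non-zero.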

      Author response image 7.

      Minor points

      ⁃ Why are HCT116 cells more responsive to treatment than K562 cells? This is something that could be addressed with DNA methylation analysis, for example

      K562 is a broadly hypomethylated cell line (Siegenfeld et al., 2022 https://doi.org/10.1038/s41467-022-31857-5 Fig S2A-C). Thus, there may be less dynamic range to lose methylation compared to HCT116.

      Our results are also consistent with previous results comparing DKO HCT116 and aza-treated K562 cells (Maurano 2015, http://dx.doi.org/10.1016/j.celrep.2015.07.024). They state “In K562 cells, 5-aza-CdR treatment resulted in weaker reactivation than in DKO cells…” In addition, cell-type-specific responsiveness to DNA methyltransferase KO, depending upon global CpG methylation levels, has also been observed in ES and EpiLC cells (Monteagudo-Sanchez et al., 2024), which we now comment on in the manuscript.

      ⁃ How many significant CTCF loops in DNMTi, compared to DMSO? It was unclear what the difference in raw totals is.

      We now include a supplemental table with the HiChIP loop information. We call similar numbers of raw loops comparing DNMT1i and DMSO, as only a small subset of loops is changing.

      ⁃ For the architectural stripes, it would be nice to see a representative example in the form of a contact plot. Is that possible to do with the hiChIP data?

      As described in our methods, we called architectural stripes using Stripenn (Yoon et al 2022) from LIMe-Hi-C data under DNMT1i conditions (Siegenfeld et al, 2022). Shown below is a representative example of a stripe in the form of a Hi-C contact map.

      Author response image 8.

      ⁃ Here 4-10x more DNMT1i-specific CTCF binding sites were observed than we saw in our study. What are the thresholds? Could the thresholds for DNMT1i-specific peaks be defined more clearly? For what it's worth, we defined our DNMT KO-specific peaks as fold-change ≥ 2, adjusted P < 0.05. The scatterplots (1B) indicate a lot of "small" peaks being called "reactivated."

      We called DNMT1i-specific peaks using the HOMER getDifferentialPeaksReplicates function, with fold change > 2 and padj < 0.05. We further restricted these peaks to those that were not called in the DMSO condition.
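
      The selection logic described in this response can be sketched as a simple filter. This is an illustration under stated assumptions, not the authors' code: the peak records, their field names, and the `select_dnmt1i_specific` helper are hypothetical, and matching DMSO peaks by identifier stands in for what would in practice be a genomic-interval overlap.

```python
def select_dnmt1i_specific(diff_peaks, dmso_peak_ids,
                           min_fold_change=2.0, max_padj=0.05):
    """Keep peaks that pass both thresholds and were not called in DMSO."""
    return [
        peak["id"]
        for peak in diff_peaks
        if peak["fold_change"] > min_fold_change   # strict ">" as stated
        and peak["padj"] < max_padj
        and peak["id"] not in dmso_peak_ids        # simplification of overlap
    ]


# Hypothetical differential-peak results, for illustration only.
candidate_peaks = [
    {"id": "peak1", "fold_change": 3.5, "padj": 0.001},  # passes, absent in DMSO
    {"id": "peak2", "fold_change": 1.4, "padj": 0.001},  # fold change too small
    {"id": "peak3", "fold_change": 4.0, "padj": 0.200},  # not significant
    {"id": "peak4", "fold_change": 2.8, "padj": 0.010},  # passes, but called in DMSO
]
dmso_called = {"peak4"}
specific = select_dnmt1i_specific(candidate_peaks, dmso_called)
```

      The final condition is what distinguishes a DNMT1i-specific peak from one that is merely strengthened by the inhibitor.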

      ⁃ On this note, is "reactivated" the proper term? Reactivated with regards to what? A prior cell state? I think DNMT1i-specific is a safer descriptor.

      We chose this term based on prior literature (Maurano 2015 http://dx.doi.org/10.1016/j.celrep.2015.07.024, Spracklin 2023 https://doi.org/10.1038/s41594-022-00892-7). However, we agree it is not very clear, so we have altered the text to say “DNMT1i-specific”. We thank the reviewer for suggesting this improved terminology.

      ⁃ It appears there is a relatively small enrichment for CTCF peaks (of any class) in intergenic regions. How were intergenic regions defined? For us, it is virtually half of the genome. We did some enrichment of DNMT KO-specific peaks in gene bodies (our Supplemental Figure 1C), but a substantial proportion were still intergenic.

      We defined intergenic peaks using HOMER’s annotatePeaks function, with the -gtf option and Ensembl gene annotations (v104). We used the standard annotatePeaks priority order, which is TSS > TTS > CDS exons > 5’ UTR exons > 3’ UTR exons > introns > intergenic.

      Maurano et al. 2015 (http://dx.doi.org/10.1016/j.celrep.2015.07.024) also found reduced representation of intergenic sites among demethylation-reactivated CTCF sites in their Fig S5A. We note this is not a perfect comparison because their data is displayed as a fraction of all intergenic peaks.
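
      A priority order of this kind means that a peak overlapping several feature types receives only the highest-ranked label, so "intergenic" is assigned only when nothing else applies. The sketch below illustrates that rule in isolation; it is not HOMER's implementation, and the category strings and the `assign_annotation` helper are hypothetical.

```python
# Priority order as stated in the response, highest priority first.
PRIORITY = [
    "TSS",
    "TTS",
    "CDS exon",
    "5' UTR exon",
    "3' UTR exon",
    "intron",
    "intergenic",
]


def assign_annotation(overlapping_features):
    """Return the highest-priority category among the overlapped features.

    A peak overlapping no annotated feature falls through to 'intergenic'.
    """
    for category in PRIORITY:
        if category in overlapping_features:
            return category
    return "intergenic"


# A peak overlapping both an intron and a TSS is labeled by the TSS.
label = assign_annotation({"intron", "TSS"})
```

      One consequence is that the intergenic fraction depends on how the gene annotation and the higher-priority categories are defined, which may contribute to differences between studies.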

      ⁃ We also recently published a review on this subject: The impact of DNA methylation on CTCF-mediated 3D genome organization NSMB 2024 (PMID: 38499830) which could be cited if the authors choose.

      We have cited this relevant review.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      This manuscript investigates beta burst dynamics in the primate motor cortex during movement and recovery from stroke. The authors differentiate between "global" beta bursts, which are synchronous across cortical and often subcortical regions, and more spatially confined "local" bursts. Global bursts are associated with reduced spiking variability, slower movements, and are more frequent after stroke, while local bursts increase during recovery and grasp execution. The study provides compelling evidence that beta bursts with different spatial and temporal characteristics may play distinct roles in motor control and recovery.

      We thank the reviewer for their assessment that the manuscript provides compelling evidence for distinct roles of local and global beta bursts in motor control and recovery.

      Strengths:

      The major strength of this paper lies in its conceptual advance: the identification and characterization of distinct global and local beta bursts in the primate motor cortex. This distinction builds upon and considerably extends previous work on the heterogeneity of beta bursts. The paper is methodologically rigorous, using simultaneous cortical and subcortical recordings, detailed behavioral tracking, and thorough analyses of spike-LFP interactions. The use of stroke models and neurotypical animals provides converging evidence for the functional dissociation between burst types. The observation that local bursts increase with motor recovery and occur during grasping is particularly novel and may prove valuable for developing biomarkers of motor function.

      We thank the reviewer for recognizing the strengths of this manuscript. 

      Weaknesses:

      There are several conceptual and methodological limitations that should be addressed. First, the burst detection method relies on an amplitude threshold (median + 1 SD), which is susceptible to false positives and variability (Langford & Wilson, 2025). The classification into global or local bursts then depends on the number of co-bursting channels, compounding the arbitrariness. Second, the imposition of a minimum of three co-bursting cortical channels may bias against the detection of truly local bursts. 

      We thank the reviewer for bringing up these methodological details. We plan to conduct a follow-up analysis using alternative burst detection methods to verify that the paper’s main results hold when using different burst detection methodologies. We anticipate this will improve confidence in our results. 
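
      The two methodological choices under discussion (an amplitude threshold of median + 1 SD, and a minimum of three co-bursting cortical channels) can be illustrated with a toy sketch. This is not the authors' implementation: it labels individual samples rather than contiguous burst events, uses the population SD, and the `global_fraction` cutoff separating global from local bursts is a hypothetical parameter that is not given in the text.

```python
from statistics import median, pstdev


def detect_bursts(amplitude, n_sd=1.0):
    """Mark samples whose beta amplitude exceeds median + n_sd * SD."""
    threshold = median(amplitude) + n_sd * pstdev(amplitude)
    return [a > threshold for a in amplitude]


def classify_samples(channel_amplitudes, min_channels=3, global_fraction=0.5):
    """Label each time sample as 'none', 'local', or 'global'.

    min_channels follows the three-channel minimum discussed above;
    global_fraction is an assumed, illustrative cutoff.
    """
    masks = [detect_bursts(channel) for channel in channel_amplitudes]
    n_channels = len(masks)
    labels = []
    for sample in zip(*masks):
        n_bursting = sum(sample)
        if n_bursting < min_channels:
            labels.append("none")
        elif n_bursting / n_channels >= global_fraction:
            labels.append("global")
        else:
            labels.append("local")
    return labels


def make_channel(burst_samples, n_samples=6):
    """Toy amplitude trace: 5.0 during a burst, 1.0 otherwise."""
    return [5.0 if t in burst_samples else 1.0 for t in range(n_samples)]


# Eight toy channels: all burst at sample 1, three burst at sample 4,
# and a single channel bursts alone at sample 2.
channels = (
    [make_channel({1, 4}) for _ in range(3)]
    + [make_channel({1}) for _ in range(4)]
    + [make_channel({1, 2})]
)
labels = classify_samples(channels)
```

      In this toy example the burst confined to a single channel is discarded by the three-channel minimum, which is the bias against truly local bursts that the reviewer raises.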

      Third, the classification is entirely cortical; subcortical activity is considered post hoc rather than integrated into the classification, despite the key role of subcortical-cortical synchrony in motor control. 

      We thank the reviewer for this comment. First, because the different animals had subcortical recording sites in different locations, we hesitated to use subcortical activity in the classification of bursts since we were not sure we would be identifying the same burst phenomenon (e.g. thalamo-cortical bursts vs. capsule-cortical bursts may differ). Second, we believe that having a cortical-only criterion allows the designation of local vs. global bursts to be more widely applied in preparations that only have access to cortical data (e.g. surface ECoG recordings, EEG, Utah array recordings). Thus, in this study we chose to analyze the subcortical data post-hoc (after burst detection and classification) to support our “global” vs. “local” designation of burst types.

      Fourth, the apparent dissociation between global and local bursts raises important questions about their spatial distribution across areas like M1 and PMv, which are not thoroughly analyzed. 

      We thank the reviewer for this comment. In our study’s stroke animals, we chose to study PMv due to its role in compensating for damage to M1, thus we hesitate to make any comparisons between PMv (which was recorded in stroke animals) and M1 (recorded in healthy unimpaired animals). Furthermore, animals are doing different tasks (e.g. reaching vs. reaching and grasping) which may also influence the spatial distribution. We agree that future work should certainly investigate the spatial distribution of global vs. local beta bursts across areas of sensorimotor cortex and subcortex, and that this comparison would be best done in healthy animals with both reaching and grasping behaviors.  

      Finally, while the authors interpret local bursts during grasping as novel, similar findings have been reported (e.g., Szul et al., 2023; Rayson et al., 2023), and a deeper discussion of these precedents would strengthen the argument.

      Thank you for these references! We will review them and incorporate them into our discussion of our results. 

      Impact:

      This work is likely to have a substantial impact on the field of motor systems neuroscience. The distinction between global and local beta bursts offers a promising framework for understanding the dual roles of beta in motor inhibition and sensorimotor computation. The findings are relevant not only for basic research but also for translational efforts in stroke rehabilitation and neuromodulation, particularly given the emerging interest in beta burst-based biomarkers and stimulation targets. The dataset and analytical framework will be useful to researchers investigating beta dynamics, spike-field relationships, and recovery from neural injury.

      We thank the reviewers for their assessment that our work will likely have a substantial impact on the field of motor systems neuroscience. 

      Reviewer #2 (Public review):

      Summary:

      The paper by Khanna et al. describes global vs local beta synchrony between a cortical premotor area (PMv) and subcortical structures during motor tasks in the non-human primate, specifically investigating the progression following M1 injury. They found that global beta synchrony between PMv and subcortical structures increased during the sub-acute phase of injury, and that global synchrony was associated with relatively slower motor movements. As recovery progressed, they report a shift from global synchrony to local synchrony and a subsequent reduction in the movement time. The authors suggest that global changes in subcortical and cortical beta synchrony may generally underpin a variety of movement disorders, including Parkinson's disease, and that shifting from global to local (or reducing global synchrony) might improve functional outcomes.

      Strengths:

      Ischemic insults and other acquired brain injuries have a significant public health impact. While there is a large body of clinical and basic science studies describing the behavioral, neurophysiological, and mechanistic outcomes of such injury, there is a significant lack of studies looking at longitudinal, behaviorally-related neurophysiological measures following cortical injury, so any information makes an outsized contribution to understanding how brain injury disrupts underlying neural activity and how this may contribute to injury presentation and recovery.

      A significant percentage of pre-clinical stroke studies tend to focus on peri-infarct or other cortical structures and their role in recovery. The addition of subcortical recordings, which allows investigation of the role of thalamo-basal ganglia-cortical loops that may contribute to the degree of impairment or to the recovery process, is important for the field. Here, there are longitudinal (up to 3 months post-injury) recordings in the ventral premotor area (PMv) and either the internal capsule or sensorimotor thalamus that can be synchronized with phases of behavioral recovery.

      The methods are well described and can act as a framework for assessing synchrony across other data sets with similar recording locations. Limitations in methodology, recordings, and behavior were noted.

      We thank the reviewer for their comments on the strengths of this paper.  

      Weaknesses:

      A major limitation of this paper is that it is a set of case studies rather than a well-designed, well-controlled study of beta synchrony following motor cortex injury. While non-human primate neurophysiological studies are almost always limited by extremely low animal numbers, this is made up for by the fact that they can acquire significant numbers of units or channels, and in the case of normal behavior, can obtain many behavioral trials over months of individual sessions. Here, there were two NHPs used, but they had different subcortical implant locations (thalamus vs internal capsule). They had different injury outcomes, with one showing a typical recovery curve following injury while one had complications and worsening behavior before ultimately recovering. Further, there were significant differences in the ability to record at different times, with one NHP having poor recordings early in the recovery process while one had poor recordings late in the process. Due to the injury, the authors report sessions in which they were not able to record many trials (~10). Assuming that recovery after a cortical injury is an evolving process, breaking analysis into "Early" and "Late" phases limits the interpretation of where these shifts occur relative to recovery on the task, especially given that different thresholds for recovery were used between animals. Because of this, despite a careful analysis of the data and an extensive discussion, the conclusions derived are not particularly compelling. To overcome this, the authors present data from neurotypical NHPs, but with electrodes in M1 rather than PMv, doing a completely different task with no grasping component, again making accurate conclusions about the results difficult. Even with low numbers, the study would have been much stronger if there were within-animal longitudinal data prior to and after the injury on the same task, so the impact of M1 injury could be better assessed.

      We thank the reviewer for these comments. Below we address some of these in more detail: 

      Different subcortical implant locations: We would like to clarify that the subcortical recordings were only used to confirm that global beta bursts (as characterized by cortical recordings alone) did indeed occur on subcortical sites coincidentally with cortical site more frequently than local beta bursts. Neither the beta burst categories nor the beta bursts themselves were influenced by the subcortical recordings.  

      Different injury outcomes: There is difficulty in creating strokes that result in identical deficits across animal as we and others have noted in previous work[1.3]. As a field, we are still understanding what factors give rise to variability in recovery curves. For example, one recent study noted that biological sex is a factor in predicting differences in recovery rates[4], and another noted that baseline white matter hyperintensities is also predictive of post-stroke recovery [5]. Overall, our methodology that creates structurally-consistent lesions can still result in very different functional outcomes depending on a variety of factors. Given this state of the field, we have done our best to match the recovery curves between our two animals, especially the initial recovery curves before Monkey H’s secondary decline. 

      Differences in ability to record at different times: We note this as a strength. One concern with studies that induce stroke at the same time as implanting electrode arrays is that single-unit neuron yield is known to be low right after array implantation and then to improve in the following weeks[6]. There is always the concern that having more units later in recovery may drive results, but in this case, since one animal showed the opposite trend, we are more confident that results are not driven by increases in unit yield. We also note that we broadly see similar unit quality metrics in the early and late stages in both animals (Fig. S7).

      Breaking continuous recovery curve into early and late: We note that this division was only made for one main analysis in the paper (Fig. 5CD): assessment of mean firing and variance of single-unit firing rates.  Without this split our analyses would be underpowered and inconclusive, thus we would not be able to provide any comment on how firing rates change, even coarsely, with recovery. 

      Presentation of data from M1 of healthy animals doing a different task: We agree that the strongest data would be longitudinally recorded from the same animals/brain areas pre-stroke and then post-stroke. However, we also view our inclusion of separate healthy animals doing a different task as evidence that our global vs. local segregation of beta bursts generalizes beyond the reach-to-grasp task to reaching-only tasks.  

      Overall, we appreciate the reviewer pointing out these notes about our data. In some cases we do not think these notes are concerning; in others, we acknowledge that we have done the best we can given the state of the neurophysiology stroke recovery field.

      It is unclear to what extent the subpial aspiration used is a stroke model. While it is much more difficult to perform a pure ischemic motor injury using electrocoagulatory methods in animal models that do not have a lissencephalic cortex, the suction ablation method that the authors use leads to different outcomes than an ischemic injury alone. For instance, in rat models, ischemic vs suction ablation leads to very different electrophysiological profiles and differences in underlying anatomical reorganization (see Carmichael and Chesselet, 2002), even if the behavioral outcomes were similar. There is a concern that the effects shown may be an artifact of the lesion model rather than informing underlying mechanisms of recovery.

      We thank the reviewer for bringing this up. 

      Clarification of our stroke model methodology: We wish to highlight that when we create a stroke, we perform surface vessel occlusion as the first step. This is designed to match true ischemic injury. After a waiting period, the injured tissue is then aspirated to reduce the effects of edema and secondary mass effect in the model.

      Carmichael and Chesselet 2002: The rodent work cited did show differential effects of a suction ablation method (without any surface vessel occlusion first) versus an ischemic method. The effects observed in this work were in the first 5 days following stroke. In our case, we started recording on day 7 and examined recovery over extended periods (weeks to months). 

      Effects of acute insult on rehabilitation: From a rehabilitation perspective, it remains unclear how the acute insult affects outcomes weeks and months later. One line of evidence suggesting that the manner in which the acute insult occurs may not matter for rehabilitation is the observation that one therapeutic approach (vagus nerve stimulation) has been found to successfully improve rehabilitation outcomes in a range of injury models (intracranial hemorrhage, stroke, spinal cord injury). We agree that additional work is required in this area.

      Human stroke data shows similar results reported: Lastly, we note that neurophysiology performed in humans with clinical strokes supports the results we see here (e.g.[7], see discussion section for full elaboration), suggesting that our stroke model methodology is similar enough to clinical stroke to yield similar results.

      The injury model leads to seemingly mild impairments in grasp (but not reach), with rapid and complete recovery occurring within 2-3 weeks from the time of injury. Because of the rapid recovery, relating the physiological processes of recovery to beta synchronization becomes challenging to interpret - Are the global bursts the result of the loss of M1 input to subcortical structures? Are they due to the lack of M1 targets, so there is a more distributed response? Is this due to other post-injury sub-acute mechanisms? How specific is this response - is it limited to peri-infarct areas (and to what extent is the PMv electrode truly in peri-infarct cortex), or would this synchrony be seen anywhere in the sensorimotor networks? Are the local bursts present because global synchrony wanes over time as a function of post-injury homeostatic mechanisms, or is local beta synchrony increasing as new motor plans are refined and reinforced during task re-acquisition? How coupled are they related to recovery - if it is motor plan refinement, the shift from global to local seemingly should lag the recovery?  

      We think these are all wonderful questions that could be addressed in follow-up studies! 

      While the study has significant limitations in design that reduce the impact of the results, it should act as a useful baseline/pilot data set in which to build a more complete picture of the role of subcortical-cortical beta synchrony following cortical injury.

      We agree that this is a study that should be treated as a starting point for further investigation. 

      Reviewer #3 (Public review):

      Summary:

      Khanna et al. use a well-conceived and well-executed set of experiments and analyses primarily to document the interaction between neural oscillations in the beta range (here, 13-30 Hz) and recovery of function in an animal model of stroke. Specifically, they show that cortical "beta bursts", or short-term increases in beta power, correlate strikingly with the timeline of behavioral recovery as quantified with a reach-to-grasp task. A key distinction is made between global beta bursts (here, those that synchronize between cortical and subcortical areas) and local bursts (which appear on only a few electrodes). This distinction of global vs. local is shown to be relevant to task performance and movement speed, among other quantities of interest.

      A secondary results section explores the relationship between beta bursts and neuronal firing during the grasp portion of the behavioral task. These results are valuable to include, though mostly unsurprising, with global beta in particular associated with lower mean and variance in spike rates.

      Last, a partial recapitulation of the primary results is offered with a neurologically intact (uninjured) animal. No major contradictions are found with the primary results.

      Highlights of the Discussion section include a thoughtful review of atypical movements executed by individuals with Parkinson's disease or stroke survivors, placing the current results in an appropriate clinical context. Potential physiological mechanisms that could account for the observed results are also discussed effectively.

      Strengths:

      Overall, this is a very interesting paper. The ultimate impact will be enhanced by the authors' choice to analyze beta bursts, which remain a relatively under-explored aspect of neural coding.

      The reach-and-grasp task was also a well-considered choice; the combination of a relatively simple movement (reaching towards a target in the same location each time) and a more complex movement (a skilled object-manipulation grasp) provides an internal control of sorts for data analysis. In addition, the task's two sub-movements provide a differential in terms of their likelihood to be affected by the stroke-like injury: proximal muscles (controlling reach) are likely to be less affected by stroke, while distal muscles (controlling grasp) are highly likely to be affected. Lastly, the requirement of the task to execute an object lift maximizes its difficulty and also the potential translational impact of the results on human injury.

      The above comments about the task exemplify a strength that is more generally evident: a welcome awareness of clinical relevance, which is in evidence several times throughout the Results and Discussion.

      Weaknesses:

      The study's weaknesses are mostly minor and, for the most part, correctable.

      One concern that may not be correctable in this study: the results about the spatial extent of beta activity seem constrained by relatively poor-quality data. It seems half or more of the electrodes are marked as too noisy to provide useful data in Figure 3. If this reflects the wider reality for all analyses, as mentioned, it may not be correctable for the present study. In that case, perhaps some of the experiments or analyses can be revisited or expanded for a future study, when better electrode yields are available.

      We thank the reviewer for their comments. We note that we have chosen to be particularly conservative with which channels we considered noise-free and acceptable for analysis as our animals were not head-posted (see methods: “On each day, trials were manually inspected alongside camera data for any movement or chewing artifacts (note that animals were not head-posted) and were discarded from neural data analysis if there were any artifacts”). After re-visiting our analysis, we note that the data shown in Fig. 3 (spatial distribution of local bursts) is not representative from a data quality perspective – this data was from a session that had a particularly large number of channels discarded due to artifacts. We plan to correct this to show a more representative figure. 

      Other concerns:

      In some places, there is a lack of clarity in the presentation of the results. This is not serious but should be addressed to aid readers' comprehension.

      We thank the reviewer for this comment and for their numerous suggestions in the notes to the authors. We plan to address as many of these as we can to improve clarity and comprehension.  

      Lastly, given the central role of beta oscillations within the study, it would be better for completeness to include even a brief exploration of sustained beta power (rather than bursts), and the modulation of sustained beta (or lack thereof) in the study's areas of concern: behavioral recovery, task performance, etc.

      We thank the reviewer for this suggestion – we plan to include this in our revisions.  

      References cited in response to public reviewer comments: 

      (1) Ganguly, K., Khanna, P., Morecraft, R. J. & Lin, D. J. Modulation of neural co-firing to enhance network transmission and improve motor function after stroke. Neuron 110, 2363–2385 (2022).

      (2) Khanna, P. et al. Low-frequency stimulation enhances ensemble co-firing and dexterity after stroke. Cell 184, 912-930.e20 (2021).

      (3) Darling, W. G. et al. Sensorimotor Cortex Injury Effects on Recovery of Contralesional Dexterous Movements in Macaca mulatta. Exp Neurol 281, 37–52 (2016).

      (4) Bottenfield, K. R. et al. Sex differences in recovery of motor function in a rhesus monkey model of cortical injury. Biology of Sex Differences 12, 54 (2021).

      (5) Schwarz, A. et al. Association that Neuroimaging and Clinical Measures Have with Change in Arm Impairment in a Phase 3 Stroke Recovery Trial. Ann Neurol 97, 709–719 (2025).

      (6) Gulati, T. et al. Robust Neuroprosthetic Control from the Stroke Perilesional Cortex. J. Neurosci. 35, 8653–8661 (2015).

      (7) Silberstein, P. et al. Cortico-cortical coupling in Parkinson’s disease and its modulation by therapy. Brain 128, 1277–1291 (2005).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary 

      The authors describe a method for gastruloid formation using mouse embryonic stem cells (mESCs) to study YS and AGM-like hematopoietic differentiation. They characterise the gastruloids during nine days of differentiation using a number of techniques including flow cytometry and single-cell RNA sequencing. They compare their findings to a published data set derived from E10-11.5 mouse AGM. At d9, gastruloids were transplanted under the adrenal gland capsule of immunocompromised mice to look for the development of cells capable of engrafting the mouse bone marrow. The authors then applied the gastruloid protocol to study overexpression of Mnx1 which causes infant AML in humans.

      In the introduction, the authors define their interpretation of the different waves of hematopoiesis that occur during development. 'The subsequent wave, known as definitive, produces: first, oligopotent erythro-myeloid progenitors (EMPs) in the YS (E8-E8.5); and later myelo-lymphoid progenitors (MLPs - E9.5-E10), multipotent progenitors (MPPs - E10-E11.5), and hematopoietic stem cells (HSCs - E10.5-E11.5), in the aorta-gonad-mesonephros (AGM) region of the embryo proper.' Herein they designate the yolk sac-derived wave of EMP hematopoiesis as definitive, according to convention, although paradoxically it does not develop from intra-embryonic mesoderm or give rise to HSCs.

      Our definition of primitive and definitive waves is widely used in the field (e.g. PMID: 18204427; PMID: 28299650; PMID: 33681211). Definitive haematopoiesis, encompassing EMP, MLP, MPP and HSC, highlights their origin from haemogenic endothelium, generation of mature cells with adult characteristics from progenitors with multilineage potential and direct and indirect developmental contributions to the intra-embryonic and time-restricted generation of HSCs. 

      General comments 

      The authors make the following claims in the paper: 

      (1) The development of a protocol for hemogenic gastruloids (hGx) that recapitulates YS and AGM-like waves of blood from HE.

      (2) The protocol recapitulates both YS and EMP-MPP embryonic blood development 'with spatial and temporal accuracy'.

      (3) The protocol generates HSC precursors capable of short-term engraftment in an adrenal niche.

      (4) Overexpression of MNX1 in hGx transforms YS EMP to 'recapitulate patient transcriptional signatures'.

      (5) hGx is a model to study normal and leukaemic embryonic hematopoiesis. 

      There are major concerns with the manuscript. The statements and claims made by the authors are not supported by the data presented, data is overinterpreted, and the conclusions cannot be justified. Furthermore, the data is presented in a way that makes it difficult for the reader to follow the narrative, causing confusion. The authors have not discussed how their hGx compares to the previously published mouse embryoid body protocols used to model early development and hematopoiesis.

      Specific points

      (1) It is claimed that HGxs capture cellularity and topography of developmental blood formation. The hGx protocol described in the manuscript is a modification of a previously published gastruloid protocol (Rossi et al 2022). The rationale for the protocol modifications is not fully explained or justified. There is a lack of novelty in the presented protocol as the only modifications appear to be the inclusion of Activin A and an extension of the differentiation period from 7 to 9 days of culture. No direct comparison has been made between the two versions of gastruloid differentiation to justify the changes.

      The Reviewer paradoxically claims that the protocol is not novel and that it differs from a previous publication in at least 2 ways – the patterning pulse and the length of the protocol. Of these, the patterning pulse is key. As documented in Fig. 1S1, we cannot obtain Flk1-GFP expression in the absence of Activin A (Fig. 1S1A), and the concentration of Activin A scales activity of the Flk1 locus (Fig. 1S1B). Expression of Flk1 is a fundamental step in haemato-endothelial specification and, accordingly, we do not see CD41 or CD45+ cells in the absence of Activin A. Furthermore, these markers also titrate with the dose of Activin A (in Fig. 1S1B).

      Also, in our hands, there is a clear time-dependent progression of marker expression, with sequential acquisition of CD41 and CD45, with the latter not detectable until 192h (Fig. 1C-D), another key difference relative to the Rossi et al (2022) protocol. We suggest, and present further evidence for in this rebuttal and the revised manuscript, that the 192h-timepoint captures the onset of AGM-like haematopoiesis. We have edited the manuscript to clarify the differences and novelty in our protocol (lines 132-143) and provided a more detailed comparison with the report from Rossi et al. (2022) in the Discussion (lines 574-586).

      The inclusion of Activin A at high concentration at the beginning of differentiation would be expected to pattern endoderm rather than mesoderm. BMP signaling is required to induce Flk1+ mesoderm, even in the presence of Wnt.

      Again, we call the Reviewer’s attention to Fig. 1S1A which clearly shows that Activin A (with no BMP added) is required for induction of Flk1 expression, in the presence of Wnt. Activin A in combination with Wnt, is used in other protocols of haemato-endothelial differentiation from pluripotent cells, with no BMP added in the same step of patterning and differentiation (PMID: 39227582; PMID: 39223325). In the latter protocol, we also call the Reviewer’s attention to the fact that a higher concentration of Activin A precludes the need for BMP4 addition. Finally, one of us has recently reported that Activin A, on its own, will induce Flk1, as well as other anterior mesodermal progenitors (https://www.biorxiv.org/content/10.1101/2025.01.11.632562v1). In addressing the Reviewer’s concerns with the dose of Activin A used, we titrated its concentration against activation of Flk1, confirming optimal Flk1-GFP expression at the 100ng/ml dose used in the manuscript. We have included this data in the manuscript in Figure 1S1B.

      FACS analysis of the hGx during differentiation is needed to demonstrate the co-expression of Flk1-GFP and lineage markers such as CD34 to indicate patterning of endothelium from Flk1+ mesoderm. The FACS plots in Fig. 1 show C-Kit expression but very little VE-cadherin which suggests that CD34 is not induced. Early endoderm expresses C-Kit, CXCR4, and Epcam, but not CD34 which could account for the lack of vascular structures within the hGx as shown in Fig. 1E.

      We were surprised by the Reviewer’s comment that there are no endothelial structures in our haemogenic gastruloids. The presence of a Flk1-GFP+ network is visible in the GFP images in Fig. 1B, from 144h onwards, and is detailed in the revised Fig. 2A, which shows overlap between Flk1-GFP and the endothelial marker CD31. In addition, our single-cell RNA-seq data, included in the manuscript, confirms the presence of endothelial cells with a developing endothelial, including arterial, programme. This is now presented in the revised Fig. 3B-D of the manuscript, which updates a representation in the original manuscript. In contrast with the Reviewer’s claims that no endothelial cells are formed, the data show that Kdr (Flk1)+ cells co-express Cdh5/VE-Cadherin and indeed Cd34, attesting to the presence of an endothelial programme. Arterial markers Efnb2, Flt1, and Dll4 are present. A full-blown programme, which also includes haemogenic markers such as Sox17, Esam, Cd44 and Mecom, is clear at early (144h) and, particularly, at late (192h) timepoints in cells sorted on detection of surface C-Kit (Fig. 3B-E in the manuscript). To address the specific point by the Reviewer, we also document co-expression of Flk1-GFP, CD34 and/or CD31 by flow cytometry (Fig. 2S1A-B in the revised manuscript).

      To summarise new and revised data in the manuscript in relation to this point:

      Immunofluorescence staining showing the Flk1-GFP-defined vascular network in Figure 1E and co-expression of endothelial marker CD31 in Figure 2A. In text: lines 159-163; 178-180.

      Flow cytometry analysis of co-expression of Flk1-GFP with CD31 and CD34 in Figure 2S1A-D, including controls. In text: lines 180-187.

      Real-time quantitative (q)PCR analysis showing time-dependent expression of haemato-endothelial and arterial markers in Figure 2F (specifically Dll4 and Mecom). In text: lines 200-209.

      An improved representation of our scRNA-seq data highlighting key haemato-endothelial markers in Figure 3B-D. In text: 268-304

      (2) The protocol has been incompletely characterised, and the authors have not shown how they can distinguish between either wave of Yolk Sac (YS) hematopoiesis (primitive erythroid/macrophage and erythro-myeloid EMP) or between YS and intraembryonic Aorta-Gonad-Mesonephros (AGM) hematopoiesis. No evidence of germ layer specification has been presented to confirm gastruloid formation, organisation, and functional ability to mimic early development. Furthermore, differentiation of YS primitive and YS EMP stages of development in vitro should result in the efficient generation of CD34+ endothelial and hematopoietic cells. There is no flow cytometry analysis showing the kinetics of CD34 cell generation during differentiation. Benchmarking the hGx against developing mouse YS and embryo data sets would be an important verification. 

      The Reviewer is correct that we have not provided detailed characterisation of the different germ layers, as this was not the focus of the study. In that context, we were surprised by the earlier comment assuming co-expression of C-Kit, Cxcr4 and Epcam, which we did not show, while overlooking the endothelial programme reiterated above, which we have presented. Given our focus on haemato-endothelial specification, we have started the single-cell RNA-seq characterisation of the haemogenic gastruloid at 120h and have not looked specifically at earlier timepoints of embryo patterning. This said, we show the presence of neuroectodermal cells in cluster 9; on the other hand, cluster 7 includes hepatoblast-like cells, denoting endodermal specification (Supplementary File S2). However, in the absence of earlier timepoints and given the bias towards mesodermal specification, we expect that specification of ectodermal and endodermal programmes may be incomplete. 

      In respect of the contention regarding the capture of YS-like and AGM-like haematopoiesis, we had presented evidence in the original version of the manuscript that haemogenic cells generated during gastruloid differentiation, particularly at the late 192h and 216h timepoints, project onto highly purified C-Kit+ CD31+ Gfi1-expressing cells from mouse AGM (PMID: 38383534), providing support for at least partial recapitulation of the corresponding developmental stage. These projections are represented in Fig. 4A, right and 4S1C of the revised manuscript. In distinguishing between YS-like and AGM-like haematopoiesis, we call the Reviewer’s attention to the replotting of the single-cell RNA-seq data already in the manuscript, which we provided in response to point 1 (Fig. 3B-D and 3S2B), which highlights an increase in Sox17, but not Sox18, expression in the 192h haemogenic endothelium, which suggests an association with AGM haematopoiesis (PMID: 20228271). A significant association of Cd44 and Procr expression with the same timepoint (Fig. 3B-D in the manuscript) further supports an AGM-like endothelial-to-haematopoietic transition at the 192h timepoint. We have re-analysed the scRNA-seq data to better represent the expression of these markers in Fig. 3A-E and 3S2B. We agree that it remains challenging to identify markers exclusive to AGM haematopoiesis, which is operationally equated with generation of transplantable haematopoietic stem cells. While HSC generation is a key event characteristic of the AGM, not all AGM haematopoiesis corresponds to HSCs, an important point in evaluating the data presented in the manuscript, and one that is acknowledged by us. The main text has been edited to clarify the experiments pertaining to distinguishing AGM and YS haematopoiesis, which are detailed in lines 180-187, 200-221, 268-304, and 315-356.

      Following on the Reviewer’s comments about Cd34, we also inspected co-expression of Cd34 with Cd41 and Cd45, the latter co-expression present in, although not necessarily exclusive to, AGM haematopoiesis. Reassuringly, we observed clear co-expression with both markers (Author response image 1), in addition to a CD41+CD34- population, which likely reflects YS EMP-independent erythropoiesis. Flow cytometry analysis of co-expression of CD31 and CD34 in CD41+ and CD45+ populations at 144h and 216h timepoints has been included in Fig. 2B-D, Fig. 2S1A-D, including controls. In text: 180-187. We have earlier on in the rebuttal highlighted the fact that marker expression is responsive to the levels of Activin A used in the patterning pulse, with the 100ng/ml Activin A used in our protocol superior to 75ng/ml.

      Author response image 1.

      Association of CD34 with CD41 and CD45 expression is Activin A-responsive and supports the presence of definitive haematopoiesis. A. Flow cytometry analysis of CD34 and CD41 expression in 216h-haemogenic gastruloids; two doses of Activin A were used in the patterning pulse with CHIR99021 between 48-72h. FMO controls shown. B. Flow cytometry analysis of CD34 and CD45 at 216h in the same experimental conditions.

      Given the centrality of this point in comments by all the Reviewers, we have conducted projections of our single-cell RNA-seq data against two studies which (1) capture arterial and haemogenic specification in the para-splanchnopleura (pSP) and AGM region between E8.0 and E11 (Hou et al, PMID: 32203131), and (2) uniquely capture YS, AGM and FL progenitors and the AGM endothelial-to-haematopoietic transition (EHT) in the same scRNA-seq dataset (Zhu et al, PMID: 32392346). Focusing the analysis on the subsets of haemogenic gastruloid cells sorted as CD41+ (144h), C-Kit+ (144h and 192h) and CD45+ (192h and 216h) (now represented in Fig. 3A, and projected onto the studies in Fig. 4A), we show:

      (1) That a subset of haemato-endothelial cells from haemogenic gastruloids at 144h to 216h project onto intra-embryonic cells spanning E8.25 to E10 (revised Fig. 4A left and 4S1A). This is in agreement with our original interpretation that 216h haemogenic gastruloids are no later than the MPP/pre-HSC state of embryonic development, requiring further maturation to generate engrafting progenitors. We have nevertheless removed specific references to pre-HSC, and instead referred to HSPC/progenitors.

      (2) That haemogenic gastruloids contain YS-like (including EMP-like) and AGM-like haematopoietic cells (Fig. 4A centre and 4S1B). Significantly, some of the cells, particularly C-Kit-sorted cells with a candidate endothelial and HE-like signature, project onto AGM pre-HE and HE, as well as IAHC. Some 144h CD41+ and 192h CD45+ cells also project onto IAHC, suggesting that YS-like and AGM-like programmes arise independently and with partial time-dependent organisation in the haemogenic gastruloid model. Later, predominantly 216h cells, have characteristics of MPP/LMPP-like cells from the FL, suggesting a progenitor wave of differentiation.

      Altogether, the data support the notion that haemogenic gastruloids capture YS and AGM haematopoiesis until E10, as suggested by us in the manuscript. This re-analysis of the scRNA-seq data, which was indeed prompted by challenging and insightful comments from the Reviewers, has been incorporated in the manuscript as described above and further listed here:

      Re-clustering and highlights of specific markers in our scRNA-seq data in Figure 3A-E. In text: 268-304.

      Projections to mouse embryo datasets in Figure 4A (Figure 4S1A-C; Supplementary File 3). In text: 315-356. 

      Single-cell RNA sequencing was used to compare hGx with mouse AGM. The authors incorrectly conclude that '...specification of endothelial and HE cells in hGx follows with time-dependent developmental progression into putative AGM-like HE...' and '...HE-projected hGx cells...expressed Gata2 but not Runx1, Myb, or Gfi1b...'. Hemogenic endothelium is defined by the expression of Runx1, and Gfi1b is downstream of Runx1.

      As a hierarchy of regulation, Gata2 precedes and drives Runx1 expression at the specification of HE (PMID: 17823307; PMID: 24297996), while Runx1 drives the EHT, upstream of Gfi1b in haematopoietic clusters (PMID: 34517413). Please note that the text segment the Reviewer refers to has been removed from the manuscript, as the analysis is no longer solely focused on projection to Thambyrajah et al (2024) data, and instead gained significantly from the projections on to the Hou et al (2020) and Zhu et al (2020) studies, as detailed above.

      (3) The hGx protocol 'generates hematopoietic SC precursors capable of short-term engraftment' is not supported by the data presented. Short-term engraftment would be confirmed by flow cytometric detection of hematopoietic cells within the recipient bone marrow, spleen, thymus, and peripheral blood that expressed the BFP transgene. This analysis was not provided. PCR detection of transcripts, following an unspecified number of amplification cycles, as shown in Figure 3G (incorrectly referred to as Figure 3F in the legend) is not acceptable evidence for engraftment.

      We provide the full flow cytometry analysis of spleen engraftment in the 5 mice which received implantation of 216h-haemogenic gastruloids in the adrenal gland and were analysed at 4 weeks; an additional (control) animal received adrenal injection of PBS (Fig. 4B-D in the revised manuscript). In this experiment, the bone marrow collection was limiting, and material was prioritised for PCR (Fig. 4C and full gels in 4S2C in the revised manuscript).

      We had previously provided only representative plots of flow cytometry analysis of bone marrow and spleen, which were chosen conservatively and which we described as low-level engraftment. The analysis was meant to complement the genomic DNA PCR, where detection was present in only some of the replicates tested per animal. On this note, we confirm that PCR analysis used conventional 40 cycles; the sensitivity had already been shown in the earlier version of the manuscript and is again represented in Fig. 4S2B. We argue that the low level of cytometric and molecular engraftment at 4 weeks, from haemogenic gastruloid-derived progenitors that have not progressed beyond a stage equivalent to E10 (Fig. 4A and Supplementary File 3 in the revised manuscript from scRNA-seq projections), and that we have described as requiring additional maturation in vivo, is not surprising. Indeed, as previously shown and now repeated in Fig. 2B-E (controls in Fig. 2S1E-G) in the revised manuscript, no more than 7 CD45+CD144+ multipotent cells are present per haemogenic gastruloid. We are only able to implant 3 haemogenic gastruloids in the adrenal gland of each transplanted animal.

      We have rephrased Results and Discussion in lines 359-415 and 588-621, respectively, to rectify the nature of the engraftment, which we now attribute more generically to progenitors, also in light of the developmental time we could capture in the gastruloids prior to implantation.

      Transplanted hGx formed teratoma-like structures, with hematopoietic cells present at the site of transplant only analysed histologically. Indeed, the quality of the images provided does not provide convincing validation that donor-derived hematopoietic cells were present in the grafts.

      As stated in the text, the images are meant to illustrate that the haemogenic gastruloids developed in situ. Further analysis motivated by the Reviewers’ comments, and indeed a subsequent experiment with analysis of engraftment at a later timepoint of 8 weeks (revised Fig. 4E and 4 S2F-G), did not show a direct correspondence between engraftment and in vivo development or expansion, although this occurs in some cases. To be clearer, the observation of donor-derived blood cells in the implanted haemogenic gastruloids would not correspond to engraftment, as we have amply demonstrated that they have generated blood cells in vitro. There is no evidence that there are remaining pluripotent cells in the haemogenic gastruloid after 9 days of differentiation, and it is therefore not clear that the structures observed are teratomas. We specifically comment on this point in the revised manuscript (lines 601-607).

      There is no justification for the authors' conclusion that '... the data suggest that 216h hGx generate AGM-like pre-HSC capable of at least short-term multilineage engraftment upon maturation...'. Indeed, this statement is in conflict with previous studies demonstrating that pre-HSCs in the dorsal aorta of the mouse embryo are immature and actually incapable of engraftment.

      We have clearly stated that we do not see haematopoietic engraftment through transplantation of dissociated haemogenic gastruloids, which reach the E10 state containing pre-HSC (revised Fig 4A, 4S1A and Supplementary File 3). Instead, we observed rare myelo-erythroid (revised Fig. 4S2F-G) and myelo-lymphoid (revised Fig. 4E) engraftment upon in vivo maturation of haemogenic gastruloids with preserved 3D organisation. These statements are not contradictory. Nevertheless, we have now more cautiously attributed engraftment to the presence of progenitors as a generic designation, and not to pre-HSC (lines 412-414 and 588-592 in the revised manuscript).

      The statement '...low-level production of engrafting cells recapitulates their rarity in vivo, in agreement with the embryo-like qualities of the gastruloid system....' is incorrect. Firstly, no evidence has been provided to show the hGx has formed a dorsal aorta facsimile capable of generating cells with engrafting capacity. Secondly, although engrafting cells are rare in the AGM, approximately one per embryo, they are capable of robust and extensive engraftment upon transplantation.

      As indicated above, the statement in lines 412-414 now reads “Engraftment is erythromyeloid at 4 weeks and lympho-myeloid at 8 weeks, reflecting different classes of progenitors, putatively of YS-like and AGM-like affiliation.” To be clear, with our original statement we meant to highlight that the production of definitive AGM-like haematopoietic progenitors (not all of which are engrafting) in haemogenic gastruloids does not correspond to non-physiological single-lineage programming. We did not and do not claim that we achieved production of HSC, which would be long-term engrafting.

      (4) Expression of MNX1 transcript and protein in hematopoietic cells in MNX1-rearranged acute myeloid leukaemia (AML) is one cause of AML in infants. In the hGX model of this disease, Mnx1 is overexpressed in the mESCs that are used to form gastruloids. Mnx1 overexpression seems to confer an overall growth advantage on the hGx and increase the serial replating capacity of the small number of hematopoietic cells that are generated. The inefficiency with which the hGx model generates hematopoietic cells makes it difficult to model this disease. The poor quality of the cytospin images prevents accurate identification of cells. The statement that the kit-expressing cells represent leukemic blast cells is not sufficiently validated to support this conclusion. What other stem cell genes are expressed? Surface kit expression also marks mast cells, frequently seen in clonogenic assays of blood cells. Flow cytometric and gene expression analyses using known markers would be required.

      The haemogenic gastruloid model generates haematopoietic and haemato-endothelial cells. MNX1 expands C-Kit+ cells at 144h, which we show to have a haemato-endothelial signature (see revised Fig. 3A-E, Supplementary File 2). We have added additional flow cytometry data showing that the replating cells from MNX1 express CD31 (Figure 6S1A-B).

      Serial replating of CFC assays is a conventional in vitro assay of leukaemia transformation. Critically, colony replating is not maintained in EV control cells, attesting to the transformation potential of MNX1. Although we have not fully traced the cellular hierarchy of MNX1-driven transformation in the haemogenic gastruloid system, the in vitro replating expands a C-Kit+ cell (revised Fig. 6E), which reflects the surface phenotype of the leukaemia, also recapitulated in the mouse model initiated by MNX1-overexpressing FL cells. Importantly, it recapitulates the transcriptional profile of MNX1 leukaemia patients (revised Fig. 7C), which is uniquely expressed by MNX1 144h and replated colony cells, but not by MNX1 216h gastruloid cells, arguing against a generic signature of MNX1 overexpression (revised Fig. 7B). Importantly, the MNX1-transformation of haemogenic gastruloid cells is superior to the FL leukaemia model at capturing the unique transcriptional features of MNX1-driven leukaemia, distinct from other forms of AML in the same age group (Fig 7 S1D-F). It is possible that this corresponds to a pre-leukaemia event, and we will explore this in future studies, which are beyond the proof-of-principle nature of this paper.

      (5) In human infant MNX1 AML, the mutation is thought to arise at the fetal liver stage of development. There is no evidence that this developmental stage is mimicked in the hGx model.

      We never claim that the haemogenic gastruloid model mimics the foetal liver. We propose that susceptibility to MNX1 is at the HE-to-EMP transition. Moreover, and importantly, contrary to the Reviewer’s statement, there is no evidence in the literature that the mutation arises in the foetal liver stage, just that the mutation arises before birth (PMID: 38806630), which is different. In a mouse model of MNX1 overexpression, the authors achieve leukaemia engraftment upon MNX1 overexpression in foetal liver, but not in bone marrow cells (PMID: 37317878). This is in agreement with a vulnerability of embryonic / foetal, but not adult cells to the MNX1 expression caused by the translocation. However, haematopoietic cells in the foetal liver originate from YS and AGM precursors, so the origin of the MNX1-susceptible cells can be in those locations, rather than the foetal liver itself.

      Reviewer #2 (Public review):

      Summary: 

      In this manuscript, the authors develop an exciting new hemogenic gastruloid (hGX) system, which they claim reproduces the sequential generation of various blood cell types. The key advantage of this cellular system would be its potential to more accurately recapitulate the spatiotemporal emergence of hematopoietic progenitors within their physiological niche compared to other available in vitro systems. The authors present a large set of data and also validate their new system in the context of investigating infant leukemia. 

      Strengths: 

      The development of this new in vitro system for generating hematopoietic cells is innovative and addresses a significant drawback of current in vitro models. The authors present a substantial dataset to characterize this system, and they also validate its application in the context of investigating infant leukemia. 

      Weaknesses: 

      The thorough characterization and full demonstration that the cells produced truly represent distinct waves of hematopoietic progenitors are incomplete. The data presented to support the generation of late yolk sac (YS) progenitors, such as lymphoid cells, and aortic-gonad-mesonephros (AGM)-like progenitors, including pre-hematopoietic stem cells (pre-HSCs), by this system are not entirely convincing. Given that this is likely the manuscript's most crucial claim, it warrants further scrutiny and direct experimental validation. Ideally, the identity of these progenitors should be further demonstrated by directly assessing their ability to differentiate into lymphoid cells or fully functional HSCs. Instead, the authors primarily rely on scRNA-seq data and a very limited set of markers (e.g., Ikzf1 and Mllt3) to infer the identity and functionality of these cells. Many of these markers are shared among various types of blood progenitors, and only a well-defined combination of markers could offer some assurance of the lymphoid and pre-HSC nature of these cells, although this would still be limited in the absence of functional assays.

      The identification of a pre-HSC-like CD45⁺CD41⁻/lo C-Kit⁺VE-Cadherin⁺ cell population is presented as evidence supporting the generation of pre-HSCs by this system, but this claim is questionable. This FACS profile may also be present in progenitors generated in the yolk sac such as early erythromyeloid progenitors (EMPs). It is only within the AGM context, and in conjunction with further functional assays demonstrating the ability of these cells to differentiate into HSCs and contribute to long-term repopulation, that this profile could be strongly associated with pre-HSCs. In the absence of such data, the cells exhibiting this profile in the current system cannot be conclusively identified as true pre-HSCs.

      We present 2 additional pieces of evidence to support our claims that we capture YS and AGM stages of haematopoietic development.

      (I) In the new Figures 4A and 4 S1A-C and Supplementary File 3 in the revised manuscript, we project our single-cell RNA-seq data onto (1) developing intra-embryonic pSP and AGM between E8 and E11 (Fig. 4A left, 4S1A) and (2) a single-cell RNA-seq study of HE development which combines haemogenic and haematopoietic cells from the YS, the developing HE and IAHC in the AGM, and FL (Fig. 4A centre, 4S1B). Our data map to E8.25-E10, and capture YS EMP and erythroid and myeloid progenitors, as well as AGM pre-HE, HE and IAHC, with some cells matching HSPC and LMPP, as suggested by the projection onto the Thambyrajah et al data set (already presented in the previous version of the manuscript, and now in Fig. 4A right and 4 S1C). The projection of the scRNA-seq data is presented in lines 314-355 of the revised manuscript. The scRNA-seq data itself was refocused on haemato-endothelial programmes as presented in the revised Fig. 3A-E, described in lines 267-303.

      (II) Given the difficulty in finding markers that specifically associate with AGM haematopoiesis, we inspected the possibility of capturing different regulatory requirements at different stages of gastruloid development mirroring differential effects in the embryo. Polycomb EZH2 is specifically required for EMP differentiation in the YS, but does not affect AGM-derived haematopoiesis; it is also not required for primitive erythroid cells (PMID: 29555646; PMID: 34857757). We treated haemogenic gastruloids from 120h onwards with either DMSO (0.05%) or GSK126 (0.5 µM), and inspected the cellularity of gastruloids at 144h, which we equate with YS-EMP, and 216h, putatively AGM haematopoiesis. We show that EZH2 inhibition / GSK126 treatment specifically reduces %CD41+ cells at 144h, but does not reduce %CD41+ or %CD45+ cells at 216h. We have included this experiment in the manuscript in Fig. 2 S2B-C (in text: 209-221).

      These data, together with the scRNA-seq projections described, provide evidence for our claim that 144h haemogenic gastruloids capture YS EMPs, while CD41+ and CD45+ cells isolated at 216h reflect AGM progenitors. We cannot draw conclusions about the functional nature of the AGM cells from this experiment. The main text has been edited to clarify the experiments pertaining to distinguishing AGM and YS haematopoiesis (lines 180-187; 200-221; 268-304; 315-356).

      The engraftment data presented are also not fully convincing, as the observed repopulation is very limited and evaluated only at 4 weeks post-transplantation. The cells detected after 4 weeks could represent the progeny of EMPs that have been shown to provide transient repopulation rather than true HSCs. 

      In the original version of the manuscript, we stated that there is low-level engraftment and did not claim to have generated HSC. Instead, we described cells with short-term engraftment potential. We agree with the Reviewer that the cells we show in the manuscript at 4 weeks could be EMPs (revised Fig. 4B-E and 4 S2D-G). Additionally, we now have 8-week analysis of implant recipients, in which we observed multi-lineage engraftment of the recipient bone marrow, again at low level, in 1:3 recipients (revised Fig. 4B-E and 4S2F-H). This engraftment is myeloid-lymphoid and therefore likely to have originated in a later progenitor. To be clear, we do not claim that this corresponds to the presence of HSC. It nevertheless supports the maturation of progenitors with engraftment potential. Limited material was prioritised for flow cytometry staining, which did not allow PCR analysis. We rephrased Results and Discussion in lines 359-414 and 588-621, respectively, to rectify the nature of the engraftment.

      Reviewer #3 (Public review):  

      In this study, the authors employ a mouse ES-derived "hemogenic gastruloid" model which they generated and which they claim to be able to deconvolute YS and AGM stages of blood production in vitro. This work could represent a valuable resource for the field. However, in general, I find the conclusions in this manuscript poorly supported by the data presented. Importantly, it isn't clear what exactly are the "YS" and the "AGM"-like stages identified in the culture and where is the data that backs up this claim. In my opinion, the data in this manuscript lack convincing evidence that can enable us to identify what kind of hematopoietic progenitor cells are generated in this system. Therefore, the statement that "our study has positioned the MNX1-OE target cell within the YS-EMP stage (line 540)" is not supported by the evidence presented in this study. Overall, the system seems to be very preliminary and requires further optimization before those claims can be made.

      Specific comments below: 

      (1) The flow cytometric analysis of gastruloids presented in Figure 1 C-D is puzzling. There is a large % of C-Kit+ cells generated, but few VE-Cad+ Kit+ double positive cells. Similarly, there are many CD41+ cells, but very few CD45+ cells, which one would expect to appear toward the end of the differentiation process if blood cells are actually generated. It would be useful to present this analysis as consecutive gating (i.e. evaluating CD41 and CD45 within VE-Cad+ Kit+ cells, especially if the authors think that the presence of VE-Cad+ Kit+ cells is suggestive of EHT). The quantification presented in D is misleading as the scale of each graph is different.

      Fig. 1C-D provide an overview of haemogenic markers during the timecourse of haemogenic gastruloid differentiation, and do indeed show a late up-regulation of CD45, as the Reviewer points out would be expected. The percentage of CD45+ cells is indeed low. However, we should point out that the haemogenic gastruloid protocol, although biased towards mesodermal outputs, does not aim to achieve pure haematopoietic specification, but rather to place it in its embryo-like context. We refute that the scale is misleading: it is necessary to represent the data in a way that is interpretable by the reader, and we made sure from the outset that the gates (in C) are truly representative and annotated, as are the plot axes (in D). Consecutive gating at the 216h-timepoint is shown and quantified in Fig. 2S1D-F, or in the alternative consecutive gating suggested by the Reviewer, in Author response image 2 below. At the request of Reviewer 1, we also analysed CD31 and CD34 within CD41 and CD45 populations, again as validation of the emergent haematopoietic character of the cells obtained. This new analysis is shown in revised Fig. 2B, quantified in 2C.

      Author response image 2.

      Flow cytometry analysis of VE-cadherin+ cells in haemogenic gastruloids at 216h of the differentiation protocol, probing co-expression of CD45, CD41 and C-Kit.

      (2) The imaging presented in Figure 1E is very unconvincing. C-Kit and CD45 signals appear as speckles and not as membrane/cell surfaces as they should. This experiment should be repeated and nuclear stain (i.e. DAPI) should be included.

      We included the requested immunofluorescence staining in Figure 1E (216h). We also show the earlier timepoint of 192h here as Author response image 3. In text: lines 158-162.

      Author response image 3.

      Confocal images of haematopoietic production in haemogenic gastruloids. Wholemount, cleared haemogenic gastruloids were stained for CD45 (pseudo-coloured red) and C-Kit antigens (pseudo-coloured yellow) with indirect staining, as described in the manuscript. Flk1-GFP signal is shown in green. Nuclei are contrasted with DAPI. (A) 192h. (B) 216h.

      (3) Overall, I am not convinced that hematopoietic cells are consistently generated in these organoids. The authors should sort hematopoietic cells and perform May-Grunwald Giemsa stainings as they did in Figure 6 to confirm the nature of the blood cells generated.

      It is factual that the data are reproducible and complemented by functional assays shown in revised Fig. 2D-E, which clearly demonstrate haematopoietic output. The single-cell RNA-seq data also show expression of a haematopoietic programme, which we have complemented with biologically independent qRT-PCR analysis of the expression of key endothelial and haematopoietic marker and regulatory genes (revised Fig. 2F; in text: 200-209). As requested, we include Giemsa-Wright’s stained cytospins obtained at 216h to illustrate haematopoietic output. These are shown in revised Fig. 2S2A, in text: lines 194-199. Inevitably, the cytospins will be inconclusive as to the presence of endothelial-to-haematopoietic transition or the generation of haematopoietic stem/progenitor cells, as these cells do not have a distinctive morphology.

      (4) The scRNAseq in Figure 2 is very difficult to interpret. Specific points related to this: - Cluster annotation in Figure 2a is missing and should be included. 

      Why do the heatmaps show the expression of genes within sorted cells? Couldn't the authors show expression within clusters of hematopoietic cells as identified transcriptionally (which ones are they? See previous point)? Gene names are illegible.

      I see no expression of Hlf or Myb in CD45+ cells (Figure 2G). Hlf is not expressed by any of the populations examined (panels E, F, G). This suggests no MPP or pre-HSC are generated in the culture, contrary to what is stated in lines 242-245. (PMID 31076455 and 34589491). Later on, it is again stated that "hGx cells... lacked detection of HSC genes like Hlf, Gfi1, or Hoxa9" (lines 281-283). To me, this is proof of the absence of AGM-like hematopoiesis generated in those gastruloids.

      For a combination of logistic and technical reasons, we performed single-cell RNA-seq using the Smart-Seq2 platform, which is inherently low throughput. We overcame the issue of cell coverage by complementing whole-gastruloid transcriptional profiling at successive time-points with sorting of subpopulations of cells based on individual markers documented in Fig. 1. We clearly stated which platform was used as well as the number and type of cells profiled (Fig. 3S1 and lines 226-241 of the revised manuscript), and our approach is standard. Following suggestions of the Reviewers to further focus our analysis on the haemogenic cellular differentiation within the gastruloids, we revised the presentation of the scRNA-seq data to now provide UMAP projections with representation and quantification of individual genes, including the ones queried by the Reviewer in Fig. 3 and respective supplements. Specifically, re-clustering and highlighting of specific markers are shown in Figure 3A-D and presented in lines 267-303 of the revised manuscript. Complementary independent real-time quantitative (q)PCR analysis showing time-dependent expression of endothelial and haematopoietic markers is now in Figure 2F. In text: 200-208.

      (5) Mapping of scRNA-Seq data onto the dataset by Thambyrajah et al. is not proof of the generation of AGM HE. The dataset they are mapping to only contains AGM cells, therefore cells do not have the option to map onto something that is not AGM. The authors should try mapping to other publicly available datasets also including YS cells.

      We have done this and the data are presented in Figure 4A (Figure 4S1A) and Supplementary File. In text: 314-355. As detailed in response to Reviewer 1, we have conducted projections of our single-cell RNA-seq data against two studies which (1) capture arterial and haemogenic specification in the para-splanchnopleura (pSP) and AGM region between E8.0 and E11 (Hou et al, PMID: 32203131) (revised Fig. 4A and 4 S1A), and (2) uniquely capture YS, AGM and FL progenitors and the AGM endothelial-to-haematopoietic transition (EHT) in the same scRNA-seq dataset (Zhu et al, PMID: 32392346) (revised Fig. 4A and 4 S1B). Specifically in answering the Reviewers’ point, we show that different subsets of haemogenic gastruloid cells sorted on haemogenic surface markers C-Kit, CD41 and CD45 cluster onto pre-HE and HE, intra-aortic clusters and FL progenitor compartments, and to YS EMP and erythroid and myeloid progenitors. This lends support to our claim that the haemogenic gastruloid system specifies both YS-like and AGM-like cells. Please note that we now do point out that some CD41+ cells at 144h project onto IAC, as do cells at the later timepoints, suggesting that AGM-like and YS-EMP-like waves may overlap at the 144h timepoint (lines…). In the future, we will address the specific location of these cells, but that requires a large-scale spatial transcriptomics analysis with extensive optimisation for section capture, which is beyond the scope of this manuscript and this revision.

      (6) Conclusions in Figure 3, named "hGx specify cells with preHSC characteristics" are not supported by the data presented here. Again, I am not convinced that hematopoietic cells can be efficiently generated in this system, and certainly not HSCs or pre-HSCs.

      We have provided evidence in the original manuscript, and now through additional experiments, that there is haematopoietic specification, including of progenitor cells, in the haemogenic gastruloid system. Molecular markers are shown in revised Fig. 2F and Fig. 3 and supplements; CFC assays are shown in revised Fig. 2D-E; cytospins are in revised Fig. 2 S2A; further analysis of 4-week implants and new analysis of 8-week implants (discussed below) are in revised Fig. 4 B-D and Fig. 4 S2 and we discussed the new scRNA-seq projections above. Importantly, we have never claimed, and again do not, that haemogenic gastruloids generate HSC. We accept the Reviewer’s comment that we have not provided sufficient evidence for the specification of pre-HSC-like cells and accordingly now refer more generically and conservatively to progenitors.

      FACS analysis in 3A is again very unconvincing. I do not think the population identified as C-Kit+ CD144+ is real. Also, why not try gating the other way around, as commonly done (e.g. VE-Cad+ Kit+ and then CD41/CD45)?

      Our gating strategy is not unconventional: gating was done from a more populated gate onto the less abundant one to ensure that the results are numerically more robust. In the case of haemogenic gastruloids, unlike the AGM preparations the Reviewer may be referring to, CD41+ and CD45+ cells are more abundant as there is no circulation of more differentiated haematopoietic cells away from the endothelial structures. This said, we did perform the gating as suggested (Rev Fig. 2), indeed confirming that most VE-cad+ Kit+ cells are CD45+. Interestingly, VE-cad+Kit- cells are predominantly CD41+, reinforcing the haematopoietic nature of these cells.

      The authors must have tried really hard, but the lack of short- or long-term engraftment in a number of immunodeficient mouse models (lines 305-313) really suggests that no blood progenitors are generated in their system. I am not familiar with the adrenal gland transplant system, but it seems like a very non-physiological system for trying to assess the maturation of putative pre-HSCs. The data supporting the engraftment of these mice, essentially seen only by PCR and in some cases with a very low threshold for detection, are very weak, and again unconvincing. It is stated that "BFP engraftment of the Spl and BM by flow cytometry was very low level albeit consistently above control (Fig. S4E)" (lines 337-338). I do not think that two dots in a dot plot can be presented as evidence of engraftment.

      We have presented the data with full disclosure and do not deny that the engraftment achieved is low-level and short-term, indicating incomplete maturation of definitive haematopoietic progenitors in the current haemogenic gastruloid system. Indeed, not wanting to overstate the finding, we were deliberately conservative in our representative flow cytometry plots and focused on the PCR for sensitivity. We now present the full flow cytometry analysis for spleen, where we preserved more cells after the genomic DNA extraction (revised Fig. 4C), and call the Reviewer’s attention to the fact that detection of BFP+ cells by PCR and flow cytometry in the recipient animals is consistent between the 2 methods (revised Fig. 4C and D; full gels previously presented now in Fig. 4S2C; sensitivity analysis was also previously available and is now in Fig. 4S2B). In addition, we have now also been able to detect low-level myelo-lymphoid engraftment in the bone marrow and spleen 8 weeks after adrenal implantation, again suggesting the presence of a small number of definitive haematopoietic progenitors that potentially mature from the 3 haemogenic gastruloids implanted (Fig. 4E and 4 S2F-G in the revised manuscript). We rephrased Results and Discussion at lines 359-414 and 589-621, respectively, to rectify the nature of the engraftment, which we attribute to progenitors.

      (7) Given the above, I find that the foundations needed for extracting meaningful data from the system when perturbed are very shaky at best. Nevertheless, the authors proceed to overexpress MNX1 by LV transduction, a system previously shown to transform fetal liver cells, mimicking the effect of the t(7;12) AML-associated translocation. Comments on this section:

      The increase in the size of the organoid when MNX1 is expressed is a very unspecific finding and not necessarily an indication of any hematopoietic effect of MNX1 OE.

      We agree with the Reviewer on this point; it is nevertheless a reproducible observation which we thought relevant to describe for completeness and data reproducibility.

      The mild increase of cKit+ cells (Figure 4E) at the 144hr timepoint and the lack of any changes in CD41+ or CD45+ cells suggests that the increase in Kit+ cells % is not due to any hematopoietic effect of MNX1 OE. No hematopoietic GO categories are seen in RNA seq analysis, which supports this interpretation. Could it be that just endothelial cells are being generated?

      The Reviewer is correct that the MNX1-overexpressing cells have a strong endothelial signature, which is present in patients (revised Fig. 5A). We investigated a potential link with C-Kit by staining cells from the replating colonies during the process of in vitro transformation with CD31. We observed that 40-50% of C-Kit+ cells (20-30% total colony cells) co-expressed CD31, at least at early plating. These cells co-exist with haematopoietic cells, namely Ter119+ cells, as expected from the YS-like erythroid and EMP-like affiliation of haematopoietic output from 144h-haemogenic gastruloids. These data are included in Fig. 6S1A-B (in text 506-507) of the revised manuscript.

      (8) There seems to be a relatively convincing increase in replating potential upon MNX1-OE, but this experiment has been poorly characterized. What type of colonies are generated? What exactly is the "proportion of colony forming cells" in Figures 5B-D? The colony increase is accompanied by an increase in Kit+ cells; however, the flow cytometry analysis has not been quantified.

      Given the inability to replate control EV cells, there is not a population to compare with in terms of quantification. The level of C-Kit+ represented in Fig. 6E of the revised manuscript is achieved at plate 2 or 3 (depending on the experiment), both of which are significantly enriched for colony-forming cells relative to control (revised Fig. 6B, D).  

      (9) Do hGx cells engraft upon MNX1-OE? This experiment, which appears not to have been performed, is essential to conclude that leukemic transformation has occurred.

      For the purpose of this study, we are satisfied with confirmation of in vitro transformation potential of MNX1 haemogenic gastruloids, which can be used for screening purposes. Although interesting, in vivo leukaemia engraftment from haemogenic gastruloids is beyond the scope of this study.

      Reviewer #2 (Recommendations for the authors):

      (1) Minor comments

      (a) I find the denomination "hGx" very confusing as it would suggest that these gastruloids are human, whereas, in fact, they are murine.

      We agree with the Reviewer that the nomenclature is confusing and have edited the manuscript to use “haemGx” instead.

      (b) I find the presence of mast cells in CFC of MNX1-OE cultures very puzzling as this does not bear any resemblance to human leukemia.

      We detect an enrichment of mast cell transcriptional programmes, as defined by the cell type repositories. While mast cells do not represent leukaemic cells in patients, this ontology is likely to reflect the developmental stage and origin of progenitors which are affected by MNX1.

      (2) I have a few suggestions to improve figures and tables clarity, to help readers better follow the data presented.

      (a) To enhance readability, it would be beneficial to highlight the genes mentioned in the text within the scRNA-seq figures. Many figures currently display over 30-40 genes in small font sizes, making it difficult to quickly locate specific genes discussed in the text. Additionally, implementing a color-coding system to categorize these genes according to their proposed lineages would improve clarity and organization.

      We have now performed major re-organisation and re-analyses of the scRNA-seq data, which we believe has improved the readability and clarity of the corresponding sections of the manuscript.

      (b) The data presented in Supplementary Table 1, along with other supplementary tables, are challenging to interpret due to insufficient annotations. Enhancing these tables with clearer and more detailed annotations would significantly improve clarity and aid readers in understanding the supplementary materials.

      Descriptive text has been added to accompany each Supplementary File to aid in understanding the results reported therein.

      Reviewer #3 (Recommendations for the authors):

      In addition to what was written in the public review, I would suggest the authors simplify and shorten the text. Currently, a lot of unnecessary detail is included which makes the story very hard to follow. Moreover, the authors should modify the figures to make them more comprehensible, especially for RNA-seq data.

      We have significantly re-arranged and shortened parts of the manuscript, particularly by focusing the Discussion. Results presentation has also been improved through additional analysis and graphic representation of the scRNA-seq data, which we believe has improved the readability and clarity.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The manuscript aims to elucidate the impact of a prophage within the genome of Shewanella fidelis on its interaction with the marine tunicate Ciona robusta. The authors made a deletion mutant of S. fidelis that lacks one of its two prophages. This mutant exhibited an enhanced biofilm phenotype, as assessed through crystal violet staining, and showed reduced motility. The authors examined the effect of prophage deletion on several genes that could modulate cyclic-diGMP levels. While no significant changes were observed under in vitro conditions, the gene for one protein potentially involved in cyclic-diGMP hydrolysis was overexpressed during microbe-host interactions. The mutant was retained more effectively within a one-hour timeframe, whereas the wild-type (WT) strain became more abundant after 24 hours. Fluorescence microscopy was used to visualize the localization patterns of the two strains, which appeared to differ. Additionally, a significant difference in the expression of one immune protein was noted after one hour, but this difference was not evident after 24 hours. An effect of VCBP-C addition on the expression of one prophage gene was also observed.

      Strengths:

      I appreciate how the authors integrate diverse expertise and methods to address questions regarding the impact of prophages on gut microbiome-host interactions. The chosen model system is appropriate, as it allows for high-throughput experimentation and the application of simple imaging techniques.

      Weaknesses:

      My primary concern is that the manuscript primarily describes observations without providing insight into the molecular mechanisms underlying the observed differences. It is particularly unclear how the presence of the prophage leads to the phenotypic changes related to bacterial physiology and host-microbe interactions.

      We appreciate the reviewer's enthusiastic overall feedback. The current manuscript presents experimental evidence of the biological impact of deleting a stably integrated prophage in the genome of Shewanella fidelis 3313. The molecular mechanisms responsible for these biological effects are currently unknown, but based on the limited genetic insight available for some predicted gene regions, we can speculate on prophage-mediated influences on swimming behaviors. Below, we address additional concerns raised by the reviewer.

      Which specific prophage genes are critical, or is the insertion at a specific site in the bacterial genome the key factor?  While significant effects on bacterial physiology are reported under in vitro conditions, there is no clear attribution to particular enzymes or proteins.

      In this particular case, it is not entirely clear, as most ORFs within the prophage region have unknown functions, i.e., predicted as hypothetical proteins. In addition, the original insertion site does not appear to interrupt any specific gene but may impact adjacent genes/pathways (Fig 1b). Enhanced annotations, along with future targeted deletion methods for distinct prophage segments, will help us better investigate which predicted gene regions influence the observed traits. This will deepen our understanding of the mechanisms that regulate prophage influence on these traits.

      In contrast, when the system is expanded to include the tunicate, differences in the expression of a cyclic-diGMP hydrolase become apparent. Why do we not observe such differences under in vitro conditions, despite noting variations in biofilm formation and motility? Furthermore, given that the bacterial strain possesses two prophages, I am curious as to why the authors chose to target only one and not both.

      Differences in expression patterns of c-di-GMP regulators were also noted in vitro, but they just missed the statistical significance threshold when rho was used as a bacterial reference gene. The expression pattern of pdeB was consistent among each biological replicate, however. In full transparency, pdeB qPCR was originally performed with recA as a reference standard (bioRxiv preprint, ver 1). Here, significant changes in pdeB expression were observed in the in vitro assays comparing WT and ΔSfPat. These results prompted us to study changes in pdeB expression during in vivo colonization experiments, which also revealed significant changes. However, there was a concern that a potential SOS response would also activate recA, despite our preliminary data suggesting SOS was not involved. As a precaution, we repeated the experiments with rho as a reference gene after it was identified as a stable reference. With rho as a reference gene, statistically significant responses were noted during in vivo colonization, but not in the in vitro assays.

      In the current manuscript, one prophage was targeted based on preliminary findings indicating that the SfPat prophage region influences behaviors likely to impact colonization of the Ciona robusta gut. A separate genetic segment was also previously targeted for deletion as a misidentified prophage-like region, but that strain is not included in the current description. The currently presented data indicate that the observed phenomena can be attributed to the SfPat prophage.

      Regarding the microbe-host interaction, it is not clear why the increased retention ability of the prophage deletion strain did not lead to greater cell retention after 24 hours, especially since no differences in the immune response were observed at that time point.

      A predominantly adherent (non-motile) phenotype would likely facilitate elimination within fecal strings. There is substantial evidence from multiple model systems that strong swimming ability enhances the exploration and colonization of mucosal surfaces. Swimming helps with the penetration of mucus layers, chemotaxis toward epithelial surfaces, and overall “decision-making” in terms of shifting from a free-swimming (planktonic) state in the lumen within dietary material to a more sessile, adherent phenotype at the mucosal surface.

      Concerning the methodological approach, I am puzzled as to why the authors opted for qPCR instead of transcriptomics or proteomics. The latter approaches could have provided a broader understanding of the prophage's impact on both the microbe and the host.

      We agree with the reviewer that a transcriptomics approach would provide a broader understanding of the prophage’s impact on the microbe and animal host. Future studies will include a full multi-omic evaluation of this interaction. 

      Reviewer #1 (Recommendations for the authors):

      Besides my above-mentioned issues, I have a few more minor points:

      (A) What makes S. fidelis a persistent member of the host microbiome? Please elaborate more on quantitative studies in this respect.

      Shewanella species are stable members of the Ciona gut, and previous efforts (Dishaw et al, 2016) revealed that chitin and/or secreted host effectors could influence biofilm formation. The Ciona gut produces copious amounts of endogenous chitin-rich mucus, and a variety of bacteria have been identified that thrive under these conditions. In addition, versatile bacteria like Shewanella sp. likely expand the metabolic potential of filter-feeders like Ciona. Thus, our subsequent studies began to focus on these and other microbes isolated from the Ciona gut that appear to be stable residents. Identical strains have been recovered numerous times (since 2011) from this wild population of Ciona robusta.  

      (B) The authors use the word inter kingdom and refer to phage, bacterium and animal. As phages are not part of the three kingdoms of life I believe the terminology is wrong.

      Thank you for bringing this to our attention. In this context, we were referring to bacteria+phage as a unit and their interkingdom interaction with the animal host. But we recognize that this term can be misleading. Another, more appropriate term is ‘tripartite,’ and we have changed interkingdom to tripartite as appropriate, e.g., the abstract.

      (C) I like lines 55-61 and was expecting to see in the manuscript what of those things would be true for the chosen prophage.

      We looked at the coding region annotations within the prophage and the adjacent regions. The prophage coding regions are mostly annotated as unknown or predicted proteins, and a few as known phage-related components. We intend to reanalyze these regions as improved annotations become available and to conduct deletion experiments targeting specific open reading frames (ORFs).

      (D) In line 76 the authors mention a Gödecke reference for Pseudomonas. I believe that this paper only deals with S. oneidensis.

      The inadvertent Gödecke reference has been removed.

      (E) All figures: The captions are too short to understand what the figures are showing and everything is too small and hard to read or see. Along these lines it is often unclear what the many datapoints show. Biological replicates, technical replicates....Overall figure 1 does not seem to contain much information.

      Figures and captions have been improved as suggested. Thank you for bringing this to our attention.

      (F) Figure 3 what are a and b showing?

      Figure and descriptive legend have been improved.

      (G) Figure 4: Why did the author check expression only for one gene after 1 h but several genes after 24 h?

      Since we observed that VCBP-C alters biofilms of S. fidelis 3313 in vitro (Dishaw et al 2016), we hypothesized that the bacteria may alter host VCBP-C expression and that the influence of integrated prophages may further modulate gene expression. Since VCBP-C is endogenously expressed in the gut of Ciona, we expected that early exposure/colonization (one hour) would be crucial for the bacterial-VCBP interactions. Hence, VCBP-C was our primary target. We then tested multiple immune response genes at 24 hours to get a more detailed understanding of the maturing immune responses. Future studies will expand our efforts using global transcriptomics to better understand the immune response during bacterial exposure and colonization events.

      (H) Do the authors mean stationary or localised?

      We are not sure about the context of the reviewer’s question here but we think our modifications have addressed these concerns. 

      Reviewer #2 (Public review):

      Summary:

      In the manuscript, "Prophage regulation of Shewanella fidelis 3313 motility and biofilm formation: implications for gut colonization dynamics in Ciona robusta", the authors are experimentally investigating the idea that integrated viruses (prophages) within a bacterial colonizer of the host Ciona robusta affect both the colonizer and the host. They found a prophage within the Ciona robusta colonizing bacterium Shewanella fidelis 3313, which affected both the bacteria and host. This prophage does so by regulating the phosphodiesterase gene pdeB in the bacterium when the bacterium has colonized the host. The prophage also regulates the activity of the host immune gene VCBP-C during early bacterial colonization. Prophage effects on both these genes affect the precise localization of the colonizing bacterium, motility of the bacterium, and bacterial biofilm formation on the host. Interestingly, VCBP-C expression also suppressed a prophage structural protein, creating a tripartite feedback loop in this symbiosis. This is exciting research that adds to the emerging body of evidence that prophages can have beneficial effects not only on their host bacteria but also on how that bacteria interacts in its environment. This study establishes the evolutionary conservation of this concept with intriguing implications of prophage effects on tripartite interactions.

      Strengths:

      This research effectively shows that a prophage within a bacterium colonizing a model ascidian affects both the bacterium and the host in vivo. These data establish the prophage effects on bacterial activity and expand these effects to the natural interactions within the host animal. The effects of the prophage through deletion on a suite of host genes are a strength, as shown by striking microscopy.

      Weaknesses:

      Unfortunately, there are abundant negative data that cast some limitations on the interpretation of the data. That is, examining specific gene expression has its limitations, which could be avoided by global transcriptomics of the bacteria and the host during colonization by the prophage-containing and prophage-deleted bacteria (1 hour and 24 hours). In this way, the tripartite interactions leading to mechanism could be better established.

      We thank the reviewer for their comments and recognize this important limitation. As a follow-up to the current study, we plan to perform more comprehensive global meta-transcriptomics analyses to better understand differentially expressed genes across both the host and microbe during colonization.

      Impact:

      The authors are correct to speculate that this research can have a significant impact on many animal microbiome studies, since bacterial lysogens are prevalent in most microbiomes. Screening for prophages, determining whether they are active, and "curing" the host bacteria of active prophages are effective tools for understanding the effects these mobile elements have on microbiomes. There are many potential effects of these elements in vivo, both positive and negative; this study is a good example of why they should be explored.

      Context:

      The research area of prophage effects on host bacteria in vitro has been studied for decades, while these interactions in combination with animal hosts in vivo have been examined only recently. The significance of this research is that it shows there could be divergent effects based on whether the study is conducted in vitro or in vivo. The in vivo results were striking. This is particularly so with the microscopy images. The benefit of using Ciona is that it has a translucent body which allows for following microbial localization. This is in contrast to mammalian studies where following microbial localization would either be difficult or near impossible.

      Reviewer #2 (Recommendations for the authors):

      In general, I found that the research shown in this manuscript is solid, and the manuscript is well-written. I have no specific comments about the writing of the manuscript that would be of benefit.

      Figure 1 would benefit from the shrinking of white space between panels a and b. Also, in panel b, it is very difficult to read the x-axis, the number of basepairs. It is suggested to increase this font size.

      Figure 1 has been improved as suggested.

      Figure 2 is fine, however, what do three asterisks (***) in panel a signify? It is not described in the legend. One minor point that affects data understanding as presented, the wildtype (WT) change in expression is normalized to itself, therefore always equaling 1.0. This method of presentation muddies the variation in gene expression in the presence of the prophage. This is not an issue in Figure 2, but does have an effect on understanding Figure 2 - figure supplement 1.

      Figure 2 - figure supplement 1, as stated above, the normalization of the WT change in gene expression to 1.0 makes it difficult to understand the results. Why is pilZ change in gene expression not significant in panel s1a? It seems the median change is 50%, or whatever averaging is done, it's unclear whether this is the median and whether the error bars are standard deviation or some other metric.

      These should be defined in the statistical analysis section of the methods or in the legend itself. Further, in panel s1b, why is the reduction in gene expression of pdeB statistically significant, while a similar reduction in gene expression of pleD is not statistically significant?

      RQ values were calculated from 2<sup>-ddCt</sup>. The error bars in the figures were calculated by adding or subtracting the standard error from RQ. Since WT was used as a reference value for qPCR, the RQ value was normalized as 1 for all replicates and nonparametric tests were used to calculate the statistical significance. The values for pilZ were very close to significant; a value of 0.063 was derived via the Wilcoxon test. Only the changes in expression of pdeB were determined to be statistically significant, via the Wilcoxon test.
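
      The calculation described above can be sketched as follows. This is a minimal illustration only, not the analysis script used for the manuscript: the function names and all Ct values are hypothetical, and it assumes the standard ΔΔCt scheme in which the target gene is first normalized to a reference gene (e.g., rho) and then to the WT sample.

```python
import math
import statistics


def relative_quantity(ct_target_sample, ct_ref_sample, ct_target_wt, ct_ref_wt):
    """Return RQ = 2^-ddCt for one biological replicate.

    dCt  = Ct(target gene) - Ct(reference gene), computed per strain
    ddCt = dCt(sample) - dCt(WT)
    """
    d_ct_sample = ct_target_sample - ct_ref_sample
    d_ct_wt = ct_target_wt - ct_ref_wt
    dd_ct = d_ct_sample - d_ct_wt
    return 2.0 ** (-dd_ct)


def summarize(rq_values):
    """Return (mean RQ, mean - SE, mean + SE), i.e. error bars of +/- one standard error."""
    mean_rq = statistics.mean(rq_values)
    standard_error = statistics.stdev(rq_values) / math.sqrt(len(rq_values))
    return mean_rq, mean_rq - standard_error, mean_rq + standard_error


# Hypothetical Ct values, invented for illustration only:
# (target in mutant, reference in mutant, target in WT, reference in WT)
hypothetical_replicates = [
    (25.0, 18.0, 24.0, 18.0),
    (26.0, 18.0, 24.0, 18.0),
    (24.0, 18.0, 24.0, 18.0),
]

mutant_rq = [relative_quantity(*rep) for rep in hypothetical_replicates]
mean_rq, lower_bar, upper_bar = summarize(mutant_rq)

# WT compared against itself has ddCt = 0, so its RQ is always 1.
wt_rq = relative_quantity(24.0, 18.0, 24.0, 18.0)
```

      Because the WT value is fixed at 1 by construction, all replicate-to-replicate variation appears in the mutant RQ values, which is why a nonparametric test on the mutant values is a natural choice under this scheme.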

      Figure 3 panels a and b would be helped by having the same y-axis for each. It is impressive the amount of WT bacterial colonization takes place in 24 hours, particularly in the absence of the prophage, but it does not appear as impressive when the axes are changed between panels. Similar axes should be considered for every comparative graph.

      Figure 3 - figure supplement 1 legend would benefit from the same description of the animal's digestive locations as in the legend in Figure 3.

      We appreciate these suggestions and have made these changes accordingly. We have remade and combined Figure 3a and 3b.

      Figure 4, while it is unfortunate that none of the immune genes evaluated had a response to the deletion of the SfPat prophage in S. fidelis 3313 at 24 hours, did any of these genes have an effect at 1 hour of evaluation as VCBP-C did?

      The expression of this expanded gene set was not evaluated at one hour. This time point will, however, be included in our global evaluation of gene expression in our future transcriptome sequencing effort.

      Figure 5, the only question I have with these data is whether or not there is a dose-dependent effect of VCBP-C on SfPat P5 expression?

      Prior studies have found VCBP-C can impact biofilm formation in Shewanella sp. in a dose-dependent manner (some of the data appears in Dishaw et al, 2016). However, we have not yet considered whether VCBP-C impacts the expression of SfPat P5 (a phage capsid component) in a dose-dependent manner. We will consider this in future experimental designs.

      It is mentioned in the introduction (and data shown in the preprint) that there is more than one active prophage in Shewanella fidelis 3313. The preprint data shows that the Mu prophages had little effect on the studies. It may be worth discussing the presence and lack of effects of these Mu prophages. It also may lead to some discussion about the complexities of polylysogeny (as discussed by Silpe, et al, Nature, 2023).

      A full-length, inducible, Mu-like prophage region has been identified in the genome that has not been targeted for deletion, but will be included in follow-up studies. An earlier incomplete genome assembly contributed to the incorrect targeting and deletion of a prior Mu-like region, which was discussed in an earlier preprint version. Discussion and references to that strain have been removed from the more recent preprint versions. For clarity, the current manuscript describes strains that remain focused on the SfPat prophage, noting its contribution to the observed behavioral changes / traits.

      Is there any spontaneous induction of SfPat in vitro or in vivo with temperature change (prophages have been induced with heat stress), excessive UV exposure, or mitomycin C treatment?

      Preliminary induction studies using UV, mitomycin C, and temperature have been completed, but remain inconclusive with SfPat due to inconsistent induction patterns.

      Could you speculate, or perhaps do the experiment, as to whether the addition of VCBP-C to S. fidelis 3313 cultures affects biofilm production? The deletion of SfPat leads to greater biofilm production in vitro, while exogenously added VCBP-C represses SfPat P5 expression; would VCBP-C addition lead to greater biofilm production? Lastly, and this may be a failure of my understanding, is VCBP-C able to bind to S. fidelis? If so, does the prophage alter the bacteria and, consequently, the ability of VCBP-C to bind to the bacteria?

      Our lab is actively working to better understand the physical interactions of VCBP-C and bacteria, particularly lysogenic bacteria. Deletion mutants are helping us better understand the potential influence of the bacterial accessory genome on interactions with host immune mediators. Biofilm assays have been done in the context of VCBP-C (Dishaw et al, 2016). Subsequently, we tested the influence of 50 µg/ml VCBP-C on WT and prophage KO strains, which include SfPat KO along with neutral (control) regions of the genome. We found that the presence of VCBP-C reduced biofilm formation in WT and phage KO variants at 4 hrs and 24 hrs. However, at 12 hrs, VCBP-C treatment appears to increase biofilm formation in the phage KO strain. While the role (if any) of SfMu remains unclear, these preliminary data imply the existence of a feedback circuit (influenced by time) where immune effector binding and prophage influence on host gene expression together shape retention outcomes in the gut microbiome. This hypothesis remains to be tested further.

      Author response image 1.

      WT S. fidelis 3313 was exposed in vitro to 50 µg/ml VCBP-C in stationary cultures. Biofilms were observed for 24 hrs. At 12 hrs, the presence of VCBP-C increased the amount of biofilm, whereas reduced biofilms were observed at 4 and 24 hrs. Our findings (manuscript Fig. 2a) reveal that SfPat contributes to biofilm formation, exposure to SfPat deletion mutants increases host VCBP-C expression (manuscript Fig. 4a), and VCBP-C binding to WT S. fidelis 3313 reduces the expression of SfPat P5 capsid protein (manuscript Fig. 5). These findings suggest that in vivo exposure/colonization assays would benefit from detailed time-course observations, which will be explored in future follow-up experiments.

      Reviewer #3 (Public review):

      In this manuscript, Natarajan and colleagues report on the role of a prophage, termed SfPat, in the regulation of motility and biofilm formation by the marine bacterium Shewanella fidelis. The authors investigate the in vivo relevance of prophage carriage by studying the gut occupation patterns of Shewanella fidelis wild-type and an isogenic SfPat- mutant derivative in a model organism, juveniles of the marine tunicate Ciona robusta. The role of bacterial prophages in regulating bacterial lifestyle adaptation and niche occupation is a relatively underexplored field, and efforts in this direction are appreciated.

      While the research question is interesting, the work presented lacks clarity in its support for several major claims, and, at times, the authors do not adequately explain their data.

      Major concerns:

      (1) Prophage deletion renders the SfPat- mutant derivative substantially less motile and with a higher biofilm formation capacity than the WT (Fig. 2a-b). The authors claim the mutant is otherwise isogenic to the WT strain upon sequence comparison of draft genome sequences (I'll take the opportunity to comment here that GenBank accessions are preferable to BioSample accessions in Table 1). Even in the absence of secondary mutations, complementation is needed to validate functional associations (i.e., phenotype restoration). A strategy for this could be phage reintegration into the mutant strain (PMID: 19005496).

      We are currently investigating complementation strategies. However, there have been some challenges in re-infecting and/or reintegrating the prophage into the genome. A preferred integration site may be damaged due to the deletion approach. While the SfPat prophage has mostly predicted genes of unknown function or significance, we have begun prioritizing the deletion of distinct segments to help identify functional relevance.

      (2) The authors claim that the downshift in motility (concomitant with an upshift in biofilm formation) is likely mediated by the activity of c-di-GMP turnover proteins. Specifically, the authors point to the c-di-GMP-specific phosphodiesterase PdeB as a key mediator, after finding lower transcript levels for its coding gene in vivo (lines 148-151, Fig. 2c), and suggesting higher activity of this protein in live animals (!)(line 229). I have several concerns here:

      (2.1) Findings shown in Fig. 2a-b are in vitro, yet no altered transcript levels for pdeB were recorded (Fig. 2c). Why do the authors base their inferences only on in vivo data?

      (2.2) Somewhat altered transcript levels alone are insufficient for making associations, let alone solid statements. Often, the activity of c-di-GMP turnover proteins is local and/or depends on the activation of specific sensory modules - in the case of PdeB, a PAS domain and a periplasmic sensor domain (PMID: 35501424). This has not been explored in the manuscript, i.e., specific activation vs. global alterations of cellular c-di-GMP pools (or involvement of other proteins, please see below). Additional experiments are needed to confirm the involvement of PdeB. Gaining such mechanistic insights would greatly enhance the impact of this study.

      (2.3) What is the rationale behind selecting only four genes to probe the influence of the prophage on Ciona gut colonization by determining their transcript levels in vitro and in vivo? If the authors attribute the distinct behavior of the mutant to altered c-di-GMP homeostasis, as may be plausible, why did the authors choose those four genes specifically and not, for example, the many other c-di-GMP turnover protein-coding genes or c-di-GMP effectors present in the S. fidelis genome? This methodological approach seems inadequate to me, and the conclusions on the potential implication of PdeB are premature.

      We chose to study genes that were previously shown to influence biofilms and motility in a cyclic-di-GMP-dependent manner in Shewanella spp. (Chao et al 2013, S Rakshe 2011). Future transcriptomic efforts and targeted deletion approaches will further define the specific influence of prophages.

      (3) The behavior of the WT strain and the prophage deletion mutant is insufficiently characterized. For instance, how do the authors know that the higher retention capacity reported for the WT strain with respect to the mutant (Fig. 3b) is not merely a consequence of, e.g., a higher growth rate? It would be worth investigating this further, ideally under conditions reflecting the host environment.

      To clarify the method, in vitro growth curves did not suggest any significant difference in growth rate between the WT and the deletion mutant strains. Subsequently, for the in vivo experiments, bacterial cultures were pelleted and resuspended in sterile, nutrient-free artificial seawater. This limits growth until the bacterial strains are introduced to the animals.

      (4) Related to the above, sometimes the authors refer to "retention" (e.g., line 162) and at other instances to "colonization" (e.g., line 161), or even adhesion (line 225). These are distinct processes. The authors have only tracked the presence of bacteria by fluorescence labeling; adhesion or colonization has not been assessed or demonstrated in vivo. Please revise.

      We thank the reviewer for this feedback; the manuscript has been revised accordingly. While we refer to our assays as ‘colonization assays,’ we report results of ‘retention’ of various bacterial strains in the ‘exposed’ animals. Furthermore, when fluorescent staining is utilized, we report retention in defined niches. Since colonization is likely a two-step process, i.e., 1) retention and 2) colonization or long-term establishment of these microbial communities, using these terms correctly is warranted. In separate (unpublished) surveys of adult animals taken from the field, identical strains have been recovered numerous times over a twelve-year period.

      (5) The higher CFU numbers for the WT after 24 h (line 161) might also indicate a role of motility for successful niche occupation or dissemination in vivo. The authors could test this hypothesis by examining the behavior of, e.g., flagellar mutants in their in vivo model.

      Interestingly, we find numerous flagellar/motility-associated protein coding genes like Flg, Fli and Fle present within the S. fidelis genome possessing an EAL domain, implicating them in the regulation of cyclic-di-GMP. Hence, a future global transcriptomic approach will help improve our understanding of the roles of these regulatory pathways.

      (6) The endpoint of experiments with a mixed WT-mutant inoculum (assumedly 1:1? Please specify) was set to 1 h, I assume because of the differences observed in CFU counts after 24 h. In vivo findings shown in Fig. 3c-e are, prima facie, somewhat contradictory. The authors report preferential occupation of the esophagus by the WT (line 223), which seems proficient from evidence shown in Fig. S3. Yet, there is marginal presence of the WT in the esophagus in experiments with a mixed inoculum (Fig. 3d) or none at all (Fig. 3e). Likewise, the authors claim preferential "adhesion to stomach folds" by the mutant strain (line 225), but this is not evident from Fig. 3e. In fact, the occupation patterns by the WT and mutant strain in the stomach in panel 3e appear to differ from what is shown in panel 3d. The same holds true for the claimed "preferential localization of the WT in the pyloric cecum," with Fig. 3d showing a yellow signal that indicates the coexistence of WT and mutant.

      The results section has been reworded to improve clarity. The WT and KO are mixed 1:1 to achieve the 10<sup>7</sup> cfu count.

      (7) In general, and especially for in vivo data, there is considerable variability that precludes drawing conclusions beyond mere trends. One could attribute such variability in vivo to the employed model organism (which is not germ-free), differences between individuals, and other factors. This should be discussed more openly in the main text and presented as a limitation of the study.

      Yes, a salient feature of this model is that we can leverage genetic diversity in our experimental design, but it can introduce experimental variability.

      Even with such intrinsic factors affecting in vivo measurements, certain in vitro experiments, which are expected, in principle, to yield more reproducible results, also show high variability (e.g., Fig. 5). What do the authors attribute this variability to?

      For experiments involving VCBP-C protein, we can use affinity-purified protein recovered from live animals, or recombinant protein that we synthesize in-house (Dishaw et al 2011, 2016). In the latter, we often observe slight lot-to-lot variation in affinity for the target (the bacterial surface). To account for this variation and to ensure the observations are robust despite it, production lots can be mixed in additional biological replicates. As such, slight variability in the in vitro assays can be due to this batch effect.

      (8) Line 198-199: Why not look for potential prophage excision directly rather than relying on indirect, presumptive evidence based on qPCR?

      The decision to rely on qPCR of prophage structural genes was based on preliminary data, in particular among lysogens possessing more than one prophage. Neither the plaque assay nor SYBR Gold staining could distinguish among the particles, and TEM imaging was not sufficiently qualitative. Since these prophages do not exclusively produce particles when induced, qPCR targeting structural proteins was found to be most informative.

      Reviewer #3 (Recommendations for the authors):

      Other major comments:

      Line 137 (and Fig. 2 legend): The authors did not test chemotaxis towards any specific chemoeffector, only motility. Please correct and see below my comments about motility assays.

      The reviewer is correct; we have modified our descriptors.

      Lines 142-144: The authors conflate quorum sensing with c-di-GMP metabolism. If the authors measured the expression of genes "regulating cyclic di-GMP," it is likely because c-di-GMP is known to regulate the switch between planktonic and sessile lifestyles. However, whether this is mediated by quorum sensing is a separate issue that was not explored in this work. Please revise.

      Thank you; these changes were made accordingly.

      Line 150: c-di-GMP is not a quorum sensing signal; please correct.

      Yes, we corrected the inadvertent yet misleading statement.

      Line 193: Please clarify "RNA was extracted from the biofilms." If S. fidelis was grown on "MA [Marine Agar] for 24 h in the presence or absence of 50 µg/ml VCBP-C" (lines 192-193), was RNA isolated from colonies growing on the plates? Was VCBP-C added to the agar? This is also unclear in the Methods section (lines 381-384), where it seems the authors conducted this experiment using broth cultures in multiwell plates, removing the supernatant, and extracting RNA from the biofilms (i.e., cells adhered to the walls and bottom of the wells?). Why only biofilm cells?

      Thank you for bringing this to our attention. We have rewritten the appropriate sections and methods to improve clarity. Following our initial studies, which revealed differential bacterial phenotypes (biofilm formation and motility assays), we decided to target and investigate gene expression in the biofilms. This way, the sessile cells that were not part of the biofilm do not obfuscate the data.

      Lines 204-205: The authors should refer to the behavior of the mutant, since they did not test what happens upon prophage integration, but after prophage deletion.

      The wording has been changed accordingly.

      Lines 206-207: Please explain why the authors state that "these different bacterial phenotypes" (referring to altered biofilm formation and motility) "influence host immune responses in a manner consistent with influences on gut colonization dynamics". What specific relationship are the authors suggesting between these processes, and in what way is this "consistent"?

      We previously demonstrated (Dishaw et al 2016) that copious amounts of VCBP-C protein are present under normal conditions in the gut and mostly found tethered to chitin-rich mucus lining the gut epithelium. The up-regulation of VCBP-C within one hour of exposure to the SfPat mutant relative to the WT S. fidelis is consistent with a role for VCBP-C in modulating bacterial settlement dynamics (Dishaw et al 2016). The mutant phenotype of reduced swimming and increased biofilm production is a likely trigger for the increased production of this secreted immune effector that may influence the retention of this bacterial variant, relative to the WT.

      Line 229: Apart from what I noted above about the authors' claim regarding PdeB activity, I believe the figure referred to here should be Fig. 2, not Fig. 5.

      Thank you for catching that oversight. It has been corrected.

      Figure 1: Was hypothetical protein 2 included in the deletion?

      Yes, hypothetical protein 2 was included in the deletion.

      Figure 3a-b: It is challenging to interpret data on plots using so many colors - including what appears to be a white circle (?) in Fig. 3a. How many replicates are represented here? Is it indeed n=3 in Fig. 3a and n=6 in Fig. 3b?  

      Figure 3a is a bee swarm plot. Each color represents a biological replicate, and the smaller circles represent technical replicates. This format shows all of the data, including their spread. Regarding the number of replicates, 3a and 3b are different experiments: 3a is a biofilm assay with three biological replicates, and 3b is a motility assay with six biological replicates.

      Figure 3: An explanation for the abbreviation "FP" is missing.

      Thank you for catching this oversight. The abbreviation has been defined.

      Figure S3: FP, which is proficiently occupied by the WT strain (Fig. S3a), is not labeled in the images provided for the mutant (Fig. S3c-d). It would be helpful to show it for comparison.

      Those other images did not have fecal pellets to label; however, Figure 3c does show a fecal pellet for an animal exposed to both WT and the SfPat mutant.

      Questions and comments regarding methods:

      Lines 290-291, 307: Please indicate an approximate range for "room temperature."

      The information has been added to the revised manuscript.

      Lines 292, 302: Why use hybrid LB/MB broth and agar? And strictly speaking, which LB formula (Lennox/Luria/Miller)?

      The hybrid broth reduces the concentration of salts that can interfere in some assays. The LB formula was Luria, and it is now included in the manuscript.

      Lines 300-302: The conjugation procedure is poorly described. It seems the authors conducted conjugal transfer by biparental mating in broth culture by inoculating a single colony of S. fidelis 3313 into an already grown culture of the E. coli donor strain?

      The biparental mating was done on plates; the manuscript has been clarified.

      Motility assay concerns:

      Swimming motility is generally assayed in soft agar (0.25-0.3% w/v). Why did the authors use 0.5% low-melt agarose? Usually, agar is employed instead of agarose, and such a high concentration of solidifying agent typically prevents proper swimming (see e.g. Kearns 2010).

      Our laboratory uses low-melt agarose for phage propagation and other assays. We continued using it because we observed robust and reproducible results in the swarming and swimming motility assays. In addition, 0.5% agarose is less dense than 0.5% agar, and its consistency is similar to that of the lower percentage soft agar.

      Lines 316-317: Please clarify: what is the "overlay motility assay" that was carried out "overnight at RT and then inoculated onto the center of soft agar"? Was this a two-step experiment? How were bacteria inoculated (stabbed, injected)? If injected, what volume and cell density were used?

      Thank you for bringing this to our attention. The methods section has been revised for clarity.

      Line 319: Each variable tested in duplicate? From what I understand, the only variable measured in this test is the diameter of the swimming halos. Do the authors mean they used two biological replicates? If so, please indicate the number of technical replicates as well.

      Multiple biological replicates were performed, each with two technical replicates. Two perpendicular measurements of diameter were recorded for each technical replicate to avoid bias. The methods section has been edited to improve clarity.

      Line 320: Were the swimming halos asymmetrical, hence the need to take two perpendicular measurements? If that was the case, it could indicate an excessive amount of solidifying agent.

      The halos were sometimes asymmetric, but to avoid variation across datasets, it became standard practice to measure perpendicular distances as stated above. 

      Regarding qPCR experiments:

      Please clarify how normalization of transcript levels was performed.

      It seems the authors conducted a double normalization, first with respect to the calibrator (rho), and again using the wild-type as a baseline reference for fold-change calculations (absence of error bars for WT data). If so, please specify on the vertical axes of the figures and in the Methods/figure legends.

      Since, in addition to rho, the authors assessed the expression stability of the "housekeeping" genes gyrB and recA, please also include the primers used for these genes.

      The appropriate manuscript sections have been updated for clarity. The bacterial qPCR was normalized to an internal standard, and then relative expression differences between SfPat and the WT were determined. The missing primer sequences have also been added.
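
      The two-step normalization discussed above can be illustrated with a minimal sketch. It assumes the widely used 2^-ΔΔCt calculation, which the response does not explicitly name, and uses hypothetical Ct values; the function name and its parameters are illustrative only.

```python
def relative_expression(ct_target_mut, ct_ref_mut, ct_target_wt, ct_ref_wt):
    """Fold change of a target transcript in the mutant relative to the WT.

    Assumption: roughly 100% amplification efficiency (one doubling per cycle),
    as required by the 2^-ddCt method. All inputs are Ct values.
    """
    # First normalization: target Ct relative to the reference gene (e.g. rho).
    delta_ct_mut = ct_target_mut - ct_ref_mut
    delta_ct_wt = ct_target_wt - ct_ref_wt
    # Second normalization: mutant relative to the WT baseline.
    delta_delta_ct = delta_ct_mut - delta_ct_wt
    return 2.0 ** (-delta_delta_ct)


# Hypothetical Ct values, for illustration only.
fold_mutant = relative_expression(22.0, 18.0, 24.0, 18.0)
# The WT compared against itself is the baseline of the calculation.
fold_wt = relative_expression(24.0, 18.0, 24.0, 18.0)
print(fold_mutant, fold_wt)
```

      Under this scheme the WT value is fixed at 1 by construction, which would explain why WT bars carry no error bars, as the reviewer noted.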

      Observations:

      Figure 2a-b: It is intriguing that the remarkable reduction in motility of the mutant is not associated with a comparably significant increase in biofilm formation.

      A statistically significant increase in biofilm was observed, along with a decrease in motility. As is common in crystal violet assays, some of the tertiary structures were not very stable and likely washed out during processing.

      Additionally, it is noteworthy that data for the mutant in panel 2a exhibit minimal variability, with all OD570 recordings being around 3.0. Did the authors dilute the crystal violet elution solution after adding acetic acid, or might they have reached the saturation limit of the spectrophotometer?

      The eluted acetic acid was not diluted further, and significant changes were observed. If the solution had been further diluted, the observed changes might have been more pronounced. 

      Minor comments and recommendations:

      All the suggested changes below have been incorporated.

      • Line 55: "Antibiotic resistance determinants" might be preferable to "genes" to avoid using "genes" twice in the same sentence.

      • Line 75-76: Italicize Pseudomonas aeruginosa.

      • Line 134: Instead of "at least," specify the average fold-change.

      • Line 141: In the heading, refer to the influence of the "prophage" (singular) rather than "prophages" (plural).

      • Discussion (style): Consider using past tense for phrases like "we utilize..." (line 202); "we find..." (line 204), etc.

      • Line 365 and elsewhere: Consider "mRNA levels" or "transcript levels" instead of "gene expression".

      • Table 3: UQ950 is a strain, not a plasmid. I assume the plasmid carried by UQ950 is pSMV3.

    1. Note: This response was posted by the corresponding author to Review Commons. Content has not been altered except for formatting.



      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity):

      In Arabidopsis, DNA demethylation is catalyzed by a family of DNA glycosylases including DME, ROS1, DML2, and DML3. DME activity in the central cell leads to the hypomethylation of maternal alleles in endosperm. While ROS1, DML2, and DML3 function in vegetative tissues to prevent spreading of DNA methylation from TE boundaries, their function in the endosperm was unclear. Using whole genome methylome analysis, the authors showed that ROS1 prevents hypermethylation of paternal alleles in the endosperm and thus promotes epigenetic symmetry between maternal and paternal genomes. The approach and experimental designs are appropriate, and the key conclusions are adequately supported by the results. However, there is not sufficient evidence to support the claim that DME demethylates the maternal allele at ROS1-dependent biallelically-demethylated regions. To clarify the issue, the authors could analyze whether there is an overlap between DMRs identified in ros1 endosperm and those identified in dme endosperm using published data. If there is any, the authors could show a genome browser example of a DMR including dme data.

      Response: Thank you for your insight on our work. To address your concern and further test our model that DME prevents methylation of the maternal allele at regions where ROS1 prevents methylation of the paternal allele, we turned to the allele-specific bisulfite-sequencing data published in Ibarra et al 2012. These data were from endosperm isolated at 7-8 DAP from aborting seeds of dme-2 +/- (Col-gl) plants pollinated by Ler. Our analysis of these data is now included in Figures 6 and 7 and Supplemental Figures 13-17. We show that when the loss-of-function allele dme-2 is inherited maternally, average methylation of the maternal allele increases at ROS1-dependent regions (in the revised version of the paper now referred to as ROS1 paternal, DME maternal regions) from less than 10% CG methylation to approximately 40% CG methylation (Fig. 6D), consistent with our previous analysis using the non-allelic Hsieh et al 2009 data (now moved to Supplemental Figure 15). These results thus provide additional evidence that DME removes maternal allele methylation at regions where ROS1 removes paternal allele methylation (compare Fig. 6B and 6D). We included relevant genome browser examples in Figure 7E and Supplemental Figure 14. In the revised version, the relationship between ROS1 and DME is further expanded upon in the text.

      Reviewer #1 (Significance):

      Endosperm is a tissue unique to flowering plants. Though it is an ephemeral tissue, the endosperm plays essential roles for seed development and germination. The endosperm is also the site where genomic imprinting occurs, and it has a distinct epigenomic landscape. This work provides a new insight that ROS1 may antagonize imprinted gene expression in the endosperm. However, it was not shown whether imprinted gene expression is indeed affected in ros1, or whether the ros1 mutation has phenotypic consequences. These results would be useful to discuss the evolution and significance of genomic imprinting.

      Response: We agree that the biological significance of ROS1-mediated paternal allele demethylation is presently unknown. We performed RNA-seq on wild-type and ros1 3C and 6C endosperm nuclei, but these data were unfortunately not of high enough quality to include in the manuscript. In the Discussion we suggest that disrupting ROS1-mediated paternal allele demethylation might lead to a gain of imprinting over evolutionary time. In future work we are planning to address potential relationships to gene imprinting using a molecular, RNA-sequencing approach as well as an evolutionary comparative approach. As expected, given that imprinted genes are associated with a parent-of-origin-specific epigenetic mark, we did not find any relationship between known imprinted genes and ROS1-dependent regions that are biallelically demethylated in wild-type endosperm (see lines 362-372).

      Reviewer #2 (Evidence, reproducibility and clarity):

      SUMMARY

      Hemenway and Gehring present evidence that the paternal genome in Arabidopsis endosperm is demethylated at several hundred loci by the DNA glycosylase/lyase ROS1. The evidence is primarily based on analysis of DNA methylation of ros1 mutants and of hybrid crosses where each parental genome can be differentiated by SNPs. I have some comments/questions/concerns, two of them potentially serious, but I think Hemenway and Gehring can address them through additional analyses of data that they already have available and a bit of clarification in writing.

      Response: Thank you for your thoughtful review of this study. Your insight and suggestions have helped add clarity to the paper.

      MAJOR COMMENTS:

      1. Could the excess methylation in ros1-3 relative to ros1-7 shown in Figures 1A and 1C be explained by a second mutation in the ros1-3 background that elevates methylation at some loci? Any mutation that increased RdDM at these loci, for example could have this effect. This could confound the identification and interpretation of biallelicly demethylated loci.

      Response: We propose a simpler explanation for the additional hypermethylation observed in ros1-3: ros1-3 is a loss-of-function (null) allele whereas ros1-7 is likely a hypomorphic allele. For clarity, we have added a diagram of all of the alleles used in this study as Supplemental Figure 1B. The ros1-3 allele was first described in Penterman et al, PNAS, 2007. It is a T-DNA insertion allele that was isolated in the Ws accession and then backcrossed 6 times to Col-0, greatly minimizing the risk of unlinked secondary mutations being present. There is no genetic evidence that there is another T-DNA insertion in this line. The ros1-7 allele was described in Williams et al, PLoS Genet, 2015. It was isolated from the Arabidopsis Col-0 TILLING population and is a missense mutation (E956K) in a residue in the glycosylase domain that is conserved among the four DNA glycosylases. It is known that ROS1 transcripts are produced from the ros1-7 allele (Williams et al 2015). We observe less hypermethylation in the ros1-7 background compared to the ros1-3 background, and thus propose that the ros1-7 allele is a hypomorphic allele of ROS1. The use of two independent ros1 mutant alleles for initial endosperm methylation profiling strengthens the findings of our study. Importantly, regions that are hypermethylated in ros1-3 are also hypermethylated in ros1-7, but to a lesser extent, and vice versa (Fig 1D, Supplemental Figs. 3 and 4).

      We also use a third allele in this study, ros1-1, which is a nonsense allele in the C24 accession. Notably, we find that the regions that are demethylated on both maternal and paternal alleles in wild-type C24 gain DNA methylation primarily on the paternal allele in ros1-1 endosperm (Figure 4C,D and Supplemental Figure 10). This is discussed further in response to your second point.

      Given these lines of evidence, a gain-of-function mutation in a methylation pathway, like RdDM, in the ros1-3 background is an unlikely explanation for increased hypermethylation compared to ros1-7. The use of three independent ros1 alleles for methylation profiling, all of which lead to the same conclusions, is a major strength of our study.

      2. It appears that the main focus of the manuscript, the existence of loci that are paternally demethylated by ROS1, is supported by a set of 274 DMRs. This is a small number relative to the size of the genome and raises suspicions of rare false positives. Even the most stringent p-values that DMR-finding tools report do not guarantee that the DMRs are actually reproducible in an independent experiment. Demonstrating overlap between these 274 DMRs and an independently defined set using a different WT control and different ros1 allele would suffice to remove this concern. It appears that authors already have the needed raw data with ros1-1 and ros1-7 alleles.

      Response: First, we should clarify that paternal demethylation by ROS1 is supported by more than the 274 DMRs. All ros1 CG hyperDMRs show an increase in paternal allele methylation in ros1 (Fig. 4B,D). The 274 DMRs are a distinct subset defined as having less methylation on the maternal allele than the paternal allele in ros1 endosperm and where there is no maternal allele hypomethylation in wild-type endosperm (refer to Fig. 5B).

      We agree with your sentiments about DMR-finders and we are cautious of relying exclusively on DMR calls when making conclusions. We verify the nature of identified DMRs using metaplots and weighted average comparisons throughout the paper, which we think increases confidence in the conclusions and goes beyond a simple DMR-calling approach.
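
      As an illustration of the kind of weighted average comparison mentioned above, here is a minimal sketch. It assumes that "weighted average" means the read-weighted methylation fraction of a region (methylated reads summed over all cytosines, divided by total reads), which the response does not spell out; the function name and the counts are hypothetical.

```python
def weighted_methylation(sites):
    """Read-weighted methylation fraction for one region.

    `sites` is an iterable of (methylated_reads, total_reads), one pair per
    cytosine. Returns None when the region has no read coverage.
    """
    sites = list(sites)  # allow generators; we iterate twice below
    methylated = sum(m for m, _ in sites)
    total = sum(t for _, t in sites)
    return methylated / total if total else None


# Hypothetical per-CG read counts for one region, split by parental allele.
maternal = [(1, 10), (0, 8), (1, 12)]
paternal = [(6, 10), (9, 12), (3, 8)]
print(weighted_methylation(maternal), weighted_methylation(paternal))
```

      Weighting by read depth means that well-covered cytosines contribute more than sparsely covered ones, so a comparison between alleles or genotypes is less sensitive to a few low-coverage sites than a plain mean of per-site fractions would be.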

      We argue that we have replicated the major conclusion of the paper, that ROS1 prevents paternal allele hypermethylation at target regions in the endosperm, in the following ways:

      1. In the dataset without allelic-specific methylation information (Figures 1-3), we found that both ros1-3 and ros1-7 CG hyperDMRs have a limited capacity for hypermethylation in the endosperm relative to leaf or sperm (Table 1, Fig 3, Supplemental Fig. 4). In the allele-specific dataset, ros1-3 CG hyperDMRs were revealed to have particularly low maternal mCG relative to paternal mCG in ros1 mutant endosperm (Fig 4A-B, Supplemental Fig. 10).
      2. We found that ros1-3 and ros1-1 hyperDMRs, which we identified using non-allelic data, are biased for paternal allele hypermethylation in the endosperm of F1 hybrids (Fig 4B,D). The replicability of the paternal bias in hypermethylation in both ros1-3 in the Col-0 ecotype and ros1-1 in the C24 ecotype is a critical result, and we have moved the ros1-1 hyperDMR plots from the supplement to main figure 4C-D in the revised version of the manuscript as a result of your comment.
      3. The 274 DMRs identified as “biallelically-demethylated, ROS1-dependent” are by definition replicated between reciprocal cross directions. (Note that we now refer to these regions as ROS1 paternal, DME maternal regions in the revision.) Regions in this category had to be called as maternally-hypomethylated in both ros1-1 x ros1-3 and ros1-3 x ros1-1 endosperm. These regions also had to not be identified as maternally-hypomethylated in both C24 x Col-0 and Col-0 x C24. We hope this is clarified for readers by Table 1, which we have included based on your suggestion in comment #3, as well as other clarifying edits we made in this section of the paper.

      3. [...] comparisons between maternal and paternal methylation in endosperm, DMRs defined by comparison between mutants and wildtype, and more. These need clearer descriptions of which sets are being referred to throughout the main text and in figure legends. A table summarizing them might help (not in the supplement). Use of consistent and precisely defined terms would help. Stating the number of DMRs along with the name for each set would help a lot, even though this would make for some redundancy. (The number of DMRs in each set not only helps with interpretation but also acts as a sort of ID.) The reason I put this as a major concern is that the text and figures are difficult to understand, and it is currently hard to evaluate both the results and the authors' conclusions from those results.

      Response: Thank you for your feedback and suggestions. We have edited the main text so that only one descriptive name is used for each DMR type throughout the paper. We have also renamed regions for greater clarity. The previous “ROS1-independent, maternally demethylated regions” are now referred to as “DME maternal regions”. The previous “ROS1-dependent, biallelically-demethylated regions” are now referred to as “ROS1 paternal, DME maternal regions”. These changes provide greater clarity and also emphasize the role of DME at regions that are paternally hypermethylated in ros1. We have added Table 1 to summarize the DMR classes of interest.

      MINOR COMMENTS

      1. The sRNA results in Figure 2B are difficult to interpret because they do not reveal anything about the number of TEs that have siRNAs overlapping them or their flanks. While the magnitude of some of the highest endosperm sRNA peaks is higher than the embryo peaks, that could be explained by a small number of TEs with large numbers of sRNAs. To make this result more interpretable, we also need some information about how many TEs have a significant number of sRNAs associated with them in endosperm and embryo in each region (e.g., middle, 5', 3', and flanks of TEs). What a "significant number of sRNAs" is would be up to the authors to decide based on the distribution of sRNA counts they observe for TEs. Perhaps the top quartile of TEs? Combined with the same analysis done in parallel with non-ROS1 target TEs, this would reveal whether there is any evidence for ROS1 counteracting sRNA-driven methylation spread from TEs.

      Response: Thank you for the suggestion. We now present these data and the data for individual TEs underlying the metaplots in Supplemental Figure 7. As suggested by the reviewer, ROS1 TEs do not have uniformly higher levels of sRNA in their flanks in the endosperm compared to the embryo. We have modified our interpretations accordingly.

      1. The statement "we are likely underestimating the true degree of differential methylation among genotypes" should be validated and partially quantified using a methylation metaplot like Figure 2A, but substitute DMRs for TEs. Related to that, Figure 1B needs an indicator of scale in bp.

      Response: We have now included a methylation metaplot over ros1-3 hyperDMRs and ros1-7 hyperDMRs as Supplemental Figure 3. These plots show that there is indeed additional hypermethylation in DMR-proximal regions. We have added a scale bar to Figure 1B and other browser examples in the paper.

      1. The statement "Over half of ROS1 target regions identified in the ros1-3 mutant endosperm were within 1 kb or intersecting a TE (Fig. 1D)" is hard to interpret without some kind of ROS1 non-target regions or whole-genome control comparison. How different are the numbers in Fig. 1D from a random expectation?

      Response: We have now included a control for random regions in Figure 1E. We define these as regions where there was sufficient methylation data coverage and a low enough methylation level in wild-type to detect hypermethylation if it existed.
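
      The comparison the reviewer asks for can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes BED-style half-open intervals, uses hypothetical coordinates, and the function names are invented for the example.

```python
def near_any(region, features, max_gap=1000):
    """True if `region` intersects, or lies within `max_gap` bp of, a feature.

    Intervals are (chrom, start, end) tuples, half-open as in BED files.
    """
    chrom, start, end = region
    for f_chrom, f_start, f_end in features:
        if f_chrom != chrom:
            continue
        # Distance between the two intervals; <= 0 means they touch or overlap.
        gap = max(f_start - end, start - f_end)
        if gap <= max_gap:
            return True
    return False


def fraction_near(regions, features, max_gap=1000):
    """Fraction of `regions` within `max_gap` bp of, or intersecting, a feature."""
    if not regions:
        raise ValueError("no regions supplied")
    hits = sum(near_any(r, features, max_gap) for r in regions)
    return hits / len(regions)


# Hypothetical coordinates, for illustration only.
tes = [("Chr1", 5000, 7000)]
dmrs = [("Chr1", 7500, 7800), ("Chr1", 20000, 20300)]
controls = [("Chr1", 9000, 9300), ("Chr1", 30000, 30300)]
print(fraction_near(dmrs, tes), fraction_near(controls, tes))
```

      Running the same function on the DMRs and on the control regions gives two fractions that can be compared directly, which is the "random expectation" baseline the reviewer requested. For genome-scale data an interval tree or a tool built for interval arithmetic would be more efficient than this quadratic scan.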

      1. The sentence at line 262 is confusing. Is the comparison between dme mutant and ros1 mutant or between different types of regions? And it appears that the comparison value is missing in the "3-5% CG methylation gain..." e.g., "3-5% CG methylation vs 10-20%" or something like that.

      Response: This section has been re-written as we now focus on allele-specific dme endosperm methylation data for our comparisons.

      1. The dme mutant data in Figure 5C appear to be key to the model in Figure 7. The relative impact of the dme mutant in the two types of regions should be quantified.

      Response: Thank you for this comment. To further probe our model that DME prevents hypermethylation of the maternal allele at regions where ROS1 is preventing hypermethylation of the paternal allele, we turned to the allele-specific bisulfite-sequencing data published in Ibarra et al 2012 (see also response to reviewer #1). Using these data, we show that when the loss-of-function allele dme-2 is inherited maternally, ROS1 paternal, DME maternal regions (previously referred to as ROS1-dependent, biallelically-demethylated regions) are CG hypermethylated on the maternal allele (Figure 6D). Thus, these results both replicate the observations made with the Hsieh et al 2009 data and provide additional evidence that DME prevents maternal allele hypermethylation at regions where ROS1 is preventing paternal allele hypermethylation. These results have replaced the Hsieh et al 2009 results in Figure 6, and we have moved the analysis of Hsieh et al 2009 data to Supplemental Figure 15.

      1. Looks like sRNA methods are missing.

      Response: Thank you for identifying this. We previously included the reference for the analyzed dataset we used and the method for plotting under an unclear section header. These methods are now in the section “Analysis of average methylation and 24-nt sRNA patterns for features of interest”, and we have added additional reference to the specific dataset we used.

      1. Supplemental Figure 1 is hard to interpret since it only lists gene IDs, not gene names.

      Response: As suggested, we have added gene names to this figure.

      The last comments are suggestions for increasing the impact of this study:

      1. Figure 2A and 3B suggest that ROS1 target TEs show demethylation in their flanks but not in the TEs themselves. This is an interesting result. If it is true, more DMRs would be expected in the ROS1 target flanks than in the ROS1 target TEs. Reporting how many ROS1 target TEs have DMRs in them and what proportion have DMRs in their flanking 1-kb regions would answer this question. Given the significance of this result, it also deserves a bit more context: Is the magnitude of increased methylation flanking TEs in ros1 mutant endosperm different than in ros1 mutant leaves or other tissue? Does methylation in TE flanks behave the same way in dme mutant endosperm?

      Response: We define “ROS1 target TEs” (now referred to more simply as ROS1 TEs) as TEs within 1 kb of or intersecting a ros1-3 hyperDMR. Consistent with your interpretation, 80% of the TEs in this category do not have a DMR overlapping them; instead, they have a DMR within 1 kb. We now mention this in the text on line 150.

      The total level of DNA methylation at ROS1 TEs is lower in the endosperm than in leaf, as DNA methylation levels are overall lower in endosperm than in leaf. The magnitude of increased methylation flanking TEs in ros1 mutant endosperm is not different between the two tissues. This is observable in Supplemental Fig. 5 in the revised version of the paper, and we report this result in the revised text. In the revision we also present methylation profiles of DME TEs in WT and ros1 endosperm (Fig. 7B-D). DME TEs are hypomethylated in both the body and flanks in WT and ros1.

      1. The idea of biallelic demethylation has been theoretically suggested in maize to explain weak overlap between endosperm DMRs and imprinting (Gent et al 2022). If that were true in Arabidopsis, then ROS1 target, biallelicly demethylated loci would be less likely to have imprinted expression than maternally demethylated loci. This prediction could be tested using available data in Arabidopsis.

      Response: Indeed, as you hypothesize, there are no known imprinted genes (Pignatta et al 2014) associated with biallelically-demethylated, ROS1-dependent regions (now referred to as ROS1 paternal, DME maternal regions). As expected, there are imprinted genes associated with maternally-demethylated regions (now referred to as DME maternal regions). 23 imprinted genes identified in the Pignatta et al 2014 study are within 1 kb of or intersecting a DME maternal region. This is discussed on lines 364-374.

      1. There is currently no evidence for biological significance of biallelicly demethylated loci. Knowing where they are in the genome might give some hints. A figure like Fig. 1D but specifically showing the biallelicly demethylated DMRs would be valuable.

      Response: This is now included in Figure 7A.

      1. It is hard to make the comparisons between genotypes and parental genomes in Figure 6 and know what they mean. Maybe a different way of displaying the data would help. Or maybe even a different labeling system could make it a little more accessible.

      Response: We have revised this figure (now Fig. 8) in the following ways, which we believe address your comments and clarify the main conclusions:

      Figure 8C is now a boxplot comparing methylation of the paternal allele of ROS1 paternal, DME maternal regions (previously referred to as biallelically-demethylated, ROS1-dependent regions) across endosperm ROS1 genotypes. This plot shows increased methylation of paternal alleles when the paternal parent is a ros1 mutant, regardless of whether the resultant F1 endosperm is homozygous or heterozygous for ros1 (columns 3, 4, 6).

      Figure 8B remains as a scatterplot, where we can observe significant correlation between individual ROS1 paternal, DME maternal regions in homozygous ros1 endosperm and heterozygous ros1/+ endosperm. Note that paternal allele methylation is higher in homozygous ros1 endosperm for most regions.

      Reviewer #2 (Significance):

      Demethylation of the maternal genome in endosperm has been the subject of much research because it can result in genomic imprinting of gene expression. The enzymes responsible, DNA glycosylases/lyases, also demethylate DNA in other cell types, where DNA methylation is not confined to one parental genome (biallelic or biparental as opposed to uniparental demethylation). To the best of my knowledge, the extent or even existence of biallelic demethylation in endosperm has not been studied until now (except for a superficial look in a bioRxiv preprint, https://www.biorxiv.org/content/10.1101/2024.07.31.606038v1). Hemenway and Gehring have carried out a thoughtful and detailed analysis of the topic in Arabidopsis at least as far as it depends on the DNA glycosylase ROS1.

      A limitation is that the study design would miss biallelic demethylation by any of the other three DNA glycosylases in Arabidopsis. A second limitation is that there is no clear biological significance, just some conjecture about evolution. Nonetheless, given the novelty of the topic, biological significance may follow.

      The audience for biallelic DNA demethylation in Arabidopsis endosperm is certainly in the "specialized" category, but its relevance to the larger topic of gene regulation in endosperm will attract a larger audience.

      Response: With regard to the other demethylases, note that we also profiled methylation in ros1 dml2 dml3 triple mutant endosperm. We found few DMRs in the triple mutant that were not also present in the ros1 single mutant. We do not rule out a function for DML2 or DML3 in the endosperm, but no such function is observed at the level of bulk endosperm.

      The reviewer is correct that we have shown a molecular phenotype (paternal allele hypermethylation) and not a developmental or morphological phenotype. A function that occurs in one parent but not the other is, to us, exciting. Our thoughts about how this finding might relate to imprinting are indeed speculative, but not wildly so.

      Reviewer #3 (Evidence, reproducibility and clarity):

      DNA demethylases play a key role in DNA methylation patterning during flowering plant reproduction. The demethylase DME, in particular, is critical for proper endosperm development. While the function of DME in endosperm development has been explored, the contributions of the other demethylases in the same family, ROS1, DML2 and DML3 in Arabidopsis, have not yet been investigated. In vegetative tissues, ROS1 prevents hypermethylation of some loci. In this work, Hemenway and Gehring explore whether ROS1, DML2 and DML3 also affect DNA methylation patterns in endosperm. Using EM-seq of sorted endosperm nuclei, they show that loss of ROS1 indeed causes hypermethylation of a number of loci, particularly the flanks of methylated transposons, while loss of DML2 and DML3 has minimal additional effect. By obtaining allele-specific EM-seq data through crosses of Col and C24, the authors show that ros1 endosperm hypermethylation is mostly restricted to the paternal allele. The authors propose that at some sites, ROS1 helps bring down paternal methylation levels to match maternal methylation levels, which are typically reduced in endosperm due to DME activity in the female gametophyte prior to fertilization. In a ros1 mutant with paternal hypermethylation, these sites become differentially methylated on the maternal and paternal alleles, resembling imprinted loci. This work convincingly establishes a function for ROS1 in DNA methylation patterning in endosperm. However, I struggled with the clarity of the writing and reasoning in a few places, and would suggest clarification of a few points and additional analyses below.

      Response: Thank you for your thoughtful review of our paper. Your questions and suggestions have been invaluable in revising the work.

      I think making a few simple changes to streamline nomenclature would improve readability. For example, in the section starting on line 129, the same set of genomic features are called ROS1 target-proximal TEs, TEs that are near a ROS1 target region, and ROS1 target-associated TE regions. Also for example in line 254 "regions that are maternally-demethylated in wild-type endosperm, and are not dependent on ROS1 for proper demethylation" - are these the same as the "ROS1-independent, maternally-demethylated" regions in Fig. 5a? Given how complex these terms are, being consistent throughout the manuscript really helps the reader.

      Response: We edited the text and figures so that only one descriptive name is used for each DMR class or region throughout the paper. Thank you for this feedback; these edits have made the paper much clearer.

      Is there any notable effect of ros1 on gene expression in endosperm? Endosperm is a terminal tissue, so maintaining DNA methylation boundaries as ROS1 does in vegetative tissues seems less important. It begs the question of why ROS1 is doing this in endosperm, is it just because it's there, or is there an endosperm-specific function? Exploring effects on imprinting would be particularly interesting (does loss of ROS1 'create' imprinted loci at these newly asymmetrically methylated sites?) but probably beyond the scope of the present work.

      Response: We agree, the question of the functional consequence of ROS1 activity in the endosperm is something we are keen to address in future work. We performed RNA-seq on wild-type and ros1 3C and 6C endosperm nuclei, but these data were unfortunately not of high enough quality to include in the manuscript. We are in particular interested in this question you have proposed – if loss of ROS1 can ‘create’ imprinted loci. We are planning to address this both using a molecular, RNA-sequencing approach as well as an evolutionary comparative approach. This is an important and exciting future direction.

      Is DME expressed in sperm, or is expression of DME affected in ros1 sperm or endosperm? One other explanation for ros1 hypermethylation occurring primarily on the paternal allele is that, potentially, DME can substitute for ROS1 in the central cell where DME is already very active, but not in sperm cells. Related, how well expressed is ROS1 vs. DME in sperm cells?

      Response: This is an important series of questions, and something we are very interested in as well. Studies of Arabidopsis pollen have shown that both ROS1 and DME, while they prevent some hypermethylation in sperm, are more active in the vegetative nucleus of pollen than in sperm. ROS1 is expressed at a low level in the microspore and bicellular pollen and DME is expressed at a low level throughout pollen development. We have included Supplemental Fig. 17 with available expression data to make this point in the paper. Likely, any effects of loss of ROS1 or DME on sperm DNA methylation are inherited from precursor cells (Ibarra et al 2012, Calarco et al 2012, Khouider et al 2021). Your proposal that perhaps DME can sub in for ROS1 in the central cell but not in sperm is intriguing. Unfortunately there’s not enough data in the central cell to convincingly address this at this time.

      To investigate the relationship between DME and ROS1 in the male germline, we used the bisulfite-sequencing data generated in sperm cells in Khouider et al 2021. We calculated average DNA methylation levels in dme/+, ros1, dme/+;ros1, and wild-type Col-0 sperm cells at ROS1 paternal, DME maternal regions, shown in Supplemental Fig. 18A. We observed little increase in mCG methylation in dme/+ sperm relative to wild-type Col-0 sperm. This is consistent with your proposed model that DME is unable to demethylate these regions outside of the female germline. As expected, there is increased mCG in ROS1 paternal, DME maternal regions in ros1-3 mutant sperm relative to wild-type Col-0 sperm. DME maternal regions are highly methylated in wild-type Col-0 sperm.

      Fig 2b shows that ROS1 target-associated TEs are enriched for sRNAs in endosperm relative to embryo, whereas the reverse is true for non-ROS1-assoc TEs. Since TEs are not always well annotated and some may be missing from this analysis, what about trying the reverse analysis - are regions enriched for 24nt sRNAs in endosperm significantly hypermethylated in ros1 endosperm? All regions or only some?

      Response: We performed an analysis to address your inquiry and observed a low magnitude increase in DNA methylation in ros1 mutant endosperm at regions defined by Erdmann et al as more sRNA producing in the endosperm relative to the embryo (endosperm DSRs). Endosperm DSRs are generally lowly methylated in wild-type endosperm, as was observed originally in Erdmann et al 2017. Small increases in DNA methylation are observed at endosperm DSRs in all sequence contexts in ros1 endosperm. Overall, this is consistent with ROS1 targets being a subset of sRNA-producing regions in the endosperm. This analysis is now included in Supplemental Fig. 7C.

      What is the relationship between previously-defined DME targets and ROS1 targets identified in this paper? DME tends to target small euchromatic TE bodies, whereas Fig. 3 suggests that ROS1 helps prevent methylation spreading on the outer edges of the TEs, rather than in the TE body. Do all DME targets tend to be adjacent to or flanked by ROS1 target sites? Or are the TEs affected by DME (in body) and by ROS1 (at edges) largely nonoverlapping? Fig. 5a suggests that the ROS1-dependent, biallelically-demethylated sites are both DME and ROS1 targets, but how often do these really appear to overlap? More than by chance?

      Response: We have sought to address your comments through a series of analyses that we have included in Fig. 7 and Supplemental Fig. 16. We found that ROS1 paternal, DME maternal regions (formerly referred to as ROS1-dependent, biallelically-demethylated regions) and DME maternal regions (formerly referred to as ROS1-independent, maternally-demethylated regions) do not occupy the same genomic regions. However, we do observe some evidence for ROS1 activity in flanking regions of DME targets (Fig. 6A, Fig. 7B-D). To look at TEs specifically, as you suggest, we first identified TEs that were within 1kb or intersecting a DME maternal region. Based on our characterization of these regions, we assume these to be DME-targeted TEs. We then performed ends analysis to see if there was evidence of ROS1 activity at the ends of these TEs. Indeed, at a global level there is a slight hypermethylation of the paternal allele in a ros1 mutant at the end of these DME TEs (Fig. 7B). To better visualize how many DME TEs are showing ROS1 activity at their ends, we then plotted the difference between the median ros1-3 methylation and median Col-0 values in the non-allelic endosperm for each TE in a clustered heatmap (Fig. 7C). The parent-of-origin data does not have enough coverage for clustering in this way, so we used the non-allelic data. A small fraction of “DME TEs” gain methylation in the ros1 mutant endosperm relative to wild-type (Fig. 7C-D).
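
      The selection and per-TE summary described in this response can be illustrated with a minimal sketch. It assumes that TEs and DME maternal regions are half-open (start, end) intervals on the same chromosome and that per-TE methylation values are plain lists of fractions; the function names and example numbers are hypothetical and are not taken from the authors' pipeline.

```python
from statistics import median

def within_distance(te, region, max_gap=1000):
    """Return True if two half-open (start, end) intervals on the same
    chromosome intersect or lie within max_gap bp of each other."""
    te_start, te_end = te
    reg_start, reg_end = region
    # Negative gap means the intervals overlap; positive gap is the
    # number of bp separating them.
    gap = max(te_start, reg_start) - min(te_end, reg_end)
    return gap <= max_gap

def median_difference(mutant_values, wildtype_values):
    """Difference between median mutant and median wild-type methylation,
    the per-TE quantity that a clustered heatmap would display."""
    return median(mutant_values) - median(wildtype_values)

# Hypothetical example: one TE, two candidate regions.
te = (5000, 6000)
print(within_distance(te, (6800, 7200)))   # 800 bp away, inside the 1 kb window
print(within_distance(te, (8000, 8500)))   # 2000 bp away, outside the window
print(median_difference([0.62, 0.70, 0.66], [0.40, 0.45, 0.42]))
```

      A positive difference indicates a TE that gains methylation in the mutant relative to wild-type. Clustering the resulting per-TE values would be a separate step that this sketch does not attempt.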

      Are the TEs whose boundaries are demethylated by ROS1 more likely to be expressed in vegetative or endosperm tissues than TEs not affected by loss of ROS1? Expressed TEs likely produce more sRNAs, which would increase RdDM in a way that might need to be more actively countered by ROS1 than transcriptionally silent or evolutionarily older TEs.

      Response: This is an interesting line of inquiry, although perhaps out of the scope of our present study. It has been shown that TEs demethylated by ROS1 are targeted by the RdDM pathway in Arabidopsis vegetative tissue (Tang et al 2016). Using data from Erdmann et al 2017, we looked at 24 nt sRNAs at ROS1-TEs in the endosperm and embryo (Supplemental Fig. 7). sRNA production at ROS1 TE-flanking regions is observed in both embryo and endosperm, but clearly not all ROS1 TEs produce 24 nt sRNAs in the seed. Future work comparing sRNA profiles in a ros1 mutant to those of wild-type could inform our understanding of TE spreading in a ros1 mutant, as would a comprehensive analysis of TE expression, again in both a ros1 mutant and in wild-type. It’s unclear to us if the endosperm would be the most informative or useful tissue to perform such analyses in.

      Fig6 - as noted in the text, one way to test whether demethylation by ROS1 occurs before or after fertilization is to provide functional ROS1 through only one parent via reciprocal WT x ros-1 crosses, so that the endosperm always has ROS1 but either sperm or central cell does not, and see if this can rescue the paternal hypermethylation. If ROS1 acts prior to fertilization, then paternal ROS1 will rescue ros1 hypermethylation, but maternal ROS1 won't. If after fertilization, then either maternally or paternally supplied ROS1 will rescue the hypermethylation phenotype (assuming both are well expressed). Thus, to distinguish the two, it is sufficient to test whether maternally supplied ROS1 in an otherwise mutant background can rescue the hypermethylation phenotype, which is what is shown in Fig. 6. However, I think it's also important to show that paternally supplied ROS1 can also rescue the hypermethylation phenotype, which is not currently shown. The plots showing no effect on maternal mCG aren't as informative, since maternal methylation levels are mostly unaffected by ros1 anyway. Instead of comparing pairs of samples in a scatterplot, it might be clearer to show paternal mCG across all four comparisons (WT x WT, WT x ros1, ros1 x WT, and ros1 x ros1) side by side in a heatmap, using clustering to group similar behavior.

      Response: We have revised this figure, now Fig. 8, in the following ways, which we believe address your comments and clarify the main conclusions (see same response to reviewer 2 for point 14):

      Figure 8B remains as a scatterplot, where we observe significant correlation between individual ROS1 paternal, DME maternal regions in homozygous ros1 endosperm and heterozygous ros1/+ endosperm. Note that paternal allele methylation is higher in homozygous ros1 endosperm for most regions.

      Figure 8C is now a boxplot comparing methylation of the paternal allele of ROS1 paternal, DME maternal regions (previously referred to as biallelically-demethylated, ROS1-dependent regions) across endosperm ROS1 genotypes. This plot shows increased methylation of paternal alleles when the paternal parent is a ros1 mutant, regardless of whether the resultant F1 endosperm is homozygous or heterozygous for ros1 (columns 3, 4, 6).

      I would also suggest including a little more information in the main plots rather than only in the figure legends. For example, in Fig 2 including a label of 'ROS1-associated TE' for the two plots on the left, and 'TEs not associated with ROS1' on the right. Or for example in Fig. 3a indicating 'ros1-3 CG hyperDMRs' somewhere on the plot. This would just help make the figures easier to read at a glance. Please add common gene names to figures, instead of just the ATG gene ID (Fig. S1a).

      Response: Thank you for this feedback, we have made the suggested edits and additional edits of a similar nature.

      Minor:

      • Fig. 1E is referenced in the text before Fig. 1D
      • Fig. S4 and S5 - there are more lines in the plot than the 6 genotypes listed in the legend, do these represent different replicates? If so that should be noted in the legend
      • Fig. 1B has no color legend for the different methylation sequence contexts (looks like same as 1A,C but should indicate either in plot or legend)
      • Line 42 should be "correspond to TE ends"
      • Line 93 "Based on previous studies..." should have references to those studies
      • When referring to the protein (rather than the genetic locus or mutant), ROS1 should not be italicized - for example line 130
      • Line 150 "we conclude that the loss"
      • Should add a y=x line to scatterplots, like those in Fig. 6
      • In fig. 1d, it's hard to evaluate the significance of the overlap of ROS1 targets with genes and TEs. Comparing these numbers to a control where the ROS1 targets have been randomly shuffled would help.

      Response: We have made edits and additions where requested.

      Reviewer #3 (Significance):

      In this work, Hemenway and Gehring explore whether ROS1, DML2 and DML3 also affect DNA methylation patterns in endosperm. Using EM-seq of sorted endosperm nuclei, they show that loss of ROS1 indeed causes hypermethylation of a number of loci, particularly the flanks of methylated transposons, while loss of DML2 and DML3 has minimal additional effect. By obtaining allele-specific EM-seq data through crosses of Col and C24, the authors show that ros1 endosperm hypermethylation is mostly restricted to the paternal allele. The authors propose that at some sites, ROS1 helps bring down paternal methylation levels to match maternal methylation levels, which are typically reduced in endosperm due to DME activity in the female gametophyte prior to fertilization. In a ros1 mutant with paternal hypermethylation, these sites become differentially methylated on the maternal and paternal alleles, resembling imprinted loci. This work convincingly establishes a function for ROS1 in DNA methylation patterning in endosperm. However, I struggled with the clarity of the writing and reasoning in a few places, and would suggest clarification of a few points and additional analyses.

      Response: Thank you for your comments. We have worked on streamlining the text and analysis.

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


      Referee #3

      Evidence, reproducibility and clarity

      DNA demethylases play a key role in DNA methylation patterning during flowering plant reproduction. The demethylase DME, in particular, is critical for proper endosperm development. While the function of DME in endosperm development has been explored, the contributions of the other demethylases in the same family, ROS1, DML2 and DML3 in Arabidopsis, have not yet been investigated. In vegetative tissues, ROS1 prevents hypermethylation of some loci. In this work, Hemenway and Gehring explore whether ROS1, DML2 and DML3 also affect DNA methylation patterns in endosperm. Using EM-seq of sorted endosperm nuclei, they show that loss of ROS1 indeed causes hypermethylation of a number of loci, particularly the flanks of methylated transposons, while loss of DML2 and DML3 has minimal additional effect. By obtaining allele-specific EM-seq data through crosses of Col and C24, the authors show that ros1 endosperm hypermethylation is mostly restricted to the paternal allele. The authors propose that at some sites, ROS1 helps bring down paternal methylation levels to match maternal methylation levels, which are typically reduced in endosperm due to DME activity in the female gametophyte prior to fertilization. In a ros1 mutant with paternal hypermethylation, these sites become differentially methylated on the maternal and paternal alleles, resembling imprinted loci. This work convincingly establishes a function for ROS1 in DNA methylation patterning in endosperm. However, I struggled with the clarity of the writing and reasoning in a few places, and would suggest clarification of a few points and additional analyses below.

      I think making a few simple changes to streamline nomenclature would improve readability. For example, in the section starting on line 129, the same set of genomic features are called ROS1 target-proximal TEs, TEs that are near a ROS1 target region, and ROS1 target-associated TE regions. Also for example in line 254 "regions that are maternally-demethylated in wild-type endosperm, and are not dependent on ROS1 for proper demethylation" - are these the same as the "ROS1-independent, maternally-demethylated" regions in Fig. 5a? Given how complex these terms are, being consistent throughout the manuscript really helps the reader.

      Is there any notable effect of ros1 on gene expression in endosperm? Endosperm is a terminal tissue, so maintaining DNA methylation boundaries as ROS1 does in vegetative tissues seems less important. It begs the question of why ROS1 is doing this in endosperm, is it just because it's there, or is there an endosperm-specific function? Exploring effects on imprinting would be particularly interesting (does loss of ROS1 'create' imprinted loci at these newly asymmetrically methylated sites?) but probably beyond the scope of the present work.

      Is DME expressed in sperm, or is expression of DME affected in ros1 sperm or endosperm? One other explanation for ros1 hypermethylation occurring primarily on the paternal allele is that, potentially, DME can substitute for ROS1 in the central cell where DME is already very active, but not in sperm cells. Related, how well expressed is ROS1 vs. DME in sperm cells?

      Fig 2b shows that ROS1 target-associated TEs are enriched for sRNAs in endosperm relative to embryo, whereas the reverse is true for non-ROS1-assoc TEs. Since TEs are not always well annotated and some may be missing from this analysis, what about trying the reverse analysis - are regions enriched for 24nt sRNAs in endosperm significantly hypermethylated in ros1 endosperm? All regions or only some?

      What is the relationship between previously-defined DME targets and ROS1 targets identified in this paper? DME tends to target small euchromatic TE bodies, whereas Fig. 3 suggests that ROS1 helps prevent methylation spreading on the outer edges of the TEs, rather than in the TE body. Do all DME targets tend to be adjacent to or flanked by ROS1 target sites? Or are the TEs affected by DME (in body) and by ROS1 (at edges) largely nonoverlapping? Fig. 5a suggests that the ROS1-dependent, biallelically-demethylated sites are both DME and ROS1 targets, but how often do these really appear to overlap? More than by chance?

      Are the TEs whose boundaries are demethylated by ROS1 more likely to be expressed in vegetative or endosperm tissues than TEs not affected by loss of ROS1? Expressed TEs likely produce more sRNAs, which would increase RdDM in a way that might need to be more actively countered by ROS1 than transcriptionally silent or evolutionarily older TEs.

      Fig6 - as noted in the text, one way to test whether demethylation by ROS1 occurs before or after fertilization is to provide functional ROS1 through only one parent via reciprocal WT x ros-1 crosses, so that the endosperm always has ROS1 but either sperm or central cell does not, and see if this can rescue the paternal hypermethylation. If ROS1 acts prior to fertilization, then paternal ROS1 will rescue ros1 hypermethylation, but maternal ROS1 won't. If after fertilization, then either maternally or paternally supplied ROS1 will rescue the hypermethylation phenotype (assuming both are well expressed). Thus, to distinguish the two, it is sufficient to test whether maternally supplied ROS1 in an otherwise mutant background can rescue the hypermethylation phenotype, which is what is shown in Fig. 6. However, I think it's also important to show that paternally supplied ROS1 can also rescue the hypermethylation phenotype, which is not currently shown. The plots showing no effect on maternal mCG aren't as informative, since maternal methylation levels are mostly unaffected by ros1 anyway. Instead of comparing pairs of samples in a scatterplot, it might be clearer to show paternal mCG across all four comparisons (WT x WT, WT x ros1, ros1 x WT, and ros1 x ros1) side by side in a heatmap, using clustering to group similar behavior.

      I would also suggest including a little more information in the main plots rather than only in the figure legends. For example, in Fig 2 including a label of 'ROS1-associated TE' for the two plots on the left, and 'TEs not associated with ROS1' on the right. Or for example in Fig. 3a indicating 'ros1-3 CG hyperDMRs' somewhere on the plot. This would just help make the figures easier to read at a glance. Please add common gene names to figures, instead of just the ATG gene ID (Fig. S1a).

      Minor:

      • Fig. 1E is referenced in the text before Fig. 1D
      • Fig. S4 and S5 - there are more lines in the plot than the 6 genotypes listed in the legend, do these represent different replicates? If so that should be noted in the legend
      • Fig. 1B has no color legend for the different methylation sequence contexts (looks like same as 1A,C but should indicate either in plot or legend)
      • Line 42 should be "correspond to TE ends"
      • Line 93 "Based on previous studies..." should have references to those studies
      • When referring to the protein (rather than the genetic locus or mutant), ROS1 should not be italicized - for example line 130
      • Line 150 "we conclude that the loss"
      • Should add a y=x line to scatterplots, like those in Fig. 6
      • In fig. 1d, it's hard to evaluate the significance of the overlap of ROS1 targets with genes and TEs. Comparing these numbers to a control where the ROS1 targets have been randomly shuffled would help.
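
      The shuffled control suggested in the last bullet could be sketched roughly as follows. This is an illustration only: it assumes a single chromosome of known length, half-open (start, end) intervals, and uniform random placement of regions. A real analysis would need to respect chromosome boundaries and unmappable sequence, and all names and numbers here are hypothetical.

```python
import random

def near_any(region, features, max_gap=1000):
    """True if region intersects or lies within max_gap bp of any feature."""
    start, end = region
    return any(max(start, f_start) - min(end, f_end) <= max_gap
               for f_start, f_end in features)

def fraction_near(regions, features, max_gap=1000):
    """Fraction of regions that are near at least one feature."""
    return sum(near_any(r, features, max_gap) for r in regions) / len(regions)

def shuffled_fractions(regions, features, chrom_length,
                       n_shuffles=1000, seed=0, max_gap=1000):
    """Null distribution: re-place each region uniformly at random,
    preserving its length, and recompute the fraction near a feature."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_shuffles):
        shuffled = []
        for start, end in regions:
            length = end - start
            new_start = rng.randrange(0, chrom_length - length + 1)
            shuffled.append((new_start, new_start + length))
        results.append(fraction_near(shuffled, features, max_gap))
    return results

def empirical_p(observed, null_values):
    """One-sided empirical p-value with a +1 pseudo-count."""
    hits = sum(v >= observed for v in null_values)
    return (hits + 1) / (len(null_values) + 1)

# Hypothetical toy data: two target regions, one TE, a 100 kb chromosome.
targets = [(100, 200), (5000, 5100)]
tes = [(150, 300)]
observed = fraction_near(targets, tes)
null = shuffled_fractions(targets, tes, chrom_length=100_000, n_shuffles=200)
print(observed, empirical_p(observed, null))
```

      Comparing the observed fraction with the shuffled distribution indicates whether the overlap with TEs exceeds what random placement would give.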

      Significance

      In this work, Hemenway and Gehring explore whether ROS1, DML2 and DML3 also affect DNA methylation patterns in endosperm. Using EM-seq of sorted endosperm nuclei, they show that loss of ROS1 indeed causes hypermethylation of a number of loci, particularly the flanks of methylated transposons, while loss of DML2 and DML3 has minimal additional effect. By obtaining allele-specific EM-seq data through crosses of Col and C24, the authors show that ros1 endosperm hypermethylation is mostly restricted to the paternal allele. The authors propose that at some sites, ROS1 helps bring down paternal methylation levels to match maternal methylation levels, which are typically reduced in endosperm due to DME activity in the female gametophyte prior to fertilization. In a ros1 mutant with paternal hypermethylation, these sites become differentially methylated on the maternal and paternal alleles, resembling imprinted loci. This work convincingly establishes a function for ROS1 in DNA methylation patterning in endosperm. However, I struggled with the clarity of the writing and reasoning in a few places, and would suggest clarification of a few points and additional analyses.

    3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


      Referee #2

      Evidence, reproducibility and clarity

      Summary

      Hemenway and Gehring present evidence that the paternal genome in Arabidopsis endosperm is demethylated at several hundred loci by the DNA glycosylase/lyase ROS1. The evidence is primarily based on analysis of DNA methylation of ros1 mutants and of hybrid crosses where each parental genome can be differentiated by SNPs. I have some comments/questions/concerns, two of them potentially serious, but I think Hemenway and Gehring can address them through additional analyses of data that they already have available and a bit of clarification in writing.

      Major comments:

      1. Could the excess methylation in ros1-3 relative to ros1-7 shown in Figures 1A and 1C be explained by a second mutation in the ros1-3 background that elevates methylation at some loci? Any mutation that increased RdDM at these loci, for example could have this effect. This could confound the identification and interpretation of biallelicly demethylated loci.
      2. It appears that the main focus of the manuscript, the existence of loci that are paternally demethylated by ROS1, is supported by a set of 274 DMRs. This is a small number relative to the size of the genome and raises suspicions of rare false positives. Even the most stringent p-values that DMR-finding tools report do not guarantee that the DMRs are actually reproducible in an independent experiment. Demonstrating overlap between these 274 DMRs and an independently defined set using a different WT control and different ros1 allele would suffice to remove this concern. It appears that authors already have the needed raw data with ros1-1 and ros1-7 alleles.
      3. Because of the multiple sets of DMRs identified and used throughout the paper, it is hard to follow which one is which. There are DMRs defined solely by one sequence context, DMRs defined by all three contexts merged, DMRs defined by comparisons between maternal and paternal methylation in endosperm, DMRs defined by comparison between mutants and wildtype, and more. These need clearer descriptions of which sets are being referred to throughout the main text and in figure legends. A table summarizing them might help (not in the supplement). Use of consistent and precisely defined terms would help. Stating the number of DMRs along with the name for each set would help a lot, even though this would make for some redundancy. (The number of DMRs in each set not only helps with interpretation but also act as a sort of ID). The reason I put this as a major concern is because the text and figures are difficult to understand, and it is currently hard to evaluate both the results and the authors' conclusions from those results.

      Minor comments

      1. The sRNA results in Figure 2B are difficult to interpret because they do not reveal anything about the number of TEs that have siRNAs overlapping them or their flanks. While the magnitude of some of the highest endosperm sRNA peaks is higher than the embryo peaks, that could be explained by a small number of TEs with large numbers of sRNAs. To make this result more interpretable, we also need some information about how many TEs have a significant number of sRNAs associated with them in endosperm and embryo in each region (e.g., middle, 5', 3', and flanks of TEs). What a "significant number of sRNAs" is would be up to the authors to decide based on the distribution of sRNA counts they observe for TEs. Perhaps the top quartile of TEs? Combined with the same analysis done in parallel with non-ROS1 target TEs, this would reveal whether there is any evidence for ROS1 counteracting sRNA-driven methylation spread from TEs.
      2. The statement "we are likely underestimating the true degree of differential methylation among genotypes" should be validated and partially quantified using a methylation metaplot like Figure 2A, but substitute DMRs for TEs. Related to that, Figure 1B needs an indicator of scale in bp.
      3. The statement "Over half of ROS1 target regions identified in the ros1-3 mutant endosperm were within 1 kb or intersecting a TE (Fig. 1D)" is hard to interpret without some kind of ROS1 non-target regions or whole-genome control comparison. How different are the numbers in Fig. 1D from a random expectation?
      4. The sentence at line 262 is confusing. Is the comparison between dme mutant and ros1 mutant or between different types of regions? And it appears that the comparison value is missing in the "3-5% CG methylation gain..." e.g., "3-5% CG methylation vs 10-20%" or something like that.
      5. The dme mutant data in Figure 5C appear to be key to the model in Figure 7. The relative impact of the dme mutant in the two types of regions should be quantified.
      6. Looks like sRNA methods are missing.
      7. Supplemental Figure 1 is hard to interpret since it only list gene IDs, not gene names.

      The last comments are suggestions for increasing the impact of this study:

      11. Figures 2A and 3B suggest that ROS1 target TEs show demethylation in their flanks but not in the TEs themselves. This is an interesting result. If it is true, more DMRs would be expected in the ROS1 target flanks than in the ROS1 target TEs. Reporting how many ROS1 target TEs have DMRs in them and what proportion have DMRs in their flanking 1-kb regions would answer this question. Given the significance of this result, it also deserves a bit more context: Is the magnitude of increased methylation flanking TEs in ros1 mutant endosperm different than in ros1 mutant leaves or other tissue? Does methylation in TE flanks behave the same way in dme mutant endosperm?
      12. The idea of biallelic demethylation has been theoretically suggested in maize to explain weak overlap between endosperm DMRs and imprinting (Gent et al 2022). If that were true in Arabidopsis, then ROS1 target, biallelicly demethylated loci would be less likely to have imprinted expression than maternally demethylated loci. This prediction could be tested using available data in Arabidopsis.
      13. There is currently no evidence for biological significance of biallelicly demethylated loci. Knowing where they are in the genome might give some hints. A figure like Fig. 1D but specifically showing the biallelicly demethylated DMRs would be valuable.
      14. It is hard to make the comparisons between genotypes and parental genomes in Figure 6 and know what they mean. Maybe a different way of displaying the data would help. Or maybe even a different labeling system could make it a little more accessible.

      Significance

      Demethylation of the maternal genome in endosperm has been the subject of much research because it can result in genomic imprinting of gene expression. The enzymes responsible, DNA glycosylases/lyases, also demethylate DNA in other cell types, where DNA demethylation is not confined to one parental genome (biallelic or biparental as opposed to uniparental demethylation). To the best of my knowledge, the extent or even existence of biallelic demethylation in endosperm has not been studied until now (except for a superficial look in a bioRxiv preprint, https://www.biorxiv.org/content/10.1101/2024.07.31.606038v1). Hemenway and Gehring have carried out a thoughtful and detailed analysis of the topic in Arabidopsis at least as far as it depends on the DNA glycosylase ROS1.

      A limitation is that the study design would miss biallelic demethylation by any of the other three DNA glycosylases in Arabidopsis. A second limitation is that there is no clear biological significance, just some conjecture about evolution. Nonetheless, given the novelty of the topic, biological significance may follow.

      The audience for biallelic DNA demethylation in Arabidopsis endosperm is certainly in the "specialized" category, but its relevance to the larger topic of gene regulation in endosperm will attract a larger audience.

    1. In other words, AI will not enable the creation of quality translations for people who previously lacked that ability. That part still requires a human feel for the linguistic and cultural elements of the translation. But for those who are just looking to get a rough but passable translation (say, for research) it should work most of the time. And for those who would love to create quality translations but face huge opportunity costs and zero financial incentives, AI could lead to new possibilities. The deeper risk is not that AI will replace historians or translators, but that it will convince us we never needed them in the first place. A tool that outputs polished, confident language with no sense of ambiguity or context is appealing to people who think facts will save us. But there is a vast difference between facts and truths. If we come to treat them as interchangeable, we cede interpretation to the machines and narrative power to those who design them. So maybe Microsoft was right after all, just not in the way they think. Historians and translators may be the first to go not because their work is easy to automate, but because the interpretive element of their labor has always been invisible or, when made visible, dismissed as odious human bias. AI will replace them only in the minds of people who never understood what they were doing in the first place. Which, given how things are going, may be enough.

      If you believe in the Fact you are more likely to heuristic your way toward that polished language as a signal of plausibility

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The manuscript by Ozkan et al. presents compelling evidence demonstrating the latent potential of glial precursors of the adult cerebral cortex for neuronal reprogramming. The findings substantially advance our understanding of the potential of endogenous cells in the adult brain to be reprogrammed. Moreover, they describe a molecular cocktail that directs reprogramming toward corticospinal neurons (CSN).

      Strengths:

      Experimentally, the work is compelling and beautifully designed, with no major caveats. The main conclusions are fully supported by the experiments. The work provides a characterization of endogenous progenitors, genetic strategies to isolate them, and proof of concept of exploiting these progenitors' potential to produce a specific desired neuronal type with "a la carte" combination of transcription factors.

      Weaknesses:

      Some issues need to be addressed or clarified before publication. The manuscript requires editing. It is dense and rich in details while in other parts there are a few mistakes.

      We thank the reviewer for their excellent summary and for their extremely positive review of our paper. We are pleased that the experimental design and conclusions were judged to be well supported.

      We have revised the paper to enhance clarity, include additional relevant citations, and refine terminology in some sections of the original version.

      We appreciate the reviewer’s thoughtful review and agree that these revisions enhance the paper.

      Reviewer #2 (Public Review):

      Summary:

      Here the authors show a novel direct neuronal reprogramming model using a very pure culture system of oligodendrocyte progenitor cells and demonstrate hallmarks of corticospinal neurons to be induced when using Neurogenin2, a dominant-negative form of Olig2 in combination with the CSN master regulator Fezf2.

      Strengths:

      This is a major achievement as the specification of reprogrammed neurons towards adequate neuronal subtypes is crucial for repair and still largely missing. The work is carefully done and the comparison of the neurons induced only by Neurogenin 2 versus the NVOF cocktail is very interesting and convincingly demonstrates a further subtype specification by the cocktail.

      Weaknesses:

      As carefully as it is done in vitro, the identity of projection neurons can best be assessed in vivo. If this is not possible, it could be interesting to co-culture different brain regions and see whether neurons reprogrammed with the cocktail indeed preferentially send out axons to innervate a co-cultured spinal cord versus other brain region tissue.

      We appreciate the reviewer’s positive evaluation of our work and their recognition of its significance in advancing neuronal subtype specification through directed differentiation of endogenous progenitors. 

      We agree with the reviewer’s suggestion that a very interesting future stage of this work would be to investigate the projection neuron identity in vivo. We aim to pursue follow-up studies to investigate in vivo integration and connectivity of such neurons generated by directed differentiation from endogenous SOX6+/NG2+ cortical progenitors. As the reviewer insightfully suggests, co-culturing different brain regions with these neurons could offer an alternative strategy to partially assess potential preferential connectivity into cultured spinal cord vs. alternate tissue.

      We agree with the reviewer that future investigation in vivo will further strengthen the implications of this work.

      Reviewer #3 (Public Review):

      Summary:

      Ozkan, Padmanabhan, and colleagues aim to develop a lineage reprogramming strategy towards generating subcerebral projection neurons from endogenous glia with the specificity needed for disease modelling and brain repair. They set out by targeting specifically Sox6-positive NG2 glia. This choice is motivated by the authors' observation that the early postnatal forebrain of Sox6 knockout mice displays marked ectopic expression of the proneural transcription factor (TF) Neurog2, suggesting a latent neurogenic program may be derepressed in NG2 cells, which normally express Sox6. Cultured NG2 glia transfected with a construct ("NVOF") encoding Neurog2, the corticofugal neuron-specifying TF Fezf2, and a constitutive repressor form of Olig2 are efficiently reprogrammed to neurons. These acquire complex morphologies resembling those of mature endogenous neurons and are characterized by fewer abnormalities when compared to neurons induced by Neurog2 alone. NVOF-induced neurons, as a population, also express a narrower range of cortical neuron subtype-specific markers, suggesting narrowed subtype specification, a potential step forward for Neurog2-driven neuronal reprogramming. Comparison of NVOF- and Neurog2-induced neurons to endogenous subcerebral projection neurons (SCPN) also indicates Fezf2 may aid Neurog2 in directing the generation of SCPN-like neurons at the expense of other cortical neuronal subtypes.

      Strengths:

      The report describes a novel, highly homogeneous in vitro system amenable to efficient reprogramming. The authors provide evidence that Fezf2 shapes the outcome of Neurog2-driven reprogramming towards a subcerebral projection neuron identity, consistent with its known developmental roles. Also, the use of the modified RNA for transient expression of Neurog2 is very elegant.

      Weaknesses:

      The molecular characterization of NVOF-induced neurons is carried out at the bulk level and therefore does not allow heterogeneity among NVOF-induced neurons to be fully assessed. The suggestion of a latent neurogenic potential in postnatal cortical glia is only partially supported by the data from the Sox6 knockout. Finally, some of the many exciting implications of the study remain untested.

      Discussion:

      The study has many exciting implications that could be further tested. For example, an ultimate proof of the subcerebral projection neuron identity would be to graft NVOF cells into neonatal mice and study their projections. Another important implication is that Sox6-deficient NG2 glia may not only express Neurog2 but activate a more complete neurogenic programme, a possibility that remains untested here.

      Also, is the subcerebral projection neuron identity dependent on the starting cell population? Could other NG2 glia, not expressing Sox6, also be coaxed by the NVOF cocktail into subcerebral projection neurons? And if not, do they express other (Sox) transcription factors that render them more amenable to reprogramming into other cortical neuron subtypes? The authors state that SOX6-positive NG2 glia are a quiescent progenitor population. Given that NG2 glia are believed to undergo proliferation as a whole, are Sox6-positive NG2 glia an exception to this rule? Finally, the authors seem to imply that subcerebral projection neurons and Sox6-positive NG2 glia are lineage-related. However, direct evidence for this conjecture seems missing.

      We appreciate the reviewer’s thoughtful and detailed review of this work. We especially appreciate the positive evaluation of the work and the highlighting of multiple strengths of our approach, including the role of Fezf2 in refining neuronal subtype identity and the use of modified RNA to enable transient expression of Neurog2.

      We acknowledge the reviewer’s comment that single-cell transcriptomic analysis would indeed provide a more granular view of likely heterogeneity. This current study focuses on investigating the feasibility of directed differentiation of corticospinal-like neurons from endogenous progenitors. Future work employing single-cell sequencing could indeed help delineate the heterogeneity of neurons generated by directed differentiation, and potentially contribute toward identification of potential molecular roadblocks in different subsets.

      Regarding the suggestion that SOX6-deficient NG2+ progenitors might activate a broader neurogenic program, we agree that this is an intriguing possibility. We are currently conducting in-depth investigation of the loss of SOX6 function in NG2+ progenitors, and we aim to submit this quite distinct work for separate publication.

      The reviewer raises an important point about whether SOX6+/NG2+ progenitors and subcerebral projection neurons are indeed normally lineage-related. In the current work, we utilized postnatal cortical SOX6+/NG2+ progenitors that are thought to be largely derived from EMX1+ and GSH2+ ventricular zone neural progenitors. Our unpublished data from the separate study noted above indicate that SOX6 is expressed by both these lineages in vivo. Since subcerebral projection neurons are derived from EMX1+ ventricular zone progenitors (SOX6-expressing), at least some of the SOX6+/NG2+ progenitors are expected to share a lineage relationship with subcerebral projection neurons. While our data strongly suggest such a link, we agree that direct lineage tracing could be pursued in future work.

      Finally, we agree with the reviewer’s suggestion that in vivo transplantation to assess the identity and connectivity of neurons generated by directed differentiation would be very interesting, and is a natural next phase of this work. We aim to pursue such work in future investigations.

      We again thank the reviewer for their insightful comments.

      Reviewer #1 (Recommendations For The Authors): 

      The most important clarification for me concerns the initial description of the progenitors. I think there is a mistake with the transgenic line NG2. The dsRed mouse used in Figure 1C is not described until later in the results describing Figure 2. This was confusing. Moreover, perhaps this is a reason why I get confused and do not understand how the authors conclude that SOX6+ cells are a subset of NG2-positive cells. Panel C shows the opposite. Please correct the description and show the quantification of data in panel 1C.

      We thank the reviewer for their thoughtful review and for highlighting this important point. We appreciate the reviewer pointing out the benefit of further clarity regarding the NG2.DsRed transgenic mouse description in Figure 1C. We have revised the text to clarify the use of the transgenic line and ensure that the DsRed mouse is properly introduced. Additionally, we have further clarified the description explaining the basis for concluding that SOX6+ cells are a subset of NG2+ cells and further integrate this conclusion with the data presented.

      During cell sorting from the cortices of NG2.DsRed mice, we observe two distinct populations of NG2-DsRed+ cells based on fluorescence intensity in FACS: NG2-DsRed “bright” and NG2-DsRed “dim” populations. The NG2-DsRed “dim” population consists of a heterogenous mix of NESTIN+ progenitors, GFAP+ astrocytes/progenitors, a subset of NG2+ cells, and other unidentified cells. In contrast, the DsRed “bright” population includes a broader group of progenitors that also give rise to oligodendrocytes (please see Zhu, Bergles, and Nishiyama 2008), along with pericytes. 

      Previous studies have shown that, while dorsal/pallial VZ progenitors express SOX6 during embryonic development, SOX6 expression becomes restricted to interneurons postnatally (these do not express NG2 proteoglycan; Azim et al., 2009) and to the broader group of NG2+ progenitors that also give rise to oligodendrocytes. The ICC image in Fig. 1C shows bright NG2+ cells in the cortex, many of which express SOX6. Thus, we conclude that SOX6+ cells constitute a subset of NG2-DsRed+ cells. 

      In a similar line, the work is beautiful, but the manuscript can gain a lot from shortening and some more editing. For example:

      (1) In the abstract, the word "inappropriate" should be removed. It seems to me that it is an unnecessary subjective qualification - it is hardly possible that in biology we find repression of something inappropriate.

      We have removed the word “inappropriate”.

      (2) FACS-purify these genetically accessible....establish a pure culture. Genetically accessible is nice, and I understand that it conveys that they can be traced in the mouse, but everything is genetically accessible with the right tool, and perhaps it is more informative to explain which gene or reporter is used for the isolation. These cells are not accessible in humans. Also, I consider it best to remove "pure"; the culture consists of FACS-purified cells.

      We have revised the text to specify the gene/reporter used for isolation instead of using "genetically accessible", and we removed "pure", since FACS purification is already explicitly mentioned.

      (3) In the initial paragraph in the results: "They are exposed to the same morphogen gradients throughout embryonic development, and thus, compared to distant cell types, have similar epigenomic and transcription landscapes." This is proven in the cited publication, but the way it is stated here seems a bit of an unnecessary overstatement. The hypothesis stated after this paragraph is as good as it is with or without this argument.

      We have revised the text and simplified the statement. We agree that the hypothesis remains clear and well-supported without this emphasis.

      (4) In the results section, "two distinct populations of DsRed-positive cells were identified based on fluorescence intensity" - I know it is correct, but when reading the percentages, I was confused because those percentages divided the population into three fractions. What the authors do not explain is that they discard the intermediate-expressing population.

      We appreciate the reviewer highlighting this inadvertent point of confusion. We erred by discussing only the two populations of central interest to us (DsRed-bright and DsRed-dim), and did not explicitly mention the DsRed-negative population. We have now clarified the text to include all three cell populations and their percentages of the total cells in all three populations (in the original manuscript and still now, ~75-78% were DsRed-negative). We have also further clarified that only DsRed-bright cells (identified as progenitors) were used for all subsequent experiments.

      These examples illustrate the type of editing that would be appreciated but which is entirely up to the authors.

      We thank the reviewer for their thoughtful suggestions toward improving clarity and precision. We have incorporated these recommendations, along with suggestions from the other two reviewers, in the revised paper.

      Reviewer #2 (Recommendations For The Authors):

      (1) The authors start their results section by showing in situ hybridization for Ngn2 in control and Sox6KO mice. These control sections do not look convincing, as there is not even some signal in the adult VZ/SVZ region and virtually no background. Please show sections where some positive signal can also be detected in the control sections.

      We agree with the reviewer that making direct comparisons in ISH experiments is an important point. In our ISH experiments, to ensure consistency and appropriate comparisons, we process WT and KO sections together and stop the signal development simultaneously. We could have extended the development time to enhance WT signal to a detectable level, but that would have led to excessive background and over-saturated signal in the KO sections.

      To address the reviewer’s point, we have added a new supplementary figure with an additional pair of WT and KO sections, along with reference data from the Allen Brain Atlas. The WT section shows faint Neurog2 expression in the dentate gyrus region of the hippocampus, while the KO section confirms very substantial upregulation of Neurog2 in the absence of SOX6 function. These additional data enhance the clarity and depth of our results.

      Please see the following link for the Allen Brain Atlas ISH data demonstrating that Neurog2 expression in the postnatal (P4) SVZ/SGZ is inherently low. (https://developingmouse.brainmap.org/experiment/show/100093831). 

      (2) As a hallmark of projection neurons is where they send their axons, it would be important to include a biological assay for this. Of course, in vivo experiments would be great, but if this is not possible, the authors could co-culture sections from the late embryonic cortex, striatum, and spinal cord to see if the reprogrammed neurons preferentially extend their axons towards one of these targets (as normally developing neurons would, see e.g. Bolz et al., 1990).

      We agree with the reviewer’s suggestion that a very interesting future stage of this work would be to investigate the projection neuron identity including connectivity in vivo. We aim to pursue follow-up studies to investigate in vivo integration and connectivity of such neurons generated by directed differentiation from endogenous SOX6+/NG2+ cortical progenitors. As the reviewer insightfully suggests, co-culturing different brain regions with these neurons could offer an alternative strategy to partially assess potential preferential connectivity into cultured spinal cord vs. alternate tissue. This area of investigation is of substantial interest to our lab, and we aim to pursue it in the coming years; it is a very large undertaking by either approach.

      (3) However, if the loss of Sox6 is sufficient for Ngn2 to be upregulated, why did the authors not pursue this approach in their reprogramming experiments? Are these endogenous levels sufficient for reprogramming? Please add some OPC cultures from WT and KO mice to explore their conversion to neurons and possibly combine them with Olig2VP16 and Fezf2.

      We thank the reviewer for this insightful comment and for raising this broader area of inquiry regarding whether SOX6 might be down-regulated to enhance induction of neurogenesis. We are writing a separate manuscript regarding function of SOX6 in these progenitors during normal or molecularly manipulated development. We investigate function of SOX6 using both whole body null mice and a series of conditional null mice. We aim to post that work as a preprint and submit it for review and publication in the coming months. Beyond that work, the potential strategy of downregulating SOX6 function while simultaneously upregulating other molecular controls to refine directed neuronal differentiation is also of substantial interest to us, and we aim to pursue this in follow-up work. Though these are both interesting questions/topics, we respectfully submit that these broad areas of parallel, complex, and future investigation would substantially expand the scope of work in this paper, so we aim to address them in separate studies.

      (4) Please indicate independent biological replicates as individual data points in all histograms, i.e. also in Figure 2K, Figure 4I, S2H.

      We have updated the figure legends indicating the biological replicates, and explained the broad media optimization that was used successfully in all further experiments.

      (5) GFP labelling in Figures S2K-N is not convincing - too high background. Please optimize.

      We have redesigned this figure and now present it as a new supplementary figure, with GFP pseudocolored in gray and enlarged subpanels for improved visualization of cell morphology.

      Reviewer #3 (Recommendations For The Authors):

      This is an extremely well-written manuscript with very exciting implications. Obviously, not all can be tested here. Some of the suggestions are relatively easy and may be worth testing right away, others may require more extensive study in the future. In my view, completing some of the points below could make this paper a landmark study.

      I start with the key questions:

      (1) Do grafted NVOF cells give rise to subcerebral projection neurons in vivo?

      We agree with the reviewer’s suggestion that a very interesting future stage of this work would be to investigate the projection neuron identity including connectivity in vivo. As noted above in response to Reviewer 2, we aim to pursue follow-up studies to investigate in vivo integration and connectivity of such neurons generated by directed differentiation from endogenous SOX6+/NG2+ cortical progenitors. This question is of substantial interest to us, and we aim to pursue it in the coming years; as the reviewer notes, this is a very large undertaking, and beyond the scope of this paper.

      (2) What is the fate of the Sox6 deficient NG2 glia that express Neurog2? One could isolate these cells and subject them to scRNA sequencing to see how far neurogenesis proceeds without addition of exogenous factors.

      We thank the reviewer for this insightful question. As noted in our response to Reviewer 2, we are writing a separate manuscript regarding function of SOX6 in these progenitors during normal or molecularly manipulated development. We investigate function of SOX6 using both whole body null mice and a series of conditional null mice. We aim to post that work as a preprint and submit it for review and publication in the coming months, likely in early summer. We respectfully submit that this broad area of parallel, complex investigation would substantially expand the scope of work in this paper and make this paper too complex and multi-directional, so we aim to publish them as separate papers for the benefit of clarity for readers.

      (3) Obviously, what happens to Sox6-deficient (or non-deficient) cells when forced to express NVOF? In this context, it might be fair to cite Felske et al (PLoS Biol, 2023) who report Neurog2 and Fezf2-induced reprogramming in the postnatal brain. In their model, these authors did not distinguish between converted astrocytes and NG2 glia. Thus, some of the reprogrammed cells may comprise the SOX6-positive cells described here.

      We thank the reviewer for highlighting for us that we inadvertently omitted referencing the important paper by Felske et al., 2023. We have now included this citation. 

      We thank the reviewer for raising this broader area of inquiry regarding whether SOX6 might be down-regulated to enhance induction of neurogenesis. Beyond the work noted above regarding function of SOX6 in these progenitors during normal or molecularly manipulated development, the potential strategy of downregulating SOX6 function while simultaneously upregulating other molecular controls to refine directed neuronal differentiation is of substantial interest to us, and we aim to pursue this in follow-up work. We again respectfully submit that this area of complex, future investigation should be addressed in future studies.

      Very interesting unaddressed questions include:

      (1) Are Sox6+ NG2 glia of dorsal origin? This is implied but not shown. One could use Emx1Cre lines to assess this. Are Sox6+ glia and subcerebral projection neurons clonally related? This may be more challenging. In this context, it might be again fair to refer to Herrero-Navarro et al (Science Advances 2021) who show that glia lineage related to nearby neurons gives rise to induced neurons with regional specificity.

      The reviewer raises an important question regarding the competence of SOX6+/NG2+ progenitors from distinct origins to generate corticospinal-like neurons by directed differentiation. In ongoing unpublished work, we have identified SOX6 expression by NG2+ progenitors of the three lineages derived from ventricular zone progenitors that express either Emx1, Gsh2, or Nkx2.1 transcription factors. The EMX1+ lineage-derived SOX6+/NG2+ progenitors are directly lineage related to cortical projection neurons. As the reviewer suggests, future experiments could explore potential differences in competence between these three populations.

      We again thank the reviewer for highlighting for us that we also inadvertently omitted referencing the exciting study by Herrero-Navarro that addresses the question of regional heterogeneity within astrocytes and the differential reprogramming potential related to their origins. We have now cited this paper in the manuscript.

      (2) Do other NG2 glia not give rise to subcerebral projection neurons when challenged with NVOF? Thus, how important is Sox6 expression really?

      The question of the specific competence of dorsal/cortical SOX6+/NG2+ progenitors to differentiate into corticospinal-like neurons, and the strategy of downregulating SOX6 function while simultaneously upregulating other molecular controls to direct neuronal differentiation, are both of great interest to us. In pilot experiments, we observed reduced competence of ventrally derived SOX6+/NG2+ progenitors to generate similar neurons. We plan to pursue the SOX6 manipulation in follow-up work.

      (3) Do Sox6+ NG2 glia proliferate like other NG2 glia and thereby represent a replenishable pool of progenitors?

      Yes; as noted in the text shortly after Figure 1, and as presented in Figure S3I-L, these progenitors proliferate robustly in response to the mitogens PDGF-A and FGF2.

      (4) How heterogenous are the NVOF-induced neurons? The bulk highlights the overall specificity, but does not tell whether all cells make it equally well.

      We agree with the reviewer that this is an interesting question. ICC analysis (Fig. 4G-4H) presents the variation in the levels of a few functionally important proteins in the population of NVOF-induced neurons. This could be due to any or all of at least three possibilities: 1) potential diversity in the population of purified SOX6+/NG2+ progenitors; 2) technical variability in the amount of NVOF plasmid delivered to individual progenitors during transfection; and/or 3) natural stochastic TF-level variations generating closely related neuron types, which also occur during normal development. Future experiments could explore these questions.

    1. Reviewer #1 (Public review):

      In this updated and improved manuscript, the authors investigate the role of Aurora Kinase A (AurA) in trained immunity, following a broader drug screening aimed at finding inhibitors of training. They show AurA is important for trained immunity by looking at the different aspects and layers of training using broad omics screening, followed up by a more detailed investigation of specific mechanisms. The authors finalised the investigation with an in vivo MC-38 cancer model where AurA inhibition reduces beta-glucan's antitumour effects.

      Strengths:

      The experimental methods are generally well-described. I appreciate the authors' broad approach to studying different key aspects of trained immunity (from comprehensive transcriptome/chromatin accessibility measurements to detailed mechanistic experiments). Approaching the hypothesis from many different angles inspires confidence in the results. Furthermore, the large drug-screening panel is a valuable tool as these drugs are readily available for translational drug-repurposing research.

      In response to the rebuttal, I would like to compliment and thank the authors for the large amount of work they have done to improve this manuscript. They have removed most of my previous concerns and confusions, and explained some of their approaches in a way that I now agree with them - a great learning opportunity for me as well.

      Weaknesses:

      (1) The authors have adequately responded to my comments and updated the manuscript accordingly.

      (2) The authors have removed most of my concerns. Regarding the use of unpaired tests because that is what is often done in the literature: I still don't agree with this, nor do I think that 'common practice' is a solid argument to justify the approach. However, we can agree to disagree, as I know indeed that many people argue over when paired tests are appropriate in these types of experiments. I appreciate that n=2 for sequencing experiments is justifiable in the way these analyses are used as exploratory screening methods with later experimental validation. I also want to thank the authors for reporting biological replicates where relevant and (I should have mentioned this in my original review also) I appreciate they validate some findings in a separate cell line - many papers neglect this important step.

      (3) The authors have adequately responded to my comments and updated the manuscript accordingly.

      (4) The authors have adequately responded to my comments and updated the manuscript accordingly.

      (5) The authors have adequately responded to my comments and updated the manuscript accordingly.

      (6) The authors have adequately responded to my comments and updated the manuscript accordingly. They have actually gone above and beyond.

      (7) I would like to thank the authors for highlighting this information and taking away my confusion. The authors have adequately responded to my comments and updated the manuscript accordingly.

      (8) The authors have adequately responded to my comments and updated the manuscript accordingly.

      (9) I still think adding the 'alisertib alone' control would be of great added value, but I can see how it is unreasonable to ask the authors to redo those experiments.

      (10) The authors have adequately responded to my comments and updated the manuscript accordingly.

      (11) The authors have adequately responded to my comments and updated the manuscript accordingly.

      (12) I thank the authors for their work to repeat this experiment with my suggestions included. I am convinced by this nice data. I would recommend that the authors put the data from New Figure 4 also in the manuscript as it adds value to the manuscript (unless I just missed it, I don't see it in Figure 6 or the supplement). Not every reader may look at the reviewer comments/rebuttal documents.

    2. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer#1 (Public review):

      This work regards the role of Aurora Kinase A (AurA) in trained immunity. The authors claim that AurA is essential to the induction of trained immunity. The paper starts with a series of experiments showing the effects of suppressing AurA on beta-glucan-trained immunity. This is followed by an account of how AurA inhibition changes the epigenetic and metabolic reprogramming that are characteristic of trained immunity. The authors then zoom in on specific metabolic and epigenetic processes (regulation of S-adenosylmethionine metabolism & histone methylation). Finally, an inhibitor of AurA is used to reduce beta-glucan's anti-tumour effects in a subcutaneous MC-38 model.

      Strengths:

      With the exception of my confusion around the methods used for relative gene expression measurements, the experimental methods are generally well-described. I appreciate the authors' broad approach to studying different key aspects of trained immunity (from comprehensive transcriptome/chromatin accessibility measurements to detailed mechanistic experiments). Approaching the hypothesis from many different angles inspires confidence in the results (although not completely - see weaknesses section). Furthermore, the large drug-screening panel is a valuable tool as these drugs are readily available for translational drug-repurposing research.

      We thank the reviewer for the positive and encouraging comments.

      Weaknesses:

      (1) The manuscript contains factual inaccuracies such as:

      (a) Intro: the claim that trained cells display a shift from OXPHOS to glycolysis based on the paper by Cheng et al. in 2014; this was later shown to be dependent on the dose of stimulation and actually both glycolysis and OXPHOS are generally upregulated in trained cells (pmid 32320649).

      We thank the reviewer for pointing out this inaccuracy, and we have revised our statement to ensure an accurate and up-to-date description in the manuscript. We are aware that trained immunity involves different metabolic pathways, including both glycolysis and oxidative phosphorylation [1, 2]. We also measured the oxygen consumption rate (please see the response to comment 8 of reviewer #1) but observed no obvious increase in oxygen consumption in trained BMDMs in our experimental setting. As the reviewer pointed out, this might depend on the dose of stimulation.

      (b) Discussion: Trained immunity was first described as such in 2011, not decades ago.

      We are sorry for the inaccurate description, and we have corrected the statement in our revised manuscript as “Although the concept of ‘trained immunity’ has been proposed since 2011, the detailed mechanisms that regulate trained immunity are still not completely understood.”

      (2) The authors approach their hypothesis from different angles, which inspires a degree of confidence in the results. However, the statistical methods and reporting are underwhelming.

      (a) Graphs depict mean +/- SEM, whereas mean +/- SD is almost always more informative. (b) The use of 1-tailed tests is dubious in this scenario. Furthermore, in many experiments/figures the case could be made that the comparisons should be considered paired (the responses of cells from the same animal are inherently not independent due to their shared genetic background and, up until cell isolation, the same host factors like serum composition/microbiome/systemic inflammation etc). (c) It could be explained a little more clearly how multiple testing correction was done and why specific tests were chosen in each instance.

      We sincerely thank the reviewer for this thoughtful comment. (a) The data from animal experiments in which trained immunity was induced in vivo are presented as mean ± SD, while the statistical results from cell-based experiments are presented as mean ± SEM in the revised manuscript. (b) We have replaced the one-tailed test with a two-tailed test (see Figure 3J in the revised manuscript, with updated P value label). We agree that cells derived from the same animal and subjected to different treatment conditions may be deemed paired data. We reanalyzed our data using paired statistical tests. While this led to a slight reduction in statistical significance for some comparisons, the overall trends remained consistent, and our biological interpretation remains unchanged. For in vitro experiments, unpaired statistical tests are commonly used in the literature [3, 4]. Thus, we still report the unpaired test results here. (c) We have provided a detailed description of how multiple comparisons were performed in the revised figure legends.
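
      The distinctions discussed in points (a) and (b) can be illustrated with a short sketch. It uses only the Python standard library and made-up cytokine readouts from three hypothetical mice; the function names and all numbers are assumptions for illustration and are not taken from the manuscript.

```python
import math
import statistics


def sd_and_sem(values):
    """Return (sample SD, SEM) for a list of measurements."""
    sd = statistics.stdev(values)        # sample SD, n - 1 denominator
    sem = sd / math.sqrt(len(values))    # SEM = SD / sqrt(n)
    return sd, sem


def welch_t(a, b):
    """Unpaired (Welch) t statistic for two independent groups."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se


def paired_t(a, b):
    """Paired t statistic; a[i] and b[i] come from the same animal."""
    diffs = [y - x for x, y in zip(a, b)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se


# Hypothetical readouts (arbitrary units): one value per mouse and condition.
control = [100.0, 150.0, 200.0]
treated = [120.0, 175.0, 222.0]

print("SD, SEM (control):", sd_and_sem(control))
print("Welch t:", welch_t(control, treated))
print("Paired t:", paired_t(control, treated))
```

      In this invented data set the animals differ a lot from each other but respond similarly to treatment, so the paired statistic is far larger than the unpaired one. Pairing does not always help: it leaves fewer degrees of freedom (n - 1), so significance can also drop slightly when the between-animal spread is small.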

      (d) Most experiments are done with n = 3, some experiments are done with n = 5. This is not a lot. While I don't think power analyses should be required for simple in vitro experiments, I would be wary of drawing conclusions based on n = 3. It is also not indicated if the data points were acquired in independent experiments. ATAC-seq/RNA-seq was, judging by the figures, done on only 2 mice per group. No power calculations were done for the in vivo tumor model.

      We are sorry for the confusion caused by the descriptions in our figure legends. For the in vivo experiments, we determined the sample size (n=5, where n refers to the number of mice used as biological replicates) by referring to the animal numbers used for similar experiments in the literature. According to a reported resource equation approach for calculating sample size in animal studies [5], n=5-7 is suitable for most of our mouse experiments. The in vitro cell assays were performed as at least three independent experiments (BMs isolated from different mice), each experiment was independently replicated at least three times, and points represent biological replicates in our revised manuscript. In Figure 1A, 5 biological replicates are presented to carefully determine a working concentration of alisertib that would not significantly affect the viability of trained macrophages and that was subsequently used in all related cell-based experiments. As for the sequencing data, we acknowledge the reviewer's concern regarding the small sample size (n=2) in our RNA-seq/ATAC-seq experiments. We consider the sequencing experiments mainly as an exploratory/screening approach, and we performed rigorous quality control and normalization of the sequencing data to ensure the reliability of our findings. For RNA-seq data analysis, we referred to the DESeq2 manual, which specifies that its statistical framework is based on the negative binomial distribution and is capable of robustly inferring differential gene expression with a minimum of two replicates per group. Therefore, the inclusion of two replicates per group was deemed sufficient for our analysis. Nevertheless, the genomic and transcriptome sequencing data were used primarily for preliminary screening, and the candidates have been extensively validated through additional experiments. For example, we conducted ChIP followed by qPCR to detect the enrichment of active histone modifications in the Il6 and Tnf regions, further verifying the increased accessibility of trained immunity-induced inflammatory genes.
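
      For readers unfamiliar with the resource equation approach cited as reference [5], the following is a minimal sketch of the calculation, assuming a simple one-way comparison of k groups in which the error degrees of freedom, DF = k(n - 1), should fall between 10 and 20. The function name and the choice of k are assumptions for illustration; the response does not state which design or number of groups was used.

```python
import math


def resource_equation_range(k, df_min=10, df_max=20):
    """Per-group sample size range for a one-way comparison of k groups.

    Error degrees of freedom are DF = k * (n - 1), so n = DF / k + 1.
    The lower bound is rounded up and the upper bound rounded down so
    that DF stays inside [df_min, df_max].
    """
    n_min = math.ceil(df_min / k + 1)
    n_max = math.floor(df_max / k + 1)
    return n_min, n_max


for k in (2, 3, 4):  # hypothetical numbers of groups
    print(k, "groups ->", resource_equation_range(k), "animals per group")
```

      Under these assumptions, a three-group design yields a range of 5 to 7 animals per group, and the acceptable range narrows as more groups are added.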

      (e) Furthermore, the data spread in many experiments (particularly BMDM experiments) is extremely small. I wonder if these are true biological replicates, meaning each point represents BMDMs from a different animal? (disclaimer: I work with human materials where the spread is of course always much larger than in animal experiments, so I might be misjudging this.).

      Thanks for your comments. In our initially submitted manuscript, some of the statistical results were presented as the representative data (technical replicates) from one of three independent biological replicates (including BMDMs experiments showing the suppression and rescue experiments of trained immunity under different inhibitors or activators, see original Figure 1B-C, Figure 5D, and Figure 5H, also related to Figure 1B-C, Figure 5D, and Figure 5H respectively in our revised manuscript) while other experimental data are biological replicates including CCK8 experiment, metabolic assay and ChIP-qPCR. In response to your valuable suggestion, we have revised the manuscript to present all statistical results as biological replicates from three independent experiments (presented as mean ± SEM), and we have provided all the original data for the statistical analysis results (please see Appendix 2 in resubmit system).

      (3) Maybe the authors are reserving this for a separate paper, but it would be fantastic if the authors would report the outcomes of the entire drug screening instead of only a selected few. The field would benefit from this as it would save needless repeat experiments. The list of drugs contains several known inhibitors of training (e.g. mTOR inhibitors) so there must have been more 'hits' than the reported 8 Aurora inhibitors.

      Thank you for your suggestion. We have briefly reported the outcomes of the entire drug screening in the revised manuscript. The targets of our epigenetic drug library are primarily categorized into several major classes, including the Aurora kinase family, histone methyltransferases and demethylases (HMTs and KDMs), acetyltransferases and deacetylases (HDACs and SIRTs), the JAK-STAT kinase family, AKT/mTOR/HIF, the PARP family, and the BRD family (see New Figure 1, related to Figure 1-figure supplement 1B in revised manuscript). Notably, previous studies have reported that inhibition of the mTOR-HIF1α signaling axis suppressed trained immunity [6]. Our screening results also indicated that most inhibitors targeting mTOR-HIF1α signaling exhibit an inhibitory effect on trained immunity. Additionally, cyproheptadine, a specific inhibitor of SETD7, which was previously reported to be required for trained immunity [7], was also identified in our screening.

      JAK-STAT signaling is closely linked to the interferon signaling pathway, and certain JAK kinase inhibitors also target SYK and TYK kinases. A previous drug library screening study has reported that SYK inhibitors suppressed trained immunity [8]. Consistently, our screening results reveal that most JAK kinase inhibitors exhibit suppressive effects on trained immunity.

      BRD (Bromodomain) and Aurora are well-established kinase families in the field of oncology. Compared to BRD, the clinical applications of the Aurora kinase inhibitor are still at early stage. In previous studies using inflammatory arthritis models where trained immunity was established, both adaptive and innate immune cells exhibited upregulated expression of AurA [9, 10]. Our study provides further evidence supporting an essential role of AurA in trained immunity, showing that AurA inhibition leads to the suppression of trained immunity.

      (4) Relating to the drug screen and subsequent experiments: it is unclear to me in supplementary figure 1B which concentrations belong to secondary screens #1/#2 - the methods mention 5 µM for the primary screen and "0.2 and 1 µM" for secondary screens, is it in this order or in order of descending concentration?

      Thank you for your comments, and we are sorry for the unclearly labelled results in the original manuscript (related to Figure 1-supplement 1C). We performed the secondary drug screen at two concentrations; the drug concentrations corresponding to secondary screens #1 and #2 are 0.2 and 1 μM, respectively. They are listed in this order, not in order of descending concentration.

      (a) It is unclear if the drug screen was performed with technical replicates or not - the supplementary figure 1B suggests no replicates and quite a large spread (in some cases lower concentration works better?)

      Thank you for your question. The drug screen was performed without technical replicates because it served as an initial screen, and every hit needs to be verified individually in follow-up experiments. Yes, we observed that a lower concentration works better in some cases. We speculate that this might be because a drug's effect correlates positively with its concentration only within a specific range. In our primary screening, however, we simply chose one concentration for all the drugs. This is a limitation of our screening, and we acknowledge it in our discussion.

      (5) The methods for (presumably) qPCR for measuring gene expression in Figure 1C are missing. Which reference gene was used and is this a suitably stable gene?

      We are sorry for this omission. The mRNA expression of Il6 and Tnf in trained BMDMs was analyzed by quantitative real-time PCR using the ΔΔCt method, and the results were normalized to untrained BMDMs with Actb (β-actin) as the reference gene, a well-documented gene with stable expression in macrophages. We have added the description of the gene expression measurements to the Material and Methods in our revised manuscript.
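
      As a reminder of what the ΔΔCt calculation involves, here is a minimal sketch. It assumes roughly 100% amplification efficiency (hence the base of 2), and the function name and Ct values are invented for illustration rather than taken from the manuscript.

```python
def fold_change_ddct(ct_target_treated, ct_ref_treated,
                     ct_target_control, ct_ref_control):
    """Relative expression by the 2^(-ddCt) method.

    dCt  = Ct(target gene) - Ct(reference gene), computed per sample
    ddCt = dCt(treated) - dCt(control)
    """
    d_ct_treated = ct_target_treated - ct_ref_treated
    d_ct_control = ct_target_control - ct_ref_control
    dd_ct = d_ct_treated - d_ct_control
    return 2.0 ** (-dd_ct)


# Hypothetical Ct values: Il6 as target, Actb as reference gene.
fold = fold_change_ddct(
    ct_target_treated=24.0, ct_ref_treated=16.0,   # trained BMDMs
    ct_target_control=27.0, ct_ref_control=16.0,   # untrained BMDMs
)
print("Il6 fold change, trained vs untrained:", fold)
```

      The reviewer's question about reference gene stability matters because any treatment-induced shift in the reference Ct feeds directly into ΔCt and is then amplified exponentially in the fold change.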

      (6) From the complete unedited blot image of Figure 1D it appears that the p-Aurora and total Aurora are not from the same gel (discordant number of lanes and positioning). This could be alright if there are no/only slight technical errors, but I find it misleading as it is presented as if the actin (loading control to account for aforementioned technical errors!) counts for the entire figure.

      We are very sorry for this omission. In the original data, p-Aurora and total Aurora were from different gels. In this experiment, the membrane stripping/reprobing after the p-Aurora antibody did not work well, so we could not obtain all results from one gel. We therefore ran another gel using the same samples to blot with the anti-Aurora antibody and used β-tubulin as the loading control for total AurA (please see New Figure 2A, also related to original Figure 1D). We have provided the source data for β-tubulin from the same membrane as total AurA (please see Figure 1-source data). To avoid any potentially misleading presentation, we have repeated this experiment and updated this figure (please see New Figure 2B, also related to Figure 1D in revised manuscript) with phospho-AurA, total AurA and β-actin from the same gel. The bands for phospho-AurA (T288) were obtained using a new antibody (Invitrogen, 44-1210G), and we have revised this information in Material and Methods. We have provided data from three biological replicates to confirm the experimental result (see also New Figure 2B, related to Figure 1D in revised manuscript), and the raw data have been added to the source data for Figure 1.

      (7) Figure 2: This figure highlights results that are by far not the strongest ones - I think the 'top hits' deserve some more glory. A small explanation on why the highlighted results were selected would have been fitting.

      We appreciate the valuable suggestion. Figure 2 (see also Figure 2 in revised manuscript) presents information on the chromatin landscape affected by AurA inhibition to confirm that AurA inhibition impaired the activation by β-glucan of key genes involved in pro-inflammatory macrophage activation. In Figure 2B we highlighted a few classical downregulated GO terms, including “regulation of growth”, “myeloid leukocyte activation” and “MAPK cascade” (see also Figure 2B in revised manuscript). Among these, “regulation of growth” is a known function of Aurora A and simply shows that alisertib indeed inhibited Aurora A function in vivo as expected. “Myeloid leukocyte activation” and “MAPK cascade” show the impaired accessibility of pro-inflammatory genes. We highlighted downregulated KEGG terms such as “JAK-STAT signaling pathway”, “TNF signaling pathway” and “NF-kappa B signaling pathway” in Figure 2F (see also Figure 2F in revised manuscript), as these pathways are highly relevant to trained immunity. Meanwhile, the KEGG term “FOXO signaling pathway” (see also Figure 2G in revised manuscript) was highlighted to confirm the anti-inflammatory effect of alisertib in trained BMDMs, which is further illustrated in Figure 5 (see also Figure 5 in revised manuscript, illustrating that FOXO3 acts downstream of AurA). Some top hits, such as “positive regulation of cell adhesion” in Figure 2B and “pathway of neurodegeneration” and "ubiquitin mediated proteolysis" in Figure 2F and 2G, are not directly related to trained immunity. We therefore did not highlight them, but they may provide useful information for future investigation of other functions of Aurora A.

      (8) Figure 3 incl supplement: the carbon tracing experiments show more glucose-carbon going into TCA cycle (suggesting upregulated oxidative metabolism), but no mito stress test was performed on the seahorse.

      We appreciate this question raised by the reviewer. We previously performed Seahorse XF analysis to measure the oxygen consumption rate (OCR) in β-glucan-trained BMDMs. The results showed no obvious increase in oxidative phosphorylation (OXPHOS), as indicated by OCR, under β-glucan stimulation (related to Figure 3-figure supplement 1A), although the carbon tracing experiments showed more glucose-derived carbon going into the TCA cycle. We speculate that the observed discrepancy between increased glucose incorporation into the TCA cycle and unchanged OXPHOS may reflect a characteristic metabolic reprogramming induced by trained immunity. The increased incorporation of glucose-derived carbon into the TCA cycle likely serves a biosynthetic purpose, supplying intermediates for anabolic processes, rather than augmenting mitochondrial respiration [6]. Moreover, the unchanged OXPHOS may be attributed to a reduced reliance on fatty acid oxidation (catabolism), with glucose-derived acetyl-CoA becoming the predominant substrate. Thus, while overall OXPHOS remains stable, the glucose contribution to the TCA cycle increases. This is in line with reports showing that trained immunity promotes fatty acid synthesis (anabolism) [11]. Alternatively, the partial decoupling of the TCA cycle from OXPHOS could result from the diversion of intermediates such as fumarate out of the cycle.

      Oxygen consumption rate (OCR) after a mito stress test upon sequential addition of oligomycin (Oligo, 1 μM), FCCP (1 mM), and Rotenone/antimycin (R/A, 0.5 μM), in BMDMs with different treatment for 24 h. β-glucan, 50 μg/mL; alisertib, 1 μM.

      (9) Inconsistent use of an 'alisertib-alone' control in addition to 'medium', 'b-glucan', 'b-glucan + alisertib'. This control would be of great added value in many cases, in my opinion.

      Thank you for your comment. We appreciate that including “alisertib-alone” group throughout all the experiments may further solidify the results. We set the aim of the current study to investigate the role of Aurora kinase A in trained immunity. Therefore, in most settings, we did not include the group of alisertib only without β-glucan stimulation.

      (10) Figure 4A: looking at the unedited blot images, the blot for H3K36me3 appears in its original orientation, whereas other images appear horizontally mirrored. Please note, I don't think there is any malicious intent but this is quite sloppy and the authors should explain why/how this happened (are they different gels and the loading sequence was reversed?)

      Thank you for pointing out this error. After checking the original data, we found that we had indeed misassembled the orientation of several blots in the original data submitted. We went through the assembly process and found that the blots in the original data were oriented according to the loading sequence but were not saved correctly, so the orientations in Figure 4A were not consistent with the unedited blot images. We are sorry for this careless mistake, and we have double-checked that all the blots are correctly assembled in the revised manuscript. We also provided three replicates of the Western blot results showing that the level of H3K36me3 in trained BMDMs was inhibited by alisertib (as seen in New Figure 7 at recommendation 2 of reviewer #2).

      (11) For many figures, for example prominently figure 5, the text describes 'beta-glucan training' whereas the figures actually depict acute stimulation with beta-glucan. While this is partially a semantic issue (technically, the stimulation is 'the training-phase' of the experiment), this could confuse the reader.

      Thanks for the reviewer’s suggestion and we have reorganized our language to ensure clarity and avoid any inconsistencies that might lead to misunderstanding.

      (12) Figure 6: Cytokines, especially IL-6 and IL-1β, can be excreted by tumour cells and have pro-tumoral functions. This is not likely in the context of the other results in this case, but since there is flow cytometry data from the tumour material it would have been nice to see also intracellular cytokine staining to pinpoint the source of these cytokines.

      Thanks for the reviewer’s suggestion. In Figure 6, we performed assays in a mouse tumor model and found that trained immunity upregulated the levels of cytokines such as IL-6 in tumor tissue, which were downregulated by alisertib administration. To rule out the possibility that the detected cytokines, such as IL-6, came from tumor cells, we performed intracellular cytokine staining of single cells isolated from tumor tissues (please see New Figure 4). The results showed that only a small fraction of non-immune cells (CD45<sup>-</sup> population) expressed IL-6 (0.37% ± 0.11%), whereas a significantly higher proportion of IL-6-positive cells was observed among the CD45<sup>+</sup> population (deemed immune cells, 13.66% ± 1.82%), myeloid cells (CD45<sup>+</sup>CD11b<sup>+</sup>, 15.60% ± 2.19%), and, in particular, macrophages (CD45<sup>+</sup>CD11b<sup>+</sup>F4/80<sup>+</sup>, 37.24% ± 3.04%). These findings strongly suggest that immune cells, especially macrophages, are the predominant source of IL-6 within the tumor microenvironment. Moreover, we also detected a higher IL-6-positive population in myeloid cells and macrophages (please see Figure 6I in revised manuscript).

      Reviewer#2 (Public review):

      Summary:

      This manuscript investigates the inhibition of Aurora A and its impact on β-glucan-induced trained immunity via the FOXO3/GNMT pathway. The study demonstrates that inhibition of Aurora A leads to overconsumption of SAM, which subsequently impairs the epigenetic reprogramming of H3K4me3 and H3K36me3, effectively abolishing the training effect.

      Strengths:

      The authors identify the role of Aurora A through small molecule screening and validation using a variety of molecular and biochemical approaches. Overall, the findings are interesting and shed light on the previously underexplored role of Aurora A in the induction of β-glucan-driven epigenetic change.

      We thank the reviewer for the positive and encouraging comments.

      Weaknesses:

      Given the established role of histone methylations, such as H3K4me3, in trained immunity, it is not surprising that depletion of the methyl donor SAM impairs the training response. Nonetheless, this study provides solid evidence supporting the role of Aurora A in β-glucan-induced trained immunity in murine macrophages. The in vivo data on the antitumor effect of trained immunity are insufficient to support the final claim, as alisertib could inhibit Aurora A in cell types other than myeloid cells.

      We appreciate the question raised by the reviewer. Though SAM generally acts as a methyl donor, whether the epigenetic reprogramming in trained immunity is directly linked to SAM metabolism had not been formally tested previously. In our study, we provide evidence suggesting that SAM maintenance is necessary to support trained immunity. As for the in vivo tumor model, we agree that alisertib may inhibit Aurora A in many cell types besides myeloid cells. To further address the reviewer’s concern, we have performed the suggested bone marrow transplantation experiment (trained mice as donors and naïve mice as recipients) to verify the contribution of myeloid cell-mediated trained immunity to the antitumor effect (please see New Figure 8, also related to Figure 6C, 6D and Figure 6-figure supplement 1B and 1C in revised manuscript).

      Reviewer #1 (Recommendations for the authors):

      Some examples of spelling errors and other mistakes (by far not a complete list):

      (a) Introduction, second sentence: reads as if Candida albicans (which should be italicised and capitalised properly) and BCG are microbial polysaccharide components.

      (b) Methods: ECAR is ExtraCellular Acidification Rate, not 'Extracellular Acid Ratio'

      (c) Figure 2C: β-glucan is misspelled in the graph title.

      (d) TNFα has been renamed to 'TNF' for a long time now.

      (e) Inconsistent use of Tnf and Tfnα (the correct gene symbol is Tnf) (NB: this field does not allow me to italicise gene symbols)

      (f) Figure supplement 1B: 'secdonary'

      (g) Caption of figure 4: "Turkey's multiple-comparison test"

      (h) etc

      I would ask the authors that they please go over the entire manuscript very carefully to correct such errors.

      We apologize for these errors and careless mistakes. We greatly appreciate your suggestions and have carefully proofread the revised manuscript to make sure that no further mistakes remain.

      Please also address the points I raised in the public review about statistical approaches. Even more important than the relatively low 'n' is my question about biological replicates. Please clarify what you mean by 'biological replicate'. If you are able to repeat at least the in vitro experiments (if this is too much work pick the most important ones) a few more times this would really strengthen the results.

      Thank you for your comment. Our biological replicates refer to independently repeated experiments using bone marrow cells isolated from different mice, and n represents the number of mice used. We repeated each experiment at least three times using BMDMs isolated from different mice (n=3, biological replicates). Specifically, we repeated several in vitro experiments showing that inhibition of AurA upregulated GNMT in trained BMDMs, that the transcription factor FOXO3 acted as a key protein in AurA-mediated GNMT expression to control trained immunity, and that an mTOR agonist rescued trained immunity inhibited by alisertib (see New Figure 5, related to Figure 5B-C, Figure 5H in revised manuscript). Additionally, we have provided data with three biological replicates to show the β-glucan-induced phosphorylation of AurA (see comment 6 of reviewer #1) and the changes in histone modification markers under AurA inhibition and GNMT deficiency (see recommendation 2 of reviewer #2). We also repeated the in vivo tumor model to analyze intratumoral cytokines (see recommendation 12 of reviewer #1).

      Finally: the authors report 'no funders' during submission, but the manuscript contains funding details. Please modify this in the eLife submission system if possible.

      Thank you for your kind reminder and we have modified funding information in the submission system.

      Reviewer #2 (Recommendations for the authors):

      (1) I have the following methodological and interpretative comments for consideration:

      Aurora A has been previously implicated in M1 macrophage differentiation and NF-κB signaling. What is the effect of Aurora A inhibition on basal LPS stimulation? Considering that β-glucan + Ali also skews macrophage priming towards an M2 phenotype, as shown in Fig. 2E, further clarification on this point would strengthen the study.

      Thanks for your suggestion. A previous study showed that AurA was upregulated in LPS-stimulated macrophages and that inhibition of AurA downregulated M1 markers in LPS-stimulated macrophages through the NF-κB pathway but did not affect IL-4-induced M2 macrophage polarization [12]. Consistently, we also found that AurA inhibition downregulated the inflammatory response upon basal LPS stimulation, as shown by a decreased IL-6 level (see New Figure 6). In original Figure 2E (also related to Figure 2E in revised manuscript), we showed an increased accessibility of Mrc1 and Chil3 under “β-glucan +Ali” before re-challenge, both of which are typical M2 macrophage markers. Motif analysis showed that AurA inhibition would upregulate genes controlled by PPARγ (STAT6 was not predicted). Unlike STAT6, a classical transcription factor that controls IL-4- or IL-13-dependent M2 polarization (M2a), PPARγ mediates polarization toward M2c and mainly controls anti-inflammatory cellular metabolism independently of IL-4 or IL-13. Thus, we speculate that inhibition of AurA might promote non-classical M2 polarization, and the details warrant future investigation.

      (2) In Figure 4A, it looks like that H3K27me3 is also significantly upregulated by β-glucan and inhibited by Ali. How many biological replicates were performed for these experiments? It would be beneficial to include densitometric analyses to visualize differences across multiple Western blot experiments for better reproducibility and quantitative assessment. In addition, what is the effect of treatment of Ali alone on the epigenetic profiling of macrophages?

      We are sorry for this confusion. Each experiment was performed with at least three independent biological replicates. In original Figure 4-figure supplement 1 (also related to Figure 4-figure supplement 1 in the revised manuscript), we presented the densitometric analysis results from three independent Western blot experiments, which showed that β-glucan did not affect H3K27me3 levels under our experimental conditions. Data from three biological replicates for the histone modifications are shown as follows (New Figure 7, related to Figure 4-figure supplement 1 in revised manuscript). We appreciate that an “Ali alone” assay in macrophages may add more value to the findings. We set the aim of the current study to investigate the role of Aurora kinase A in trained immunity, and we know that alisertib itself would not induce or suppress trained immunity. Therefore, in most settings, we did not test the effect of alisertib alone without β-glucan stimulation.

      (3) The IL-6 and TNF concentrations exhibit considerable variability (Fig. 3K and Fig. 5H), ranging from below 10 pg/mL to 500-1000 pg/mL. Please specify the number of replicates for these experiments and provide more detail on how variability was managed. Including this information would enhance the robustness of the conclusions.

      Thank you for your comment. These experiments were replicated at least three times using BMDMs isolated from different mice. The observed variations in cytokine concentrations may be attributed to factors such as differences in cell density, variability among individual mice, and the passage number of the MC38 cells used for supernatant collection. We have prepared a new batch of BMDMs, repeated the experiment and provided consistent results in the revised manuscript (please see Figure 5H in revised manuscript). Data for biological replicates have been provided (please see Appendix 2 in resubmit system).

      (4) The impact of Aurora A inhibition on β-glucan-induced anti-tumor responses appears complex. Specifically, GNMT expression is significantly upregulated in F4/80- cells, with stronger effects compared to F4/80+ cells as seen in Fig. 6D. To discern whether this is due to the abolishment of trained immunity in myeloid cells or an effect of Ali on tumor cells which inhibit tumor growth, I suggest performing bone marrow transplantation. Transplant naïve or trained donor BM into naïve recipients, followed by MC38 tumor transplantation, to clarify the mechanistic contribution of trained immunity versus off-target effects.

      Thanks for your valuable suggestion. Following your suggestion, we performed bone marrow transplantation to clarify that alisertib acts on the BM cells to inhibit the anti-tumor effect induced by trained immunity (see New Figure 8, related to Figure 6C-D in the revised manuscript). As the results below show, transplantation of trained BM cells conferred antitumor activity in recipient mice, whereas trained BM cells treated with alisertib lost this activity, further demonstrating that alisertib inhibited AurA in trained BM cells to impair their antitumor activity.

      References

      (1) Ferreira, A.V., et al., Metabolic Regulation in the Induction of Trained Immunity. Semin Immunopathol, 2024. 46(3-4): p. 7.

      (2) Keating, S.T., et al., Rewiring of glucose metabolism defines trained immunity induced by oxidized low-density lipoprotein. J Mol Med (Berl), 2020. 98(6): p. 819-831.

      (3) Cui, L., et al., N(6)-methyladenosine modification-tuned lipid metabolism controls skin immune homeostasis via regulating neutrophil chemotaxis. Sci Adv, 2024. 10(40): p. eadp5332.

      (4) Yu, W., et al., One-Carbon Metabolism Supports S-Adenosylmethionine and Histone Methylation to Drive Inflammatory Macrophages. Mol Cell, 2019. 75(6): p. 1147-1160 e5.

      (5) Arifin, W.N. and W.M. Zahiruddin, Sample Size Calculation in Animal Studies Using Resource Equation Approach. Malays J Med Sci, 2017. 24(5): p. 101-105.

      (6) Cheng, S.C., et al., mTOR- and HIF-1α-mediated aerobic glycolysis as metabolic basis for trained immunity. Science, 2014. 345(6204): p. 1250684.

      (7) Keating, S.T., et al., The Set7 Lysine Methyltransferase Regulates Plasticity in Oxidative Phosphorylation Necessary for Trained Immunity Induced by β-Glucan. Cell Rep, 2020. 31(3): p. 107548.

      (8) John, S.P., et al., Small-molecule screening identifies Syk kinase inhibition and rutaecarpine as modulators of macrophage training and SARS-CoV-2 infection. Cell Rep, 2022. 41(1): p. 111441.

      (9) Glant, T.T., et al., Differentially expressed epigenome modifiers, including aurora kinases A and B, in immune cells in rheumatoid arthritis in humans and mouse models. Arthritis Rheum, 2013. 65(7): p. 1725-35.

      (10) Jeljeli, M.M. and I.E. Adamopoulos, Innate immune memory in inflammatory arthritis. Nat Rev Rheumatol, 2023. 19(10): p. 627-639.

      (11) Ferreira, A.V., et al., Fatty acid desaturation and lipoxygenase pathways support trained immunity. Nat Commun, 2023. 14(1): p. 7385.

      (12) Ding, L., et al., Aurora kinase A regulates M1 macrophage polarization and plays a role in experimental autoimmune encephalomyelitis. Inflammation, 2015. 38(2): p. 800-11.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      The paper is well written and the figures well laid out. The methods are easy to follow, and the rationale and logic for each experiment are clear. The introduction sets the scene well, and the discussion is appropriate. The summary sentences throughout the text help the reader.

      The authors have done a lot of work addressing my previous concerns and those of the other Reviewers.

      We are pleased that the revised manuscript satisfactorily addresses the previous concerns of the reviewer.

      Reviewer #2 (Public review):

      Summary

      Le Roy et al quantify wing morphology and wing kinematics across twenty eight and eight hoverfly species, respectively; the aim is to identify how weight support during hovering is ensured across body sizes. Wing shape and relative wing size vary non-trivially with body mass, but wing kinematics are reported to be size-invariant. On the basis of these results, it is concluded that weight support is achieved solely through size-specific variations in wing morphology, and that these changes enabled hoverflies to decrease in size. Adjusting wing morphology may be preferable compared to the alternative strategy of altering wing kinematics, because kinematics may be subject to stronger evolutionary and ecological constraints, dictated by the highly specialised flight and ecology of the hoverflies.

      Strengths

      The study deploys a vast array of challenging techniques, including flight experiments, morphometrics, phylogenetic analyses, and numerical simulations; it so illustrates both the power and beauty of an integrative approach to animal biomechanics. The question is well motivated, the methods appropriately designed, and the discussion elegantly places the results in broad biomechanical, ecological, and evolutionary context.

      We thank the reviewer for appreciating the strengths of our study.

      Weaknesses

      (1) In assessing evolutionary allometry, it is key to pinpoint the variation expected from changes in size alone. The null hypothesis for wing morphology is well-defined (isometry), but the equivalent predictions for kinematic parameters, although specified, are insufficiently justified, and directly contradict classic scaling theory. A detailed justification of the "kinematic similarity" assumption, or a change in the null hypothesis, would substantially strengthen the paper, and clarify its evolutionary implications.

      We agree with the reviewer that a clearly articulated null hypothesis is crucial for interpreting scaling relationships. On carefully reviewing our manuscript, we realized that we had not stated one anywhere, which might have led to misinterpretation. In the revised manuscript, we therefore now explicitly state our newly defined null hypotheses (lines 120–125, 340-352) and how we tested them (lines 359-360).

      In fact, we define two alternative null hypotheses: (1) weight support is maintained across sizes using allometric scaling of wing morphology only, and thus wingbeat kinematics are kept constant (kinematic similarity); (2) weight support is maintained across sizes using allometric scaling of wingbeat kinematics, while wing morphology scales isometrically (morphological similarity).

      According to the first null hypothesis, the second-moment-of-area of the wing should scale linearly with body mass, resulting in negative allometry of S<sub>2</sub> relative to body mass (S<sub>2</sub> ∼ m<sup>1</sup> < m<sup>4/3</sup>). According to the second null hypothesis, the product of wingbeat frequency and amplitude should scale with mass under negative allometry (ω ∼ ƒ A<sub>ϕ</sub> ∼ m<sup>-1/6</sup>). We test these alternative null hypotheses using Phylogenetic Generalized Least Squares (PGLS) regressions of the morphology and kinematics metrics against body mass.
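
      The two exponents above can be recovered from a short dimensional argument. The sketch below assumes a quasi-steady force model in which aerodynamic force scales as F ∼ ω<sup>2</sup> S<sub>2</sub> (an assumption made here for illustration; this reply does not restate the force equation), that body weight scales as m<sup>1</sup>, and that S<sub>2</sub> ∼ m<sup>4/3</sup> under geometric similarity. All function and constant names are hypothetical.

```python
from fractions import Fraction

# Assumed quasi-steady model (illustrative, not quoted from the manuscript):
#   F ~ omega^2 * S2, and weight support requires F ~ m^1.
WEIGHT_EXP = Fraction(1)            # weight scales as m^1
S2_ISOMETRIC_EXP = Fraction(4, 3)   # S2 ~ length^4 ~ m^(4/3) under geometric similarity


def s2_exponent_for_constant_kinematics():
    """Exponent b in S2 ~ m^b when omega ~ m^0 and F ~ m^1 (null hypothesis 1)."""
    omega_exp = Fraction(0)
    return WEIGHT_EXP - 2 * omega_exp


def omega_exponent_for_isometric_wings():
    """Exponent a in omega ~ m^a when S2 ~ m^(4/3) and F ~ m^1 (null hypothesis 2)."""
    return (WEIGHT_EXP - S2_ISOMETRIC_EXP) / 2


print(s2_exponent_for_constant_kinematics())  # 1
print(omega_exponent_for_isometric_wings())   # -1/6
```

      Working with exact fractions rather than floats keeps the exponents directly comparable with the values quoted in the text.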

      Furthermore, in our revised manuscript, we now better explain our use of the "kinematic similarity" assumption as a theoretical scenario that is neither physically, biomechanically, nor physiologically sustainable across sizes, but that we merely use to define our null hypotheses (lines 340-351). This is made particularly explicit in a new subsection named “Theoretical considerations” (lines 448–461). Note that our second null hypothesis is thus not that hoverflies fly under "kinematic similarity", but that wingbeat kinematics scale under negative allometry (ω ∼ ƒ A<sub>ϕ</sub> ∼ m<sup>-1/6</sup>), which we assume is in line with the classic scaling theory that the reviewer refers to.

      We sincerely thank the reviewer for making us aware that we did not explicitly state our null hypotheses; we hope that introducing these new null hypotheses removes the confusion about the assumptions in our study.

      (2) By relating the aerodynamic output force to wing morphology and kinematics, it is concluded that smaller hoverflies will find it more challenging to support their body mass--a scaling argument that provides the framework for this work. This hypothesis appears to stand in direct contrast to classic scaling theory, where the gravitational force is thought to present a bigger challenge for larger animals, due to their disadvantageous surface-to-volume ratios. The same problem ought to occur in hoverflies, for wing kinematics must ultimately be the result of the energy injected by the flight engine: muscle. Much like in terrestrial animals, equivalent weight support in flying animals thus requires a positive allometry of muscle force output. In other words, if a large hoverfly is able to generate the wing kinematics that suffice to support body weight, an isometrically smaller hoverfly should be, too (but not vice versa). Clarifying the relation between the scaling of muscle mechanical input, wing kinematics, and weight support would help resolve the conflict between these two contrasting hypotheses, and considerably strengthen the biomechanical motivation and evolutionary interpretation.

      We agree with the reviewer that, due to disadvantageous surface-to-volume ratios, larger animals are more challenged to maintain weight-support, and that this is also the case for hovering hoverflies. In the current manuscript, we do not aim to challenge this universal scaling law of muscle force with body mass.

      Instead, we here focus merely on how the flight propulsion system (wing morphology and kinematics) scale with size, and how this allows hovering hoverflies to maintain weight support. We also fully agree with the reviewer that in theory, “if a large hoverfly is able to generate the wing kinematics that suffice to support body weight, an isometrically smaller hoverfly should be, too”. This aligns in fact with our second null hypothesis where wingbeat frequency should scale as ƒ∼m<sup>-1/6</sup>, to maintain weight support under morphological isometry.

      In our study, we show that this null hypothesis is rejected (lines 511-517, and line 525), and thus hoverflies primarily adjust their wing morphology to maintain in-hovering weight-support across sizes, and wingbeat kinematics is in fact highly conserved. Why this specific flight kinematics is so strongly conserved is not known, and thus a key topic in the discussion section of our manuscript.

      We agree with the reviewer that muscle physiology might be an important driver of these conserved kinematics, but aerodynamic efficiency and maneuverability could also be key aspects here. In our revised manuscript, we now discuss these three aspects in more detail (lines 762-775). We also mention there that we aim to address this outstanding question in future studies, by including muscle physiology in our animal flight studies and by studying the aerodynamics and maneuver kinematics of hoverflies in more detail.

      Moreover, in our revised introduction section, we now also mention explicitly that the capability for maintaining in-flight weight-support scales inversely with animal size, due to the negative isometric scaling of muscle force with body mass (line 52-56). Furthermore, we removed all statements that might suggest the opposite. We hope that these adjustments helped resolve the apparent conflict between our null hypotheses and general muscle scaling laws.

      Finally, in the Discussion section (lines 770-775), we now more explicitly acknowledge that wing motion is ultimately driven by the flight motor musculature, and that a full biomechanical interpretation must consider the scaling of muscle mechanical input alongside wing kinematics and morphology. While we decided to keep the focus primarily on aerodynamic constraints in this study, we agree that future work integrating both aerodynamic and physiological scaling will be essential to fully resolve these contrasting perspectives.

      (3) One main conclusion-- that miniaturization is enabled by changes in wing morphology--is insufficiently supported by the evidence. Is it miniaturization or "gigantism" that is enabled by (or drives) the non-trivial changes in wing morphology? To clarify this question, the isolated treatment of constraints on the musculoskeletal system vs the "flapping-wing based propulsion" system needs to be replaced by an integrated analysis: the propulsion of the wings, is, after all, due to muscle action. Revisiting the scaling predictions by assessing what the engine (muscle) can impart onto the system (wings) will clarify whether non-trivial adaptations in wing shape or kinematics are necessary for smaller or larger hovering insects (if at all!).

      In many ways, this work provides a blueprint for work in evolutionary biomechanics; the breadth of both the methods and the discussion reflects outstanding scholarship.

      In response to the first review round, we have removed all references to “miniaturization,” as our data does not allow us to infer evolutionary trajectories of body size (i.e., whether lineages have become smaller or larger over time). We now frame our conclusion more conservatively: that changes in wing morphology enable small hoverflies to maintain weight support despite the aerodynamic disadvantages imposed by isometric scaling.

      We fully agree that an integrated biomechanical framework, explicitly linking muscle mechanical output with wing kinematics and morphology, would significantly strengthen the study. However, we believe that performing an integrated analysis assessing the scaling of muscle input into the wing is beyond the current scope, which focuses specifically on the aerodynamic consequences of morphological and kinematic variation (see reply above).

      Reviewer #3 (Public review):

      This paper addresses an important question about how changes in wing morphology vs. wing kinematics change with body size across an important group of high-performance insects, the hoverflies. The biomechanics and morphology convincingly support the conclusions that there is no significant correlation between wing kinematics and size across the eight specific species analyzed in depth and that instead wing morphology changes allometrically. The morphological analysis is enhanced with phylogenetically appropriate tests across a larger data set incorporating museum specimens.

      The authors have made very extensive revisions that have significantly improved the manuscript and brought the strength of conclusions in line with the excellent data. Most significantly, they have expanded their morphological analysis to include museum specimens and removed the conclusions about evolutionary drivers of miniaturization. As a result, the conclusion about morphological changes scaling with body size rather than kinematic properties is strongly supported and very nicely presented with a strong complementary set of data. I only have minor textual edits for them to consider.

      We thank the reviewer for this positive feedback. We are pleased to hear that the revised manuscript is satisfactory.

      Reviewer #2 (Recommendations For The Authors):

      My main remaining qualm remains the null hypothesis for the scaling of kinematic parameters - all weaknesses come back to this point. I appreciate that the authors now specify an expectation, but they offer no justification. This is a problem, because the expectation dictates the interpretation of the results and is thus crucial to some of the key claims (including one in the paper title!): the choice made by the authors indeed implies that hovering is harder for small hoverflies, so that the reported changes in size-specific wing morphology are to be interpreted as an adaptation that enables miniaturization. However, why is this choice appropriate over alternatives that would predict the exact opposite, namely that hovering is harder for larger hoverflies?

      In my original review, I suggested that the authors may address this key question by considering the scaling of muscle mechanical output, and provided a quick sketch of what such an argument would look like, both in classic textbook scaling theory, and in the framework of more recent alternative approaches. The authors have decided against an implementation of this suggestion, providing various versions of the following justification in their reply: "our study focuses precisely on this constraint on the wing-based propulsion system, and not on the muscular motor system." I am puzzled by this distinction, which also appears in the paper: muscle is the engine responsible for wing propulsion. How can one be assessed independent of the other? The fact that the two must be linked goes straight to the heart of the difficulty in determining the null hypotheses for the allometry of kinematic and dynamic parameters: they must come from assertions on how muscle mechanical output is expected to vary with size, and so couple muscle mechanical output to the geometry of the wing-based propulsion system. What if not muscle output dictates wing kinematics?

      I fully agree with the authors that null hypotheses on kinematic parameters are debatable. But then the authors should debate their choice, and at least assess the plausibility of its implications (note that the idea of "similarity" in scaling does not translate to equal or invariant, but is tied closely to dimensional analysis - so one cannot just proclaim that kinematic similarity implies no change in kinematic parameters). I briefly return to the same line of argument I laid out in the initial review to provide such an assessment:

      Conservation of energy implies:

      W = 1/2 I ω<sup>2</sup>

      where I is the mass moment of inertia and W is the muscle work output. Under isometry, I ∝ m<sup>5/3</sup>, the authors posit ω ∝ m<sup>0</sup>, and it follows at once that they predict W ∝ m<sup>5/3</sup>. That is, the "kinematic similarity" hypothesis presented in the paper implies that larger animals can do substantially more work per unit body mass than small animals (unless the authors have an argument why wing angular velocity is independent of muscle work capacity, and I cannot think of one). This increase in work output is in contradiction with the textbook prediction, going all the way back to Borelli and Hill: isogeometric and isophysiological animals ought to have a constant mass-specific work output. So why, according to the authors, is this an incorrect expectation, i.e., how do they justify the assumption ω ∝ m<sup>0</sup> and its implication W ∝ m<sup>5/3</sup>? How can larger animals do more mass-specific work, or, equivalently, what stops smaller animals from delivering the same mass-specific work? If non-trivial adaptations such as larger relative muscle mass enable larger animals to do more work, how does this fit within the interpretation suggested by the authors that the aerodynamics of hovering require changes in small animals?
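
      The exponent bookkeeping in this step can be checked with a few lines of exact arithmetic. The sketch below only restates the relation W = 1/2 I ω² in terms of mass exponents, using the values given in the paragraph above; the helper name is hypothetical.

```python
from fractions import Fraction


def work_exponent(inertia_exp, omega_exp):
    """Mass exponent of W in W = 1/2 * I * omega^2,
    given I ~ m^inertia_exp and omega ~ m^omega_exp."""
    return inertia_exp + 2 * omega_exp


inertia_exp = Fraction(5, 3)  # I ~ m * length^2 ~ m^(5/3) under isometry
omega_exp = Fraction(0)       # kinematic similarity as posited: omega ~ m^0

w_exp = work_exponent(inertia_exp, omega_exp)
mass_specific_exp = w_exp - 1  # exponent of W / m

print(w_exp)              # 5/3
print(mass_specific_exp)  # 2/3
```

      A positive mass-specific exponent is what the argument above flags: under these assumptions, work per unit body mass would have to grow with size.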

      A justification of the kinematic similarity hypothesis, alongside answers to the above questions, is necessary, not only to establish a relation to classic scaling theory, but also because a key claim of the paper hinges on the assumed scaling relationship: that changes in wing morphology enable hovering in small hoverflies. If I were to believe Borelli, Hill and virtually all biomechanics textbooks, the opposite should be the case: combining constant mass-specific work output with eq. 1, one retrieves F ∝ m<sup>2/3</sup>, so that weight support presents a bigger challenge for larger animals; the allometry of wing morphology should then be seen as an adaptation that enables hovering in larger hoverflies - the exact opposite of the interpretation offered by the authors.
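
      The F ∝ m<sup>2/3</sup> result can be reproduced as a minimal sketch, under two assumptions that are not spelled out in this comment: that "eq. 1" of the paper has the quasi-steady form F ∼ ω<sup>2</sup> S<sub>2</sub>, and that S<sub>2</sub> ∼ m<sup>4/3</sup> and I ∼ m<sup>5/3</sup> under isometry. Function and constant names are hypothetical.

```python
from fractions import Fraction

INERTIA_EXP = Fraction(5, 3)  # I ~ m^(5/3) under isometry
S2_EXP = Fraction(4, 3)       # S2 ~ m^(4/3) under isometry (assumed)


def omega_exponent(work_exp):
    """Solve W ~ I * omega^2 for the mass exponent of omega."""
    return (work_exp - INERTIA_EXP) / 2


def force_exponent(omega_exp):
    """Mass exponent of F, assuming eq. 1 has the form F ~ omega^2 * S2."""
    return 2 * omega_exp + S2_EXP


# Classic expectation: constant mass-specific work, so W ~ m^1.
omega_exp = omega_exponent(Fraction(1))
force_exp = force_exponent(omega_exp)

print(omega_exp)      # -1/3
print(force_exp)      # 2/3
print(force_exp - 1)  # -1/3, exponent of force relative to body weight
```

      The negative exponent of force relative to weight expresses the textbook conclusion that, under these assumptions, weight support becomes harder as size increases.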

      Now, as it so happens, I disagree with classic scaling theory on this point, and instead believe that there are good reasons to assume that muscle work output varies non-trivially with size. The authors can find a summary of the argument for this disagreement in the initial review, or in any of the following references:

      Labonte, D. A theory of physiological similarity for muscle-driven motion. PNAS, 2023, 120, e2221217120

      Labonte, D.; Bishop, P.; Dick, T. & Clemente, C. J. Dynamics similarity and the peculiar allometry of maximum running speed. Nat Comms., 2024, 15, 2181

      Labonte, D. & Holt, N. Beyond power limits: the kinetic energy capacity of skeletal muscle. J Exp Bio, 2024, 227, jeb247150

      Polet, D. & Labonte, D. Optimal gearing of musculoskeletal systems. Integr Org Biol, 2024, 64, 987-1006

      I am asking neither that the authors agree with the above references nor that they cite them. But I do expect that they critically discuss and justify their definition of kinematic similarity, its relation to expectation from classic scaling theory, and the implications for their claim that hovering is harder for small animals. I do note that the notion of "physiological similarity" introduced in the above references predicts a size-invariant angular velocity for small animals, that small animals should be able to do less mass-specific work, and that average muscle force output can grow with positive allometry even for isogeometric systems. These predictions appear to be consistent with the data presented by the authors.

      We agree with the reviewer that our null hypothesis was not clearly articulated in our previous version of the manuscript, and that this might have led to a misinterpretation of the merits and limitations of our study. In the revised manuscript, we therefore now explicitly introduce our null hypotheses in the Introduction (lines 120–125), we define these in the Methods section (lines 340–360), test these in the Results section (lines 511–517), and reflect on the results in the Discussion (lines 602–610). We thank the reviewer for pointing out this unclarity in our manuscript, because revising it clarified the study significantly. See our replies in the “Public Review” section for details.

      Minor points

      L56: This is somewhat incomplete and simplistic; to just give one alternative option, weight support with equivalent muscle effort could also be ensured by a change in gearing (see eg Biewener's work). It is doubtful whether weight support is a strong selective force, as any animal that can move will be able to support its weight. The impact of scaling on dynamics is thus arguably more relevant.

      We thank the reviewer for pointing out that our original sentence may be too simplistic. We now briefly mention alternative mechanisms (suggested by the reviewer) to provide more nuance (line 56-58).

      L58: I am not aware of any evidence that smaller animals have reduced the musculature dedicated to locomotion beyond what is expected from isometry; please provide a reference for this claim or remove it.

      We removed that claim.

      The authors use both isometry and geometric similarity. As they also talk about muscle, solely geometric similarity (or isogeometry) may be preferable, to avoid confusion with isometric muscle contractions.

      To avoid confusion, we now use “geometric similarity” wherever the use of isometry might be ambiguous.

      L86: negative allometry only makes sense if there is a justified expectation for isometry - I suggest to change to "The assumed increase in wingbeat frequency in smaller animals" or similar, or to clarify the kinematic similarity hypothesis.

      We edited the sentence as suggested.

      L320: This assertion is somewhat misleading. Musculoskeletal systems are unlikely to be selected for static weight support. Instead, they need to allow movement. Where movement is possible, weight support is trivially possible, and so weight support should rarely, if ever, be a relevant constraint. At most, the negative consequence of isometry on weight support would be that a larger fraction of the muscle mass needs to be active in larger animals to support the weight.

      We fully agree with the reviewer that musculoskeletal systems are unlikely to be selected for static loads, as the ability to move dynamically in the real world is crucial for survival. That said, we here look at hovering flight, which is far from static. In fact, hovering flight is among the energetically most costly movement patterns found in nature, due to the required high-frequency wingbeat motions (Dudley 2002). Rapid maneuvers are of course more power demanding, but hovering is a good proxy for this. For example, in fruit flies, maximum force production in rapid evasive maneuvers is only two times the force produced during hovering (Muijres et al., 2014).

      We agree with the reviewer that it is important to explicitly mention the differences in functional demands on the motor system in hovering and maneuvering flight, and thus we now do so in both the introduction and discussion sections (lines 116-118 and 762-765, respectively).

      Dudley, Robert. The biomechanics of insect flight: form, function, evolution. Princeton University Press, 2002.

      Muijres, F. T., et al. "Flies evade looming targets by executing rapid visually directed banked turns." Science 344.6180 (2014): 172-177.

      Reviewer #3 (Recommendations For The Authors):

      Throughout, check use of "constrains" vs. "constraints"

      Thank you for pointing this out. We have corrected these errors.

      Line 52 do you mean lift instead of thrust?

      We agree with the reviewer that the use of “thrust” might be confusing in the context of hovering flight, and thus we replaced “flapping-wing-based aerodynamic thrust-producing system” with the “flapping-wing-based propulsion system”. This way, we no longer use the word thrust in this context, and only use lift as the upward-directed force required for weight-support.

      Line 60 "face also constrains" wording

      Corrected.

      Line 79 Viscous forces only "dominate" at Re<1 and so this statement only refers to very very small insects which I suspect are far below the scale of the hoverflies considered (likely Re ~100) although maybe not for the smallest 3 mg ones?

      Indeed, viscous forces do not “dominate” force production at the Reynolds numbers of our flying insects. We thank the reviewer for pointing out this incorrect statement, which we corrected in the revised manuscript.

      Line 85 again thrust doesn't seem to be right

      Agreed. See reply 3.2.

      533 "maximized" should probably be "increased"

      We now use “increased”.

      Line 705-710 The new study by Darveau might help resolve this a bit because of the reliability of this relationship across and between orders. Darveau, C.-A. (2024). Insect Flight Energetics And the Evolution of Size, Form, And Function. Integrative And Comparative Biology icae028.

      We thank the reviewer for this highly relevant reference, which was unfortunately not included in the original manuscript. In connection with this work, we now further discuss the relationship between wing size allometry and deviations from the expected scaling of wingbeat frequency (lines 730-735).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This Tanzanian study focused on the relationship between human genetic ancestry, Mycobacterium tuberculosis complex (MTBC) diversity, and tuberculosis (TB) disease severity. The authors analyzed the genetic ancestry of 1,444 TB patients and genotyped the corresponding MTBC strains isolated from the same individuals. They found that the study participants predominantly possess Bantu-speaking genetic ancestry, with minimal European and Asian ancestry. The MTBC strains identified were diverse and largely resulted from introductions from South or Central Asia. Unfortunately, no associations were identified between human genetic ancestry, the MTBC strains, or TB severity. The authors suggest that social and environmental factors are more likely to contribute to TB severity in this setting.

      Strengths:

      In comparison to other studies investigating the role of human genetics in TB phenotypes, this study is relatively large, with more than 1,400 participants.

      The matched human-MTBC strain collection is valuable and offers the opportunity to address questions about human-bacterium co-evolution.

      Weaknesses:

      Although the authors had genome-wide genotyping and whole genome sequencing data, they only compared the associations between human ancestry and MTBC strains. Given the large sample size, they had the opportunity to conduct a genome-wide association study similar to that of Muller et al. (https://doi.org/10.1016/j.ygeno.2021.04.024).

      Thank you very much for taking the time to carefully review our manuscript and for your suggestions and comments. In another published study using the same cohort (https://doi.org/10.1101/2023.05.11.23289848), we performed a genome-wide association analysis between the genome-wide SNPs of the host and the genome-wide SNPs from the paired MTBC strains. In the current work we were interested in testing specifically whether host ancestry and pathogen genotype family, as well as their interaction, were associated with differences in disease severity, a clinical phenotype with direct consequences for both host and pathogen fitness. The study of Müller et al., referred to by the reviewer, investigates whether the MTBC strain families causing disease in two patient cohorts (South Africa and Ghana) were associated with particular human SNPs assessed genome-wide. In that study, clinical phenotypes were not assessed, and human ancestries, in a much broader sense than the ones used in our current study, were used as covariates. To leverage the genome-wide information and the clinical variables collected in our study, we have now added a genome-wide association analysis of all the human SNPs with disease severity measures while adjusting for covariates (age, sex, smoking, cough duration, socioeconomic status, history of previous TB, malnutrition, education level, and drug resistance status) and for human population stratification. Yet, no statistically significant associations were detected (L243-249).

      The authors tested whether human genetic ancestry is associated with TB severity. However, the basis for this hypothesis is unclear. The studies cited as examples all focused on progression to active TB (from a latent infection state), which should not be conflated with disease severity. It is difficult to ascertain whether the role of genetic ancestry in disease severity would be detectable through this study design, as some participants might simply have been sicker for longer before being diagnosed (despite the inquiry about cough duration). This delay in diagnosis would not be influenced solely by human genetics, which is the conclusion of the study.

      Evidence that mortality and natural recovery from TB vary by disease presentation spectrum comes from studies carried out before the introduction of anti-TB chemotherapy. Patients with mild disease presentation, as measured by radiology at the time of diagnosis, had higher odds of recovering naturally compared to those with advanced disease (doi: 10.5588/ijtld.23.0254, doi: 10.1164/arrd.1960.81.6.839). Given the deleterious effects on human fitness of an MTBC infection leading to symptomatic disease, we hypothesized that natural selection has acted on human traits underlying TB disease severity. If those traits are heritable, one would expect to find underlying genetic variation in human populations. In addition, because certain MTBC genotype families and human populations have co-existed for at least a few centuries to a few millennia, we hypothesized that some of that genetic variation could be related to human ancestry. We have added more details to the introduction to make our rationale clearer (L118-127). In our patient cohort, we observed a large variation in disease severity, using as proxies the TB-Score, X-ray score, and bacterial burden in sputa (Ct-value as determined with GeneXpert). However, the reviewer is absolutely correct that patients in our study are diagnosed at different stages of disease, which confounds our analysis. This is a limitation of our study that cannot be fully accounted for by including cough duration, as we also acknowledged in the manuscript (L343-346).

      Additionally, the study only included participants who attended the TB clinic.

      Yes, this is related to the previous point: our study only considers patients who felt ill enough to visit the TB clinic, potentially excluding patients with less severe disease, as acknowledged.

      Including healthy controls from the general population would have provided an interesting comparison to see if ancestry proportions differ.

      We agree that it would be interesting to compare the ancestries of healthy controls to the ancestries of TB patients from the same population. However, that would be especially informative with respect to TB susceptibility and would not necessarily inform disease severity traits and their underlying genetics. The similarity between the ancestry proportions of our cohort and those in publicly available genomic data from neighboring countries such as Kenya, Malawi and Mozambique suggests that there would be no major differences between TB patients and healthy controls.

      Although the authors suggest that social and environmental factors contribute to TB severity, only age, smoking, and HIV status were characterised in the study.

      Based on the comments of both reviewers, we added the following additional variables as covariates in the regression models: the socioeconomic status representing the ratio between the household income and the number of individuals in the household, malnutrition, the education level and whether it was a relapse/reinfection or a new case.

      Reviewer #2 (Public review):

      Summary:

      This manuscript reports the results of an observational study conducted in Dar es Salaam, Tanzania, investigating potential associations between genetic variation in M. tuberculosis and human host vs. disease severity. The headline finding is that no such associations were found, either for host / bacillary genetics as main effects or for interactions between them.

      Strengths:

      Strengths of the study include its large size and rigorous approaches to classification of genetic diversity for host and bacillus.

      Weaknesses:

      (1) There are some limitations of the disease severity read-outs employed: X-ray scores and Xpert cycle thresholds from sputum analysis can only take account of pulmonary disease. CXR is an insensitive approach to assessing 'lung damage', especially when converted to a binary measure. What was the basis for selection of Ralph score of 71 to dichotomise patients? If outcome measures were analysed as continuous variables, would this have been more sensitive in capturing associations of interest?

      Thank you very much for taking the time to carefully review our manuscript and for your suggestions and comments.  

      We recruited active TB patients with pulmonary TB disease who were sputum smear-positive and GeneXpert-positive. In this study we aimed at obtaining paired samples from both the patient and the strain, and in the current analysis we aimed at testing if human ancestry and its interaction with the strain genotype could explain differences in disease severity. It is often difficult to obtain microbiological cultures from extra-pulmonary cases, and including those cases would not have been possible at the scale of this cohort. We also believe that extra-pulmonary TB is of less relevance for the question we are addressing because, in exclusively extra-pulmonary cases, disease severity is not linked with bacterial transmission. However, extra-pulmonary TB can be extremely severe, and it would be very interesting to explore the potential role of human genetic variation underlying extra-pulmonary TB in future studies.

      As to the insensitivity of CXR to measure lung damage, we would argue that it depends on what is being assessed. As a rationale for the Ralph score, its inventors argue that, as in other grading methods, the proportion of affected lung and/or cavitation is important to assess severity. It has been described as a “validated method for grading CXR severity in adults with smear-positive pulmonary TB that correlates with baseline clinical and microbiological severity and response to treatment, and is suitable for use in clinical trials” (https://thorax.bmj.com/content/thoraxjnl/65/10/863.full.pdf). While the validation of the score is convincing in that study, and the score has been used in several TB studies and trials, the low proportion of HIV co-infections might have been a limitation. Indeed, as shown in our previous publication, in our cohort of patients, chest X-ray scores were significantly lower in HIV infected TB patients (https://doi.org/10.1371/journal.ppat.1010893). In the current analysis, regression analyses performed for the CXR severity and for the other severity measures did not include HIV co-infected patients.

      We obtained the same pattern of results using a continuous outcome. However, an assumption of linear regression was violated. The residuals were not normally distributed stemming from the bimodal distribution of the scores in our dataset. The threshold of 71 for the Ralph score has been used by others in previous studies; in its original description it has been suggested as the optimal cut-off point for predicting a positive sputum smear status after two months, which in turn has been shown to predict unfavorable outcomes (https://doi.org/10.1136/thx.2010.136242). Another study showed that a Ralph score higher than 71 was significantly associated with a longer duration of symptoms, higher clinical scores and a lower BMI (doi: 10.5603/ARM.2018.0032).

      (2) There is quite a lot of missing data, especially for TB scores - could this have introduced bias? This issue should be mentioned in the discussion.

      While we have a TB-score available for each patient, the chest X-ray score is missing for many patients. However, this is random and due either to the absence of an X-ray picture or to X-ray pictures of too poor quality for the radiologists to assess. When stating that there is a lot of missing data for the TB scores, we assume that the reviewer was referring to the “missing N” columns in Table 1. There, the number of observations missing in each of the disease severity measures actually relates to the explanatory variables (i.e. MTBC genotype and human ancestries). This table includes all patients that either had a bacterial genome available or a human genome/genotype (N = 1904). As an example for the TB-score as outcome variable, for 1471 patients the MTBC genotype was determined while it was missing for 433 patients. On the other hand, for X-ray scores, 177 had a severe X-ray score, 849 a mild one, and for 878 patients there was no X-ray score available. As for the Ct-value, despite the fact that the patients were recruited based on positive GeneXpert by the clinical team, these results were not always available to us.

      (3) The analysis adjusted for age, sex, HIV status, age, smoking and cough duration - but not for socio-economic status. This will likely be a major determinant of disease severity. Was adjustment made for previous TB (i.e. new vs repeat episode) and drug-sensitivity of the isolate? Cough duration will effectively be a correlate/consequence of more severe disease - thus likely highly collinear with disease severity read-outs - not a true confounder. How does removal of this variable from the model affect results? Data on socioeconomic status should be added to models, or if not possible then lack of such data should be noted as a limitation.

      Out of the 1904 patients that have either human or bacterial genomic data available, 48 were relapses (2.5%). The means of the disease severity measures suggest that relapses have a higher CXR score, but the TB-score and Ct-values did not differ. Based on the comments of both reviewers, we added the following additional variables as covariates to the regression models: the socioeconomic status representing the ratio between the household income and the number of individuals in the household, malnutrition examined by a doctor, the education level, whether it was a relapse/reinfection or a new case, and whether the causative strain had any resistance to any anti-TB drugs. The results did not change. Cough duration could also be a consequence of more severe disease, as pointed out by the reviewer. We now present the results excluding cough duration as a variable from the model; however, this also did not affect the results.

      (4) Recruitment at hospitals may have led to selection bias due to exclusion of less severe, community cases. The authors already acknowledge this limitation in the Discussion however.

      (5) Introduction: References refer to disease susceptibility, but the authors should also consider the influences of host/pathogen genetics on host response - both in vitro (PMIDs 11237411, 15322056) and in vivo (PMID 23853590). The last of these studies encompassed a broader range of ethnic variation than the current study, and showed associations between host ancestry and immune response - null results from the current study may reflect the relative genetic homogeneity of the population studied.

      We thank the reviewer for these suggestions which we have added to the introduction. 

      Reviewer #1 (Recommendations for the authors):

      Minor Comments:

      (1) The authors should be careful when using the term "Bantu" as opposed to "Bantu-speaking". (i.e. referring to the language group). The term is considered offensive in some settings.

      We thank the reviewer for raising this important concern; we have revised the term throughout the manuscript.

      (2) There are several "(Error! Reference source not found)" phrases in the place of references throughout the document.

      We thank the reviewer for pointing this out, this has been corrected in the revised version.

      (3) Please correct line 365: "... sequencing (WGS) the patient...." to "... sequencing (WGS) of the patient...."

      (4) The figures in the supplementary PDF are not numbered and some are cut-off (I think it is Supplementary Figure S2).

      This has been corrected in the revised version.

      Reviewer #2 (Recommendations for the authors):

      Typographical errors

      (1) There are multiple instances where references have not pulled through to the text, e.g. line 126 (Error! Reference source not found.)

      We thank the reviewer for pointing this out, this has been corrected in the revised version.

      (2) Line 239: have been show - have been shown?

      Thank you, this mistake has been corrected in the revised version.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      This manuscript by Tesmer and colleagues uses fiber photometry recordings, sophisticated analysis of movement, and deep learning algorithms to provide compelling evidence that activity in hypothalamic hypocretin/orexin neurons (HONs) correlates with net body movement over multiple behaviors. By examining projection targets, the authors show that hypocretin/orexin release differs in projection targets to the locus coeruleus and substantia nigra, pars compacta. Ablation of HONs does not cause differences in the power spectra of movements. The movement-tracking ability of HONs is independent of HON activity that correlates with blood glucose levels. Finally, the authors show that body movement is not encoded to the same extent in other neural populations.

      Strengths:

      The major strengths of the study are the combination of fiber photometry recordings, analysis of movement in head-fixed mice, and sophisticated classification of movement using deep learning algorithms. The experiments seem to be well performed, and the data are well presented, visually. The data support the main conclusions of the manuscript.

      We thank the reviewer for their supportive feedback.

      Weaknesses:

      The weaknesses are minor, mostly consisting of writing and data visualization throughout the manuscript. To some degree, it is already known that hypocretin/orexin neurons correlate with movement and arousal, although this manuscript studies this correlation with unprecedented sophistication and scale. It is also unfortunate that most of the experiments throughout the study were only performed in male mice. Taken together, this study is likely to be impactful to the field and our understanding of HONs across behavioral states.

      We agree that disentangling movement from arousal is an important aspect, and in the revised manuscript, we now include new data and analyses towards this (pupillometry to directly assess arousal, and multivariate analysis to assess contributions of arousal vs movement to HON activity). In addition, we now implement many of the reviewer’s recommendations regarding writing, data presentation, and visual clarity (see our replies in the “recommendations for authors” section).

      Reviewer #1 (Recommendations for the authors):

      Some recommendations for the authors:

      (1) The first sentence of the Introduction states: "Neural activity related to body movement recently received much attention." I would rephrase or clarify this statement, as neuroscientists have been studying neural activity related to body movement for decades.

      The reviewer is correct. Our intention was to highlight the resurgence of movement-related neurosciences enabled by modern techniques such as deep learning applied to video data (e.g. DeepLabCut, etc). The passage has been updated for clarity.

      (2) The Introduction also states that HONs orchestrate "consciousness and arousal." I would delete the word "consciousness," as consciousness represents a lofty, global concept that is challenging to define and quantify in humans, let alone mice.

      We used the word consciousness to be consistent with current literature on the function of the mouse hypothalamus (e.g. Nat Neurosci 2016 Feb;19(2):290-8). But we agree it is not necessary here, and so we followed the advice to delete it.

      (3) The authors state that HON dynamics were recorded while mice were head-fixed while on a running wheel. For clarity, it would be helpful to visualize this head-fixation in Figures 1A and 5B. It would also be helpful to clarify how certain behaviors (e.g. grooming, chewing) were performed and recorded while the mouse was head-fixed.

      In the revised manuscript, updated graphics with a head-fixed mouse have now been added to relevant figures. Representative RGB frames (colors representing sequential frames) of each behaviour have been added to Figure 2A.

      (4) In the legend for Figure 1A, the reference to Gonzalez et al. 2016 seems out of place (at least the reader should be informed why the text is referring to this previous study). Additionally, because the references are ordered by number instead of alphabetically, it would be more helpful to refer to a numbered reference rather than a name.

      Gonzalez et al. 2016 references the source of the AAV construct used in this figure. This has been moved to the methods. Following eLife formatting guidelines, references will be alphabetized upon publication.

      (5) In Figure 3F, it would be helpful to show visual validation that the HON-DTR method indeed ablates all HONs. This is depicted conceptually, but representative figures would be much more convincing.

      A representative histological slice is now included for both wild type (WT) and HON-DTR mice in the new Figure 4B.

      Reviewer #2 (Public review):

      Summary:

      Despite several methodological strengths, the major and highly significant drawback is the confound of arousal with movement. This confound is not resolved, so the results could be explained by previously established relationships between orexin and arousal/wakefulness.

      This is an excellent point, and we agree. To address this directly in the revised manuscript, we now include new data and analyses towards this (pupillometry to directly assess arousal, and multivariate analysis to assess contributions of arousal vs movement to HON activity).

      Strengths:

      The authors show that orexin neuron activity is associated with body movement and that this information is conveyed irrespective of the fasted state. They also report differences in different orexin target brain regions for orexin release during movement. This paper contains an impressive array of cutting-edge techniques to examine a very important brain system, the orexin-hypocretin system. The authors offer an original perspective on the function of this system. The authors showed that orexin neuron activity scales to some degree with the magnitude of body movement change; this is unaffected by a fasted state and seems to be somewhat unique to orexin neurons.

      The investigation of other genetically defined subcortical neuron populations to determine the specificity of findings is also a strength, as is the ability to quantify movement and use deep learning to classify specific behaviors adds sophistication to analysis. The authors also show heterogeneity in orexin projections to specific target nuclei, which is interesting.

      The authors "speculate that narcolepsy-cataplexy, caused by HON loss-of-function, is perhaps explained by oscillations into unwanted sleep-states and motor programs due to impaired control loops for wakefulness and movement". This is quite an interesting aspect of their work and deserving of further study.

      We thank the reviewer for their supportive feedback.

      Weaknesses:

      Despite the strengths, there are several major and minor weaknesses that detract significantly from the study.

      My main concern with this work is the confound of arousal with movement so that correlations with one might reflect a relationship instead with the other. The orexin system is well known to play an important role in arousal, with elevated activity of orexin neurons reported for waking and high arousal. Orexin signaling has also been strongly associated with motivation, which also is associated with arousal and movement. The authors offer no compelling evidence that the relationships they describe between different movements and orexin signaling do not simply reflect the known relationship between arousal and motivation.

      The authors could address this concern by including classical arousal measurements, eg, cortical EEG recorded simultaneously with movements. Often, EEG arousal occurs independently of movement, so this could provide one approach to disentangling this confound. The idea that orexin signaling plays a role in arousal rather than movement is supported by their finding that orexin lesions using the orexin-DTR mouse model did not impact movements. In contrast, prior lesion and pharmacologic studies have found that decreased orexin signaling significantly decreases arousal and waking.

      Another way they could test their idea would be to paralyze and respirate animals so that orexin activity could be recorded without movement. Alternatively, animals could be trained to remain motionless to receive a reward. Thus, there are several ways to test the overall hypothesis of this work that have not been examined here.

      The authors propose that "a simple interpretation of their results is that, via HON movement tracking, the brain creates a "wake up" signal in proportion to movement". This seems to argue for the role of the orexin system in arousal and motivation rather than in movement per se.

      Thank you. We agree that disentangling between arousal and movement is indeed critical. A classic approach is a multivariate analysis, wherein multiple simultaneously recorded “predictors” of HON activity – such as arousal and movement - can be directly compared. While EEG arousal is an option, another well-accepted metric for arousal is pupil diameter. Using n = 7 mice, we now simultaneously record HON activity, movement, running speed, pupil size fluctuations, and ocular movements:

      We then fit a partial least squares multivariate regression (a regression type more robust to collinearity) using the movement metric, pupil size, and ocular movements as predictors of orexin neuron activity. Consistent with previous publications, we found that pupil size alone has a positive correlation with hORX.GCaMP6s (~0.45). However, using a drop-one feature analysis in multivariate regression, we found that movement had the highest % contribution to statistically explaining orexin neuron activity. Here are the new results (which we now added as Fig. 7A-B).

      Author response image 1.

      Furthermore, we also expanded this analysis to incorporate the different frequencies found in HON dynamics, using empirical mode decomposition. We found that pupil size had a maximum correlation at lower HON frequencies than the movement metric, while ocular movements were maximally correlated in higher frequencies (now added as Fig. 7D,E).

      Overall, this analysis suggests that – while HONs encode both movement and arousal – arousal and movement do not always co-fluctuate at the same timescales, and their impacts on HONs can be disentangled in a number of ways. We now mention this in revised text on page 5.

      There are several studies that have examined the effect of orexin antagonist treatment in rodents on locomotor and other motor activities. These studies have largely found no consistent effect of antagonizing orexin signaling, especially at the OxR1 receptor, on simple motor activity. These studies are not referenced here but should be taken into account in the authors' conclusions.

      We agree. Prior studies found that orexin antagonism – or optogenetic silencing of HONs – evokes either reduced locomotion, or no effect on locomotor movements. We now added text and references to paragraph 4 of Discussion, summarising this.

      Figure 3, panel F: I understand HON-DTR is a validated model but a picture of HONs ablation is necessary, including pictures of HONs outputs ablation within the SNc and LC.

      A representative histological slice is now included for both wild type (WT) and HON-DTR mice in the new Figure 4B. Because HONs are only found in the hypothalamus, somatic deletion of HONs in this region will result in axonal degradation in output regions.

      The discussion lacks a more extensive paragraph on the distinct signal and role of Ox>SNc and Ox-LC projections.

      We now added sentences discussing potential implications of this to Discussion (middle of paragraph 4).

      Reviewer #2 (Recommendations for the authors):

      Minor weaknesses

      A very important movement in rodents is head orientation, especially given the limitation in ocular movement. However, this paper used a fixed head model which obviated this movement and did not attempt to analyze ocular movements.

      Analysing ocular movements is something we had not considered but is very easy to check using pupillometry. In n = 7 mice, we recorded both orexin neurons, and ocular movements captured through an infrared camera under constant lighting. Ocular movements had a small positive correlation with orexin neuron photometry (r = ~0.26). See response to the public review above.

      Author response image 2.

      The "HON" abbreviation is not commonly used for orexin neurons, and I suggest replacing that with a more well-known abbreviation.

      To the best of our knowledge, there is no universally agreed or best-known abbreviation for hypocretin/orexin neurons (we agree it would be nice if there was one!). “HONs” is a simple first letter abbreviation of hypocretin/orexin neurons, which acknowledges the two names for this peptide given by the original discoverers (de Lecea et al, and Sakurai et al, in 1998). Although this may not be the perfect abbreviation, we have kept it for now, also to be consistent with the large number (>10) of other published studies that recently used this abbreviation.

      The graphs showing Pearson's r values do not demonstrate a very strong correlation between neural activity and movement change; they also lack validation of genetic expression/ablation in some cases. The results would more strongly support the conclusions if statistically significant correlations could be demonstrated between activity and movement.

      We agree that a correlation of ~0.68 is probably not worthy of a “very strong” classification. While there is no universal ruleset for categorizing the strength of a correlation, we have toned down our language throughout the manuscript.

      Comment regarding statistical testing of correlations: we are cautious to stand behind correlation significance testing for large sample sizes (~48’000 photometry & video samples in a 40-minute session). In our case, correlations were always extremely significant p<0.0001. The reason for this is that correlation p-values become “too big to fail” (see Lin et al. 2013) with inflated sample size. We therefore refrain from commenting on p-values and rather report between or within-subjects statistical tests, or tests against zero. See four example experiments below.

      Author response image 3.

      Citation: Lin, M., Lucas, H. C., Jr & Shmueli, G. Research Commentary—Too Big to Fail: Large Samples and the p-Value Problem. Information Systems Research 24, 906–917 (2013).
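
      The "too big to fail" point can be illustrated with a minimal sketch. This is not the authors' analysis code: the function name, the deliberately weak correlation of 0.05, and the use of the Fisher z approximation (instead of an exact t-test) are assumptions made only for illustration. It shows how the p-value of a fixed, weak correlation collapses as the number of samples grows toward the ~48,000 samples of a 40-minute session.

```python
import math


def approx_corr_p_value(r: float, n: int) -> float:
    """Two-sided p-value for H0: rho = 0, via the Fisher z approximation."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must exceed 3 for the Fisher z approximation")
    # atanh(r) is approximately normal with standard error 1 / sqrt(n - 3).
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided tail probability of a standard normal variable.
    return math.erfc(abs(z) / math.sqrt(2.0))


weak_r = 0.05  # hypothetical, deliberately weak correlation
for n_samples in (100, 1_000, 48_000):
    print(n_samples, approx_corr_p_value(weak_r, n_samples))
```

      The correlation coefficient is identical in all three cases; only the sample size changes, which is why effect sizes are more informative than p-values for data sets of this size.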

      The rationale for looking at running speed, general movement, and specific types of nonlocomotor movements could be clarified and explained more thoroughly in the introduction. Why is it important to distinguish between locomotion (represented here with running) and all other movements? Presumably, this is because orexin is known to regulate arousal/locomotion. What evidence is there for orexin's role in other types of movements, which are being grouped together in Figure 1? This could be laid out in more detail in the Introduction. Relatedly, it is not very clear in the text whether the correlation between movement and orexin neuron activity includes movement related to running.

      The main focus of our paper is on movement in general (i.e. video pixel difference, described in Results and Methods). This movement metric includes everything captured by the video, it is agnostic to the type of movement or behaviour.  To connect this to some of the specific innate movements/behaviours typically studied in mouse literature (running, grooming, sniffing, etc), we also performed plots in Figure 2. We attempted to explain this better in revised section 1 of Results.
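
      For readers unfamiliar with pixel-difference metrics, a minimal sketch follows. The exact definition used in the manuscript is given in its Methods and is not reproduced here; the function name and the choice of summing absolute grayscale differences between consecutive frames are assumptions for illustration only.

```python
def movement_metric(frames):
    """Sum of absolute pixel differences between consecutive frames.

    `frames` is a sequence of equally sized 2D grayscale images,
    each given as a list of rows of pixel intensities.
    """
    metric = []
    for prev, curr in zip(frames, frames[1:]):
        diff = sum(
            abs(c - p)
            for row_prev, row_curr in zip(prev, curr)
            for p, c in zip(row_prev, row_curr)
        )
        metric.append(diff)
    return metric


# Three tiny hypothetical 2x2 frames: one pixel changes, then nothing moves.
example_frames = [
    [[0, 0], [0, 0]],
    [[0, 10], [0, 0]],
    [[0, 10], [0, 0]],
]
example_metric = movement_metric(example_frames)
```

      Such a metric is agnostic to what moved: any change in the image contributes, regardless of the behaviour that produced it.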

      What exactly is being correlated in Figure 1C (and throughout the rest of the paper?) Is this the average signal correlated with the average movement change over the entire recording time? This could be more explicitly stated in methods/results. The correlations themselves/p-values could be shown in addition to/instead of Pearson's r values. Are the correlations themselves significant? This would strengthen the claim that orexin activity is strongly coupled to the magnitude of body movement change. As another example, in Figure 2D, there are no statistics reported on the correlation between movement metric and average neural signal. In Figure 6G, orexin neuron activity is more strongly correlated with movement than MVe glut neurons, but are either of these correlations significant? The correlation between MVe glut activity and movement overall seems similar to that of orexin neurons, and may be worth noting more explicitly.

      Throughout the paper, we have recorded both neural activity (photometry) and movement at 20 Hz. This would generate, for example, 48’000 samples of photometry and movement from a 40-minute session. All the samples were used to calculate a Pearson’s r between variables. To clarify this, we now added the subtext “whole-session” to relevant figures, as well as a clarification in the methods.
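
      The arithmetic and the whole-session correlation can be sketched as follows. This is an illustrative sketch rather than the authors' pipeline; the function names and the toy input values are assumptions.

```python
import math


def session_samples(minutes: float, rate_hz: float) -> int:
    """Number of samples recorded in a session of the given length."""
    return int(round(minutes * 60 * rate_hz))


def pearson_r(x, y):
    """Pearson's r computed over every paired sample of two signals."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long signals with >= 2 samples")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# 40 minutes x 60 s/min x 20 samples/s
n_samples = session_samples(40, 20)

# Toy signals (hypothetical values) standing in for photometry and movement.
toy_photometry = [0.1, 0.4, 0.35, 0.8, 0.9]
toy_movement = [1.0, 3.0, 2.5, 6.0, 7.5]
toy_r = pearson_r(toy_photometry, toy_movement)
```

      In a whole-session analysis, each input list would hold all samples from one recording, so a single r value is obtained per session.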

      Individual experiment correlations for orexin neurons and MVe glut neurons were always significant p<0.0001, even after a Bonferroni multiple comparisons correction was applied to each population. See the “too big to fail” nature of correlation hypothesis testing above.

      It could be made clearer at the end of Figure 2 that orexin neuron activity is tracking the magnitude of movement change (shown in Figure 2D), not that it is encoding different types of movement.

      We intended for original Figure 2E to illustrate this concept, however this panel has caused a great deal of confusion to several readers and was perhaps ill conceived. We have replaced Figure 2E with a new panel more directly addressing the reviewer’s statement. We can construct three models where orexin neuron activity is predicted from the behavioral classification (sometimes called “one-hot” encoding) and/or the movement metric.

      Model 1 predicts orexin neuron activity using only a categorical predictor of behavioral state. Model 2 only uses the movement metric, and model 3 allows a different movement-metric correlation within each behavioral state. We can compare these models using AIC (Akaike Information Criterion) which is a point estimate. While the most complex model 3 was the best, model 2 was much closer to model 3 than model 1. Similarly, model 2 was much better than model 1. From this we conclude that the magnitude of movement change is a more powerful predictor than behavioral state (“type of movement”). This is now Figure 2E.
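
      The logic of this three-model comparison can be sketched on synthetic data. Everything below is an assumption for illustration and not the authors' code or data: the state names, the movement ranges, the noise level, and the least-squares form of AIC (n ln(RSS/n) + 2k, which drops a constant shared by all models). The synthetic signal is built to depend on movement only, so the outcome is true by construction; the sketch shows the mechanics of the comparison, not a result.

```python
import math
import random


def rss_mean(y):
    """Residual sum of squares around the mean (intercept-only fit)."""
    m = sum(y) / len(y)
    return sum((v - m) ** 2 for v in y)


def rss_line(x, y):
    """Residual sum of squares of a one-predictor least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


def gaussian_aic(rss, n, k):
    """AIC for a Gaussian least-squares fit, up to an additive constant."""
    return n * math.log(rss / n) + 2 * k


rng = random.Random(0)
# Hypothetical behavioural states with different typical movement ranges.
ranges = {"rest": (0.0, 1.0), "groom": (0.5, 2.0), "run": (1.0, 4.0)}
data = {}
for state, (low, high) in ranges.items():
    x = [rng.uniform(low, high) for _ in range(200)]
    y = [m + rng.gauss(0.0, 0.2) for m in x]  # activity tracks movement
    data[state] = (x, y)

all_x = [v for x, _ in data.values() for v in x]
all_y = [v for _, y in data.values() for v in y]
n = len(all_y)
n_states = len(data)

# Model 1: one mean per state (+1 parameter for the noise variance).
aic_state = gaussian_aic(sum(rss_mean(y) for _, y in data.values()), n, n_states + 1)
# Model 2: a single line on the movement metric.
aic_move = gaussian_aic(rss_line(all_x, all_y), n, 2 + 1)
# Model 3: a separate line within each state.
aic_both = gaussian_aic(sum(rss_line(x, y) for x, y in data.values()), n, 2 * n_states + 1)

print(aic_state, aic_move, aic_both)
```

      Lower AIC indicates a better trade-off between fit and complexity, so the gaps between the three values play the role of the model comparison described above.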

      It would be interesting to see the raw movement metric data as shown in Figures 1 and 2 in the DTR mice to show that ablating orexin neurons does not impair the movement profile seen in Figures 1 and 2.

      The requested visualization has been added to Figure 4B.

      Validation that orexin was selectively ablated in these mice would be ideal.

      Histology (see response to public review) was added to a new Figure 4B.

      Figure 4A - OxLight expression in SNc does not look very robust.

      Please note this is a membrane-targeted indicator; the staining this produces is thus much weaker than that of cytosolic indicators such as the calcium indicator GCaMP.

      Figure 4 - It would be beneficial to see the same correlations that were done in Figures 1 and 2 to show OxLight activity vs. movement metric. Are they correlated?

      Individual traces had significant correlations with OxLight and movement, and the population averages revealed similar trends:

      Author response image 4.

      Figure 6B - Targeting of MVe neurons does not look very specific. The sample size for orexin-targeted mice should be re-stated in the figure legend for clarity.

      Legend has been updated to clarify n = 15 for orexin targeted mice.

      Some citations didn't seem to match what was being referenced in the text. Similarly, in the legend for Figure 1C, the statistics do not match what is reported in the text. In Figure 1, the sample size is not noted in the text. When referring to running in Figure 1, is this referring to running speed? Perhaps the language could be more consistent.

      These typos (due to a rounding error) in the legend and text have been corrected. Sample size has been added to the text, and we have changed Figure 1D to clarify we are referring to running speed. We moved some citations to improve clarity.

      Methods - where were Cre mice obtained from?

      Sources now better referenced in Methods (JAX or Parlato et al).

      Figure 1, panel C: The authors compared Pearson's r-coefficient results for each animal and for each variable. However, it would be interesting to show the correlation curves for each variable as well here. Also, there is mention of a strong correlation but it is unclear whether these correlations are significant.

      See below for an example mouse.

      Author response image 5.

      Figure 3, panel F: I understand HON-DTR is a validated model, but a picture of orexin ablation is necessary, including pictures of orexin fiber ablation within the SNc and LC.

      See our reply to the public review above.

      Figure 5, Panel A: Same comment as Figure 1, panel C.

      We have similarly clarified the panel and legend.

      Page 4: The authors mention "Within the 1st and 4th quartile of blood glucose, movement-HON correlations were not significantly different." Please add the figures.

      The requested plot has been added to Figure 6, panel G.

      Reviewer #3 (Public review):

      Summary

      The study presents an investigation into how hypothalamic orexin neurons (HONs) track body movement with high precision. Using techniques including fiber photometry, video-based movement metrics, and empirical mode decomposition (EMD), the authors demonstrate that HONs encode net body movement consistently across a range of behaviors and metabolic states. They compare the ability of HONs to track body movement with that of other subcortical neural populations, and thereby distinguish HON activity from the activity of those populations.

      Strengths:

      The study characterizes HON activity as a key indicator of movement and arousal, and this method may have potential implications for understanding sleep disorders, energy regulation, and brain-body coordination. Overall, I think this is a very interesting story, with novel findings and implications about sensorimotor systems in animals. The manuscript is clearly written and the evidence presented is rigorous. The conclusions are well supported by experimental data with clear statistical analyses.

      We thank the reviewer for their supportive feedback.

      Weaknesses/suggestions:

      There are a couple of issues I think the authors could address to make the paper better and more complete:

      (1) The study primarily focuses on steady-state behaviors. It would be interesting if the authors' current dataset allows analyses of HON dynamics during transitions between behavioral states (e.g., resting to running or grooming to sniffing). This could provide additional insights into how HONs adapt to rapid changes in body movement.

      This is a fantastic idea, and easy to check using our classification CNN. We identified the six most frequent behavioral transitions and plotted them in Figure 2H. HONs show rapid dynamics in activity aligned with behavioral changes.

      These changes are very similar to the movement magnitude along these transitions, which is now also plotted in Figure 2G.
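
As a hedged illustration of how the most frequent behavioral transitions could be extracted from per-frame classifier output, the sketch below counts state changes in a label sequence. The function name and the example labels are hypothetical and are not taken from the authors' code.

```python
from collections import Counter


def top_transitions(labels, k=6):
    """Return the k most frequent (from_state, to_state) changes in a label sequence."""
    # Keep only consecutive pairs in which the behavior actually changes.
    pairs = [(a, b) for a, b in zip(labels, labels[1:]) if a != b]
    return Counter(pairs).most_common(k)


# Hypothetical per-frame labels; the real classes come from the authors' CNN.
frames = ["rest", "rest", "run", "run", "rest", "groom",
          "rest", "run", "sniff", "rest", "run"]

for (src, dst), count in top_transitions(frames, k=3):
    print(f"{src} -> {dst}: {count}")
```

Neural activity and movement magnitude could then be aligned to the frame index of each counted transition.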

      (2) Given the established role of HONs in arousal and wakefulness, the study could further investigate how movement-related HON dynamics interact with arousal states. For example, does HON encoding of movement differ during sleep versus wakefulness?

      To further investigate how movement encoding interacts with arousal, we now include quantification and analysis of pupil-linked arousal (see new Figure 7). We agree it would be interesting to look at what happens during sleep, especially REM sleep, when some HONs are thought to be active while there is little or no body movement, but this is beyond the scope of the present study.

      (3) Although HON ablation experiments suggest that HONs do not shape movement frequency profiles, it would be more compelling if the authors could investigate whether HONs contribute to specific types of movements (e.g., fine motor vs. gross motor movements) or modulate movement initiation thresholds.

      We performed this analysis using the k-means classifier for small/large movements. Consistent with previous results, we found no significant effect (p = 0.2767) of genotype on the frequency of identified small (fine) or large (gross) movement clusters. This plot has been added to Figure 4E.

      (4) The heterogeneous movement-related orexin dynamics observed in the LC and SNc raise intriguing questions about the circuit-level mechanisms underlying these differences. Optogenetic or chemogenetic manipulation of these projections could validate the functional implications of these dynamics.

      We agree. We now discuss some implications of this in the revised Discussion (paragraph 4). Please note that previous work already demonstrated that orexin action in the SNc can produce locomotion (referenced in the paragraph), though we agree that further work would be valuable.

      Reviewer #3 (Recommendations for the authors):

      Additional feedback:

      (1) Figure 1C: the individual data points are hard to track or see. Consider using a larger marker face to help data visualization. Similar issues can be found in Figures 2C, 2E, 5E, 6C, 6F, and 6G.

      The thickness of the lines and scatterplot points has been increased.

      (2) First Section of Results: the authors claim to use a deep-learning network to automatically classify video recordings into five distinct behaviors. However, several issues need to be addressed here:

      a. In Results, the corresponding sentence lacks a reference to the Methods Section.

      Reference has been added to the text.

      b. In Methods, the description of the CNN model is quite limited, lacking many basic, necessary components including necessary references to published papers, the model training, characterization (only an overall accuracy is not enough), as well as dataset definition, preparation, augmentation (if any), etc.

      We have expanded the methods section regarding the CNN model.

      (3) First Section of Results: in the second paragraph, the authors claim that "Overall, these results reveal HON population activity precisely tracks a general degree of body movement across recorded behaviors." This is not accurate. To indicate that HONs activity tracks the general degree of body movement across behavior states, they need to further show that behavioral states with similar levels of movement metrics can be differentiated via HON activities. However, as they showed in Figure 2D, some behaviors with similar values of movement metric do not seem to be easily discerned by HON activity levels.

      We agree with you, and this is also what we originally intended to convey – now reworded for clarity.

      (4) Technical issue: Figures 3B, 3C, 3G, using local regression to plot the solid lines makes them touch negative values, which does not make sense for "power proportion" (this quantity is always non-negative).

      This is a good point. To fix this, we first log-transformed the power metric, then performed a local regression, and used the link function to transform the model predictions back to %-units for visualization. This has been noted in the methods.
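
The idea in this response can be sketched as follows: smooth the log of the quantity, then back-transform, so the fitted curve cannot dip below zero. This is only a minimal illustration under stated assumptions. The tricube-weighted local linear smoother, the `frac` bandwidth, the small `eps` offset, and the example data are all hypothetical choices, and plain `exp` is used here as the inverse of the log transform; the authors' exact procedure may differ.

```python
import math


def local_linear(x, y, x0, frac=0.5):
    """Tricube-weighted local linear fit evaluated at x0 (a LOESS-style smoother)."""
    n = len(x)
    k = max(2, math.ceil(frac * n))
    dists = sorted(abs(xi - x0) for xi in x)
    # Bandwidth: slightly beyond the k-th nearest point so its weight stays positive.
    h = dists[k - 1] * 1.0001 + 1e-12
    w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


def smooth_nonnegative(x, y, grid, frac=0.5, eps=1e-6):
    """Smooth log(y + eps) and back-transform with exp, so predictions stay positive."""
    log_y = [math.log(yi + eps) for yi in y]
    return [math.exp(local_linear(x, log_y, g, frac)) for g in grid]


# Hypothetical "power proportion" values (%) that decay toward zero.
freq = [float(i) for i in range(10)]
power = [40.0, 20.0, 8.0, 2.0, 0.5, 0.1, 0.05, 0.02, 0.01, 0.01]

fitted = smooth_nonnegative(freq, power, freq)
print([round(v, 3) for v in fitted])
```

Because the back-transform is an exponential, every fitted value is positive by construction, whatever the local fit does on the log scale.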

      (5) Figure 3G: For a better comparison, consider combining the two plots into a single plot.

      The two plots have been merged as shown in Figure 4C.

      (6) Figure 5E: For a better data visualization, the current pair of plots can be consolidated into one single plot where the x-axis is Move and the y-axis is dGlu. In this way, it is easier to understand and the orthogonality as claimed in the manuscript can be more apparent.

      The requested plot has been added as Figure 6F.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      This is a new and important system that can efficiently train mice to perform a variety of cognitive tasks in a flexible manner. It is innovative and opens the door to important experiments in the neurobiology of learning and memory. 

      Strengths: 

      Strengths include: high n's, a robust system, task flexibility, comparison of manual-like training vs constant training, circadian analysis, comparison of varying cue types, long-term measurement, and machine teaching. 

      Weaknesses: 

      I find no major problems with this report. 

      Minor weaknesses: 

      (1)  Line 219: Water consumption per day remained the same, but number of trials triggered was more as training continued. First, is this related to manual-type training? Also, I'm trying to understand this result quantitatively, since it seems counter-intuitive: I would assume that with more trials, more water would be consumed since accuracy should go up over training (so more water per average trial). Am I understanding this right? Can the authors give more detail or understanding to how more trials can be triggered but no more water is consumed despite training? 

      Thanks for the comment. We would like to clarify the phenomenon described in Line 219: as training advanced, the number of trials triggered by mice per day gradually decreased (rather than increased, as stated in the comment) for both the manual and autonomous groups of mice (Fig. 2H left). Performance, as you mentioned, improved over time (Fig. 2D and 2E), leading to an increased probability of obtaining water and thus relatively stable daily water intake (Fig. 2H middle). We believe the stable daily intake is the minimum amount of water required by the mice under the conditions of autonomous behavioral training. To make the statement clearer, we indicated the corresponding figure numbers in the text.

      Results “… As shown in Fig. 2H, autonomous training yielded a significantly higher number of trials/day (980 ± 25 vs. 611 ± 26, Fig. 2H left) and a larger volume of water consumption/day (1.65 ± 0.06 vs. 0.97 ± 0.03 ml, Fig. 2H middle), which resulted in a monotonic increase of body weight that was even comparable to the free water group (Fig. 2H right). In contrast, the body weight in the manual training group experienced a sharp drop at the beginning of training and was constantly lower than that of the autonomous group throughout the training stage (Fig. 2H right).”

      (2) Figure 2J: The X-axis should have some label: at least "training type". Ideally, a legend with colors can be included, although I see the colors elsewhere in the figure. If a legend cannot be added, then the color scheme should be explained in the caption.

      Thanks for the suggestion. The labels with corresponding colors for x-axis have been added for Fig. 2J.

      (3) Figure 2K: What is the purple line? I encourage a legend here. The same legend could apply to 2J.

      Thanks for the suggestion. The legend has been added for Fig. 2K.

      (4) Supplementary Figure S2 D: I do not think the phrase "relying on" is correct. Instead, I think "predicted by" or "correlating with" might be better. 

      We thank the reviewer for the valuable suggestion. The phrase has been changed to ‘predicted by’ for better suitability.

      Figure S2 “(D), percentage of trials significantly predicted by different regressors during task learning. …”

      Reviewer #2 (Public review): 

      Summary: 

      The manuscript by Yu et al. describes a novel approach for collecting complex and different cognitive phenotypes in individually housed mice in their home cage. The authors report a simple yet elegant design that they developed for assessing a variety of complex and novel behavioral paradigms autonomously in mice. 

      Strengths: 

      The data are strong, the arguments are convincing, and I think the manuscript will be highly cited given the complexity of behavioral phenotypes one can collect using this relatively inexpensive ($100/box) and high throughput procedure (without the need for human interaction). Additionally, the authors include a machine learning algorithm to correct for erroneous strategies that mice develop which is incredibly elegant and important for this approach as mice will develop odd strategies when given complete freedom. 

      Weaknesses:

      (1) A limitation of this approach is that it requires mice to be individually housed for days to months. This should be discussed in depth. 

      Thank you for raising this important point. We agree that the requirement for individual housing of mice during the training period is a limitation of our approach, and we appreciate the opportunity to discuss this in more depth. In the manuscript, we added a section to the Discussion to address this limitation, including the potential impact of individual housing on the mice, the rationale for individual housing in our study, and efforts or alternatives to mitigate its effects.

      Discussion “… Firstly, our experiments were confined to single-housed mice; single housing is known to influence murine behavior and physiology, potentially affecting social interaction and stress levels [76]. In our study, individual housing was necessary to ensure precise behavioral tracking, eliminate competitive interactions during task performance, and maintain consistent training schedules without disruptions from cage-mate disturbances. However, the potential of group-housed training has been explored with technologies such as RFID [28,29,32–34] to distinguish individual mice, potentially improving training efficiency and facilitating research on social behaviors [77]. Notably, it has been shown that simultaneous training of group-housed mice, without individual differentiation, can still achieve criterion performance [25].”

      (2) A major issue with continuous self-paced tasks such as the autonomous d2AFC used by the authors is that the inter-trial intervals can vary significantly. Mice may do a few trials, lose interest, and disengage from the task for several hours. This is problematic for data analysis that relies on trial duration to be similar between trials (e.g., reinforcement learning algorithms). It would be useful to see the task engagement of the mice across a 24-hour cycle (e.g., trials started, trials finished across a 24-hour period) and approaches for overcoming this issue of varying inter-trial intervals. 

      Thank you for your insightful comment regarding the variability in inter-trial intervals and its potential impact on data analysis. We agree that this is an important consideration for continuous self-paced tasks.

      In our original manuscript, we showed the general task engagement across the 24-hour cycle (Fig. 2K), which revealed two peaks of engagement during the dark cycle and relatively fewer trials during the light cycle. To facilitate analyses requiring consistent trial durations, we defined trial blocks as sequences between two no-response trials. Notably, approximately 66.6% of trials occurred within blocks of >5 consecutive trials (Fig. 2L), which may be particularly suitable for such analyses.

      In the revised manuscript, we also added the analysis of the histogram of inter-trial-interval for both the autonomous and manual training paradigms in HABITS (Fig. S2H), which shows that around 55.2% and 77.5% of the intervals are less than 2 seconds in autonomous and manual training, respectively.
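
The two summary statistics described above (the share of trials falling in blocks of more than 5 consecutive trials, and the share of inter-trial intervals under 2 seconds) can be sketched as follows. This is an illustrative reading, not the authors' analysis code: the function names and example data are hypothetical, leading and trailing runs are treated as blocks, and an inter-trial interval is assumed to run from the end of one trial to the start of the next.

```python
def split_blocks(responded):
    """Lengths of runs of responded trials; a no-response trial (False) ends a block."""
    blocks, current = [], 0
    for r in responded:
        if r:
            current += 1
        elif current:
            blocks.append(current)
            current = 0
    if current:
        blocks.append(current)
    return blocks


def fraction_in_long_blocks(responded, min_len=6):
    """Fraction of responded trials that fall in blocks of at least min_len trials."""
    blocks = split_blocks(responded)
    total = sum(blocks)
    return sum(b for b in blocks if b >= min_len) / total if total else 0.0


def short_iti_fraction(trial_starts, trial_ends, threshold_s=2.0):
    """Fraction of inter-trial intervals (previous end to next start) below threshold_s."""
    itis = [s - e for s, e in zip(trial_starts[1:], trial_ends[:-1])]
    return sum(i < threshold_s for i in itis) / len(itis) if itis else 0.0


# Hypothetical session: blocks of 8, 2, and 6 trials separated by no-response trials.
responded = [True] * 8 + [False] + [True] * 2 + [False, False] + [True] * 6
print(split_blocks(responded), fraction_in_long_blocks(responded))

# Hypothetical trial times in seconds; the last trial follows a long pause.
starts = [0.0, 5.5, 11.0, 3600.0]
ends = [4.0, 10.0, 15.0, 3604.0]
print(short_iti_fraction(starts, ends))
```

Restricting an analysis to the long blocks is one way to obtain trial sequences with comparable timing without discarding the rest of the session.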

      Results “… We found that more than two-thirds of the trials were done in >5-trial blocks (Fig. 2L left), which resulted in more than 55% of the trials having an inter-trial interval of less than 2 seconds (Fig. S2H).”

      Regarding approaches to mitigate the issue of varying inter-trial intervals, we observed that manual training (i.e., manually transferring mice to HABITS for ~2 hr/day) in Fig. S2H resulted in more trials with short inter-trial intervals, suggesting that constrained access time promotes task engagement and reduces interval variability. Fig. 2L also indicated that the average correct rate increased and the early-lick rate decreased as block length increased. Constraining access time could therefore be valuable for studies where consistent trial timing is critical. In the context of our study, we could, for example, introduce a light to serve as a cue that prompts the animals to engage during a fixed period each day.

      Discussion “… In contrast, the self-paced nature of autonomous training may permit greater variability in attentional engagement [83] and inter-trial intervals, which could be problematic for data analysis relying on consistent intervals and/or engagement. Future studies should explore how controlled contextual constraints enhance learning efficiency and whether incorporating such measures into HABITS could optimize its performance.”

      (3) Movies - it would be beneficial for the authors to add commentary to the video (hit, miss trials). It was interesting watching the mice but not clear whether they were doing the task correctly or not. 

      Thanks for the reminder. We have added subtitles to both videos. Since Supplementary Video 1 was not recorded with sound, the correctness of the trials was hard to judge. We replaced it with another video with clear sound recordings and added detailed commentary in the subtitles.

      (4) The strength of this paper (from my perspective) is the potential utility it has for other investigators trying to get mice to do behavioral tasks. However, not enough information was provided about the construction of the boxes, interface, and code for running the boxes. If the authors are not willing to provide this information through eLife, GitHub, or their own website then my evaluation of the impact and significance of this paper would go down significantly. 

      Thanks for this important comment. We would like to clarify that the construction methods, GUI, code for our system, PCB and CAD files (newly uploaded) have already been made publicly available on https://github.com/Yaoyao-Hao/HABITS. Additionally, we have open-sourced all the code and raw data for all training protocols (https://doi.org/10.6084/m9.figshare.27192897). We will continue to maintain these resources in the future.

      Minor concerns: 

      (5) Learning rate is confusing for Figure 3 results as it actually refers to trials to reach the criterion, and not the actual rate of learning (e.g., slope).

      Thanks for pointing this out. The ‘learning rate’ which refers to trial number to reach criterion has been changed to ‘the number of trials to reach criterion’.

      Reviewer #3 (Public review): 

      Summary: 

      In this set of experiments, the authors describe a novel research tool for studying complex cognitive tasks in mice, the HABITS automated training apparatus, and a novel "machine teaching" approach they use to accelerate training by algorithmically providing trials to animals that provide the most information about the current rule state for a given task. 

      Strengths: 

      There is much to be celebrated in an inexpensively constructed, replicable training environment that can be used with mice, which have rapidly become the model species of choice for understanding the roles of distinct circuits and genetic factors in cognition. Lingering challenges in developing and testing cognitive tasks in mice remain, however, and these are often chalked up to cognitive limitations in the species. The authors' findings, however, suggest that instead, we may need to work creatively to meet mice where they live. In some cases, it may be that mice may require durations of training far longer than laboratories are able to invest with manual training (up to over 100k trials, over months of daily testing) but the tasks are achievable. The "machine teaching" approach further suggests that this duration could be substantially reduced by algorithmically optimizing each trial presented during training to maximize learning. 

      Weaknesses: 

      (1) Cognitive training and testing in rodent models fill a number of roles. Sometimes, investigators are interested in within-subjects questions - querying a specific circuit, genetically defined neuron population, or molecule/drug candidate, by interrogating or manipulating its function in a highly trained animal. In this scenario, a cohort of highly trained animals that have been trained via a method that aims to make their behavior as similar as possible is a strength. 

      However, often investigators are interested in between-subjects questions - querying a source of individual differences that can have long-term and/or developmental impacts, such as sex differences or gene variants. This is likely to often be the case in mouse models especially, because of their genetic tractability. In scenarios where investigators have examined cognitive processes between subjects in mice who vary across these sources of individual difference, the process of learning a task has been repeatedly shown to be different. The authors do not appear to have considered individual differences except perhaps as an obstacle to be overcome. 

      The authors have perhaps shown that their main focus is highly-controlled within-subjects questions, as their dataset is almost exclusively made up of several hundred young adult male mice, with the exception of 6 females in a supplemental figure. It is notable that these female mice do appear to learn the two-alternative forced-choice task somewhat more rapidly than the males in their cohort.

      Thank you for your insightful comments and for highlighting the importance of considering both within-subject and between-subject questions in cognitive training and testing in rodent models. We acknowledge that our study primarily focused on highly controlled within-subject questions. However, the datasets we provided did show preliminary evidence for the ‘between-subject’ questions. Key observations include:

      The large variability in learning rates among mice observed in Fig. 2I;

      The overall learning rate difference between male and female subjects (Fig. 2D vs. Fig. S2G);

      The varying nocturnal behavioral patterns (Fig. 2K), etc.

      We recognize the value of exploring between-subject differences in mouse models and have discussed this in more detail in the Discussion.

      Discussion “Our study was designed to standardize behavior for the precise interrogation of neural mechanisms, specifically addressing within-subject questions. However, investigators are often interested in between-subject differences—such as sex differences or genetic variants—which can have long-term behavioral and cognitive implications [72,74]. This is particularly relevant in mouse models due to their genetic tractability [75]. Although our primary focus was not on between-subject differences, the dataset we generated provides preliminary evidence for such investigations. Several behavioral readouts revealed individual variability among mice, including large disparities in learning rates across individuals (Fig. 2I), differences in overall learning rates between male and female subjects (Fig. 2D vs. Fig. S2G), variations in nocturnal behavioral patterns (Fig. 2K), etc.”

      (2) Considering the implications for mice modeling relevant genetic variants, it is unclear to what extent the training protocols and especially the algorithmic machine teaching approach would be able to inform investigators about the differences between their groups during training. For investigators examining genetic models, it is unclear whether this extensive training experience would mitigate the ability to observe cognitive differences, or select the animals best able to overcome them - eliminating the animals of interest. Likewise, the algorithmic approach aims to mitigate features of training such as side biases, but it is worth noting that the strategic uses of side biases in mice, as in primates, can benefit learning, rather than side biases solely being a problem. However, the investigators may be able to highlight variables selected by the algorithm that are associated with individual strategies in performing their tasks, and this would be a significant contribution.

      Thank you for the insightful comments. We acknowledge that the extensive training experience, particularly through the algorithmic machine teaching approach, could potentially influence the ability to observe cognitive differences between groups of mice with relevant genetic variants. However, our study design and findings suggest that this approach can still provide valuable insights into individual differences and strategies used by the animals during training. First, the behavioral readouts (including learning rate, engagement pattern, etc.) mentioned above could reveal a number of differences among mice. Second, detailed modelling analysis (with logistic regression) could further dissect the strategies that mice use over the course of training (Fig. S2B). We have highlighted some variables selected by the regression that are associated with individual strategies in performing the tasks (Fig. S2C), and these strategies could differ between the manual and autonomous training groups (Fig. S2D). We included these comments in the Discussion for further clarification.

      Discussion “… Furthermore, a detailed logistic regression analysis dissected the strategies mice employed during training (Fig. S2B). Notably, the regression identified variables associated with individual task-performance strategies (Fig. S2C), which also differed between manually and autonomously trained groups (Fig. S2D). Thus, our system could facilitate high-throughput behavioral studies exploring between-subject differences in the future.”

      (3) A final, intriguing finding in this manuscript is that animal self-paced training led to much slower learning than "manual" training, by having the experimenter introduce the animal to the apparatus for a few hours each day. Manual training resulted in significantly faster learning, in almost half the number of trials on average, and with significantly fewer omitted trials. This finding does not necessarily argue that manual training is universally a better choice because it leads to more limited water consumption. However, it suggests that there is a distinct contribution of experimenter interactions and/or switching contexts in cognitive training, for example by activating an "occasion setting" process to accelerate learning for a distinct period of time. Limiting experimenter interactions with mice may be a labor-saving intervention, but may not necessarily improve performance. This could be an interesting topic of future investigation, of relevance to understanding how animals of all species learn.

      Thank you for your insightful comments. We agree that the finding that manual training led to significantly faster learning compared to self-paced training is both intriguing and important. One possible reason is the limited duration of engagement provided by the experimenter in the manual training case, which forced the mice to concentrate more on the trials (and thus omit fewer trials) than in autonomous training. Your suggestion that experimenter interactions might activate an "occasion setting" process is particularly interesting. In the context of our study, we could, for example, introduce a light serving as a cue that prompts the animals to engage; when the light is off, the task would no longer be accessible to the mice, simulating the manual training situation. We agree that this could be an interesting topic for future investigation that might create a more conducive environment for learning, thereby accelerating the learning rate.

      Discussion “… Lastly, while HABITS achieves criterion performance in a similar or even smaller number of days compared to manual training, it requires more trials to reach the same learning criterion (Fig. 2G). We hypothesize that this difference in trial efficiency may stem from the constrained engagement duration imposed by the experimenter in manual training, which could compel mice to focus more intensely on task execution, resulting in fewer trial omissions (Fig. 2F). In contrast, the self-paced nature of autonomous training may permit greater variability in attentional engagement [83] and inter-trial intervals, which could be problematic for data analysis relying on consistent intervals and/or engagement. Future studies should explore how controlled contextual constraints enhance learning efficiency and whether incorporating such measures into HABITS could optimize its performance.”

      Reviewer #2 (Recommendations for the authors):

      As I mentioned in the weaknesses, I did not see code or CAD drawings for their home cages and how these interact with a computer.

      Thanks for the comment. We would like to clarify that the construction methods, GUI, code for our system, PCB and CAD files (newly uploaded) have already been made publicly available on https://github.com/Yaoyao-Hao/HABITS.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The authors use electrophysiological and behavioral measurements to examine how animals could reliably determine odor intensity/concentration across repeated experiences. Because stimulus repetition leads to short-term adaptation evidenced by reduced overall firing rates in the antennal lobe and firing rates are otherwise concentration-dependent, there could be an ambiguity in sensory coding between reduced concentration or more recent experience. This would have a negative impact on the animal's ability to generate adaptive behavioral responses that depend on odor intensities. The authors conclude that changes in concentration alter the constituent neurons contributing to the neural population response, whereas adaptation maintains the 'activated ensemble' but with scaled firing rates. This provides a neural coding account of the ability to distinguish odor concentrations even after extended experience. Additional analyses attempt to distinguish hypothesized circuit mechanisms for adaptation but are inconclusive. A larger point that runs through the manuscript is that overall spiking activity has an inconsistent relationship with behavior and that the structure of population activity may be the more appropriate feature to consider.

      To my knowledge, the dissociation of effects of odor concentration and adaptation on olfactory system population codes was not previously demonstrated. This is a significant contribution that improves on any simple model based on overall spiking activity. The primary result is most strikingly supported by visualization of a principal components analysis in Figure 4. However, there are some weaknesses in the data and analyses that limit confidence in the overall conclusions.

      We thank the reviewer for evaluating our work and highlighting its strengths and deficiencies. We have revised the manuscript with expanded behavioral datasets and additional analyses that we believe convincingly support our conclusion. 

      (1) Behavioral work interpreted to demonstrate discrimination of different odor concentrations yields inconsistent results. Only two of the four odorants follow the pattern that is emphasized in the text (Figure 1F). Though it's a priori unlikely that animals are incapable of distinguishing odor concentrations at any stage in adaptation, the evidence presented is not sufficient to reach this conclusion.

      We have expanded our dataset and now show that the behavioral response is significantly different for high and low concentration exposures of the same odorant. This was observed for all four odorants in our study (refer to Revised Fig. 1F).

      (2) While conclusions center on concepts related to the combination of activated neurons or the "active ensemble", this specific level of description is not directly demonstrated in any part of the results. We see individual neural responses and dimensional reduction analyses, but we are unable to assess to what extent the activated ensemble is maintained across experience.

      We have done several additional analyses (see provisional response). Notably, we have corroborated our dimensionality reduction and correlation analysis results with a quantitative classification analysis that convincingly demonstrates that odor identity and intensity of the odorant can be decoded from the ensemble neural activity, and this could be achieved in an adaptation-invariant fashion (refer to Revised Supplementary Fig. 4). 
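
To make concrete why a decoder can be adaptation-invariant when adaptation scales firing rates but preserves the set of active neurons, here is a toy sketch. It is not the authors' classifier, whose details are not given in this response: the cosine-similarity template matcher, the labels, and the firing-rate vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two firing-rate vectors (insensitive to overall gain)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def decode(response, templates):
    """Return the label whose template is most similar to the response."""
    return max(templates, key=lambda label: cosine(response, templates[label]))


# Hypothetical mean responses of four neurons; concentration changes which neurons respond.
templates = {
    ("odorA", "high"): [9.0, 7.0, 1.0, 0.0],
    ("odorA", "low"): [1.0, 6.0, 5.0, 0.0],
    ("odorB", "high"): [0.0, 1.0, 2.0, 9.0],
}

# Adaptation modeled as a uniform reduction in firing rate across the same ensemble.
adapted = [0.4 * r for r in templates[("odorA", "high")]]
print(decode(adapted, templates))
```

In this toy case the adapted response has a lower overall firing rate than the low-concentration template for some neurons, yet it is still assigned to the high-concentration class because its pattern across neurons is unchanged.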

      (3) There is little information about the variance or statistical strength of results described at the population level. While the PCA presents a compelling picture, the central point that concentration changes and adaptation alter population responses across separable dimensions is not demonstrated quantitatively. The correlation analysis that might partially address this question is presented to be visually interpreted with no additional testing.

      We have included a plot that compares the odor-evoked responses across all neurons (mean ± variance) at both intensity levels for each odorant (Revised Supplementary Fig. 5). This plot clearly shows how the ensemble neural activity profile varies with odor intensity and how these response patterns are robustly maintained across trials. 

      (4) Results are often presented separately for each odor stimulus or for separate datasets including two odor stimuli. An effort should be made to characterize patterns of results across all odor stimuli and their statistical reliability. This concern arises throughout all data presentations.

      We had to incorporate a 15-minute window between presentations of odorants to reset adaptation. Due to this, we were unable to extracellularly record from all four odorants at two intensities from a single experiment (~ 3.5 hours of recording for just 2 odorants at two intensities with one odorant at higher intensity repeated at the end; Fig. 2a). Therefore, we recorded two datasets. Each dataset captured the responses of ~80 PNs to two odorants at two intensities, one odorant at the higher concentration repeated at the end of the experiment to show repeatability of changes due to adaptation. 

      (5) The relevance of the inconclusive analysis of inferred adaptation mechanisms in Figure 2d-f and the single experiment including a complex mixture in Figure 7 to the motivating questions for this study are unclear.

      Figure 2d-f has been revised. While we agree that the adaptation mechanisms are not fully clear, there is a trend that the most active PNs are the neurons that change the most across trials. This change and the response in the first trial are negatively correlated, indicating that vesicle depletion could be an important contributor to the observed results. However, neurons that adapt strongly at higher intensities are not the ones that adapt at lower intensities. This complicates the understanding of how neural responses vary with intensities and the adaptation that happens due to repetition. This has been highlighted in the revised manuscript. 

      Regarding Figure 7, we wanted to examine the odor-specificity of the changes that happen due to repeated encounters of an odorant. Specifically, we wondered whether the neural response reduction and behavioral enhancements reflected a global, non-specific state change in the olfactory system brought about by the repetition of any odorant, or whether the observed neural and behavioral response changes were odor-specific.

      (6) Throughout the description of the results, typical standards for statistical reporting (sample size, error bars, etc.) are not followed. This prevents readers from assessing effect sizes and undermines the ability to assign a confidence to any particular conclusion.

      We have revised the manuscript to fix these issues and included sample size and error bars in our plots.  

      Reviewer #2 (Public Review):

      Summary:

      The authors' main goal was to evaluate how both behavioral responses to odor, and their early sensory representations are modified by repeated exposure to odor, asking whether the process of adaptation is equivalent to reducing the concentration of an odor. They open with behavioral experiments that actually establish that repeated odor presentation increases the likelihood of evoking a behavioral response in their experimental subjects - locusts. They then examine neural activity patterns at the second layer of the olfactory circuit. At the population level, repeated odor exposure reduces total spike counts, but at the level of individual cells there seems to be no consistent guiding principle that describes the adaptation-related changes, and therefore no single mechanism could be identified.

      Both population vector analysis and pattern correlation analysis indicate that odor intensity information is preserved through the adaptation process. They make the closely related point that responses to an odor in the adapted state are distinct from responses to lower concentration of the same odor. These analyses are appropriate, but the point could be strengthened by explicitly using some type of classification analysis to quantify the adaptation effects. For example, a confusion matrix might show whether there is a gradual shift in odor representations, or whether there are trials where representations change abruptly.
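      The kind of classification analysis suggested here can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis: the nearest-centroid decoder, the leave-one-trial-out scheme, the function names, and the three-neuron response vectors are all hypothetical choices made for the example.

```python
# Illustrative sketch only: nearest-centroid decoding of condition labels
# from population response vectors, summarized as a confusion matrix.
# All data below are made up; none of it comes from the study.
from collections import defaultdict
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def confusion_matrix(trials):
    """Leave-one-trial-out nearest-centroid classification.

    trials: list of (label, response_vector) pairs.
    Returns {true_label: {predicted_label: count}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for i, (true_label, vec) in enumerate(trials):
        # Build class centroids from every trial except the held-out one.
        by_label = defaultdict(list)
        for j, (label, other) in enumerate(trials):
            if j != i:
                by_label[label].append(other)
        centroids = {lab: centroid(vs) for lab, vs in by_label.items()}
        predicted = min(centroids, key=lambda lab: euclidean(vec, centroids[lab]))
        counts[true_label][predicted] += 1
    return {t: dict(p) for t, p in counts.items()}


# Hypothetical 3-neuron responses (spikes/s): adaptation scales the pattern
# down, while a lower concentration activates a different set of neurons.
trials = [
    ("high", [10.0, 8.0, 0.5]),
    ("high", [7.0, 5.5, 0.4]),   # later, adapted trial
    ("high", [6.0, 5.0, 0.3]),   # later, adapted trial
    ("low", [1.0, 0.5, 6.0]),
    ("low", [0.8, 0.4, 4.5]),
    ("low", [0.7, 0.3, 4.0]),
]
cm = confusion_matrix(trials)
print(cm)
```

      In a matrix like this, off-diagonal counts that grow with trial number would point to a gradual drift of the representation, whereas isolated misclassified trials would point to abrupt changes.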

      Strengths:

      One strength is that the work has both behavioral read-out of odor perception and electrophysiological characterization of the sensory inputs and how both change over repeated stimulus presentations. It is particularly interesting that behavioral responses increase while neuronal responses generally decrease. Although the behavioral effect could occur fully downstream of the sensory responses the authors measure, at least those sensory responses retain the core features needed to drive behavior despite being highly adapted.

      Weaknesses:

      Ultimately no clear conceptual framework arises to understand how PN responses change during adaptation. Neither the mechanism (vesicle depletion versus changes in lateral inhibition) nor even a qualitative description of those changes. Perhaps this is because much of the analysis is focused on the entire population response, while perhaps different mechanisms operate on different cells making it difficult to understand things at the single PN level.

      From the x-axis scale in Fig 2e,f it appeared to me that they do not observe many strong PN responses to these stimuli, everything being < 10 spikes/sec. So perhaps a clearer effect would be observed if they managed to find the stronger responding PNs than captured in this dataset.

      We thank the reviewer for his/her evaluation of our work. Indeed, our work does not clarify the mechanism that underlies the adaptation over trials, and how this mechanism accounts for adaptation that is observed at two different intensities of the same odorant. However, as we highlight in the revised manuscript, there is some evidence for the vesicle depletion hypothesis. For the plots shown in Fig. 2, the firing rates were calculated after averaging across time bins and trials. Hence, the lower firing rates. The peak firing rates of the most active neurons are ~100 Hz. So, we are certain that we are collecting responses from a representative ensemble of neurons in this circuit.

      Reviewer #3 (Public Review):

      Summary:

      How does the brain distinguish stimulus intensity reduction from response reductions due to adaptation? Ling et al study whether and how the locust olfactory system encodes stimulus intensity and repetition differently. They show that these stimulus manipulations have distinguishable effects on population dynamics.

      Strengths:

      (1) Provides a potential strategy with which the brain can distinguish intensity decrease from adaptation -- while both conditions reduce overall spike counts, intensity decrease can also change which neurons are activated, and adaptation only changes the response magnitude without changing the active ensemble.

      (2) By interleaving a non-repeated odor, they show that these changes are odor-specific and not a non-specific effect.

      (3) Describes how proboscis orientation response (POR) changes with stimulus repetition. Unlike the spike counts, POR increases in probability with stimulus. The data portray the variability across subjects in a clear way.

      We thank the reviewer for the summary and for highlighting the strengths of our work.

      Weaknesses:

      (1) Behavior

      a. While the "learning curve" of the POR is nicely described, the behavior itself receives very little description. What are the kinematics of the movement, and do these vary with repetition? Is the POR all-or-nothing or does it vary trial to trial?

      The behavioral responses were monitored in unconditioned/untrained locusts. Hence, these are innate responses to the odorants. These innate responses are usually brief and occur after the onset of the stimulus. However, there is variability across locusts and trials (refer to Revised Supplementary Fig. 1). When the same odorant is conditioned with food reward, the POR responses become more stereotyped and occur rapidly within a few hundred milliseconds.

      Author response image 1.

      POR response dynamics in a conditioned locust. The palps were painted in this case (left panel), and the distance between the palps was tracked as a function of time (right panel).

      b. What are the reaction times? This can constrain what time window is relevant in the neural responses. E.g., if the reaction time is 500 ms, then only the first 500 ms of the ensemble response deserves close scrutiny. Later spikes cannot contribute.

      This is an interesting point. We had done this analysis for conditioned POR responses. For innate POR, as we noted earlier, there is variability across locusts. Many responses occur rapidly after odor onset (<1 s), while some responses do occur later during odor presentation and in some cases after odor termination. It is important to note that these dynamical aspects of the POR response, while super interesting, should occur at a much faster time scale compared to the adaptation that we are reporting across trials or repeated encounters of an odorant.

      c. The behavioral methods are lacking some key information. While references are given to previous work, the reader should not be obligated to look at other papers to answer basic questions: how was the response measured? Video tracking? Hand scored?

      We agree and apologize for the oversight. We have revised the methods and added a video to show the POR responses. Videos were hand-scored. 

      d. Can we be sure that this is an odor response? Although airflow out of the olfactometer is ongoing throughout the experiment, opening and closing valves usually creates pressure jumps that are likely to activate mechanosensors in the antennae.

      Interesting. We have added a new Supplementary Fig. 2 that shows that the POR even to presentations of paraffin oil (solvent; control) is negligible. This should confirm that the POR is a behavioral response to the odorant.

      Furthermore, all other potential confounds identified by the reviewer are present for every odorant and every concentration presented.  However, the POR varies in an odor-identity and intensity-specific manner. 

      e. What is the baseline rate of PORs in the absence of stimuli?

      Almost zero. 

      f. What can you say about the purpose of the POR? I lack an intuition for why a fly would wiggle the maxillary palps. This is a question that is probably impossible to answer definitively, but even a speculative explanation would help the reader better understand.

      The locusts use these finger-like maxillary palps to grab a grass blade while eating. Hence, we believe that this might be a preparatory response to feeding. We have noted that the PORs are elicited more by food-related odorants. Hence, we think it is a measure of odor appetitiveness. This has been added to the manuscript. 

      (2) Physiology

      a. Does stimulus repetition affect "spontaneous" activity (i.e., firing in the interstimulus interval)? To study this question, in Figures 2b and c, it would be valuable to display more of the prestimulus period, and a quantification of the stability or lability of the inter-stimulus activity.

      Done. Yes, the spontaneous activity does appear to change in an odor-specific manner. We have done some detailed analysis of the same in this preprint:

      Ling D, Moss EH, Smith CL, Kroeger R, Reimer J, Raman B, Arenkiel BR. Conserved neural dynamics and computations across species in olfaction. bioRxiv [Preprint]. 2023 Apr 24:2023.04.24.538157. doi: 10.1101/2023.04.24.538157. PMID: 37162844; PMCID: PMC10168254

      b. When does the response change stabilize? While the authors compare repetition 1 to repetition 25, from the rasters it appears that the changes have largely stabilized after the 3rd or 4th repetition. In Figure 5, there is a clear difference between repetition 1-3 or so and the rest. Are successive repetitions more similar than more temporally-separated repetitions (e.g., is rep 13 more similar to 14 than to 17?). I was not able to judge this based on the dendrograms of Figure 5. If the responses do stabilize as it appears, it would be more informative to focus on the dynamics of the first few repetitions.

      The reviewer makes an astute observation. Yes, the changes in firing rates are larger in the first three trials (Fig. 3c). The ensemble activity patterns, though, are relatively stable across all trials as indicated by the PCA plots and classification analysis results.

      Author response image 2.

      Correlation as a function of trial number. All correlations were made with respect to the odor-evoked responses in the last odor trial of hex(H) and bza(H).
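      The quantity described in this caption can be illustrated with a short sketch. It is an assumption-laden example rather than the authors' code: Pearson correlation is assumed as the similarity measure, the function names are hypothetical, and the four-neuron response vectors are invented.

```python
# Illustrative sketch only: correlation of each trial's population response
# with the response in a reference (here, the last) trial.
# The response vectors below are hypothetical.
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sx * sy)


def correlation_to_reference(trial_vectors, ref_index=-1):
    """Correlate every trial's response vector with the reference trial."""
    ref = trial_vectors[ref_index]
    return [pearson(vec, ref) for vec in trial_vectors]


# Rows are trials, columns are neurons (spikes/s). Trial 3 is trial 1 scaled
# by one half, i.e. a weaker response carried by the same ensemble.
responses = [
    [10.0, 8.0, 1.0, 0.5],
    [7.0, 6.0, 1.0, 0.5],
    [5.0, 4.0, 0.5, 0.25],
]
corrs = correlation_to_reference(responses)
print([round(c, 3) for c in corrs])
```

      Because Pearson correlation is insensitive to overall scaling, a response that shrinks uniformly across neurons stays perfectly correlated with the reference, which is why this measure separates a change in magnitude from a change in the active ensemble.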

      c. How do temporal dynamics change? Locust PNs have richly varied temporal dynamics, but how these may be affected is not clear. The across-population average is poorly suited to capture this feature of the activity. For example, the PNs often have an early transient response, and these appear to be timed differently across the population. These structures will be obscured in a cross population average. Looking at the rasters, it looks like the initial transient changes its timing (e.g., PN40 responses move earlier; PN33 responses move later.). Quantification of latency to first spike after stimulus may make a useful measure of the dynamics.

      As noted earlier, to keep our story simple in this manuscript, we have only focused on the variations across trials (i.e., much slower response dynamics). We did this as we are not recording neural and behavioral responses from the same locust. We plan to do this and directly compare the neural and behavioral dynamics in the same locust.
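      For readers interested in the latency measure raised in comment (c), one possible quantification is sketched below. It is not part of the manuscript's analysis; the function name, the optional window parameter, and the spike times are hypothetical.

```python
# Illustrative sketch only; spike times and stimulus onset are hypothetical.
def first_spike_latency(spike_times, stim_onset, max_latency=None):
    """Latency (s) from stimulus onset to the first spike at or after onset.

    Returns None if no spike falls within the allowed window.
    """
    for t in sorted(spike_times):
        latency = t - stim_onset
        if latency < 0:
            continue  # spike occurred before the stimulus
        if max_latency is not None and latency > max_latency:
            break  # first post-onset spike is already outside the window
        return latency
    return None


# Spike times (s) for one hypothetical PN across three repetitions.
trials = [
    [0.40, 1.30, 1.55, 2.10],
    [0.90, 1.22, 1.60],
    [0.10, 1.18, 1.21, 1.90],
]
onset = 1.0
latencies = [first_spike_latency(tr, onset, max_latency=0.5) for tr in trials]
print(latencies)
```

      Tracking this value per neuron across repetitions would show whether the initial transient shifts earlier or later with adaptation.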

      d. How legitimate is the link between POR and physiology? While their changes can show a nice correlation, the fact that the data were taken from separate animals makes them less compelling than they would be otherwise. How feasible is it to capture POR and physiology in the same prep?

      This would be most helpful, but I suspect may be too technically challenging to be within scope.

      The antennal lobe activity is the input about the volatile chemicals encountered by the locust. The POR is a behavioral output. Hence, we believe that examining the correlation between the olfactory system's input and output is a valid approach. However, we have only compared the mean trends in neural and behavioral datasets, and dynamics on a much slower timescale. We are currently developing the capability to record neural responses in behaving animals. This turned out to be a bit more challenging than we had envisioned. We plan to do fine-grained comparisons of the neural and behavioral dynamics, recommended by this reviewer, in those preparations.

      Further, we will also be able to examine whether the variability in behavioral responses could be predicted from neural activity changes in that prep.

    1. ABSTRACT: Single-cell RNA sequencing (scRNA-seq) has revolutionized the study of cellular heterogeneity, but the rapid expansion of analytical tools has proven to be both a blessing and a curse, presenting researchers with significant challenges. Here, we present SeuratExtend, a comprehensive R package built upon the widely adopted Seurat framework, which streamlines scRNA-seq data analysis by integrating essential tools and databases. SeuratExtend offers a user-friendly and intuitive interface for performing a wide range of analyses, including functional enrichment, trajectory inference, gene regulatory network reconstruction, and denoising. The package seamlessly integrates multiple databases, such as Gene Ontology and Reactome, and incorporates popular Python tools like scVelo, Palantir, and SCENIC through a unified R interface. SeuratExtend enhances data visualization with optimized plotting functions and carefully curated color schemes, ensuring both aesthetic appeal and scientific rigor. We demonstrate SeuratExtend’s performance through case studies investigating tumor-associated high-endothelial venules and autoinflammatory diseases, and showcase its novel applications in pathway-level analysis and cluster annotation. SeuratExtend empowers researchers to harness the full potential of scRNA-seq data, making complex analyses accessible to a wider audience. The package, along with comprehensive documentation and tutorials, is freely available at GitHub, providing a valuable resource for the single-cell genomics community.

      Practitioner Points: SeuratExtend streamlines scRNA-seq workflows by integrating R and Python tools, multiple databases (e.g., GO, Reactome), and comprehensive functional analysis capabilities within the Seurat framework, enabling efficient, multi-faceted analysis in a single environment. Advanced visualization features, including optimized plotting functions and professional color schemes, enhance the clarity and impact of scRNA-seq data presentation. A novel clustering approach using pathway enrichment score-cell matrices offers new insights into cellular heterogeneity and functional characteristics, complementing traditional gene expression-based analyses.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf076), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer name: Daniel A. Skelly

      Overall, this is a very nice writeup of a useful package that extends the Seurat package to expand possibilities for single cell analysts in R. I liked the visualization options, the ability to try certain python-based tools easily in R which was not previously easy, and some of the authors' new innovations like their use of pathway enrichment scores in broad ways. Kudos to the authors for releasing a package with really excellent documentation and tutorials!

      I think this paper could be made better if the authors stressed with a little more clarity how specifically their work is innovative. The text in the present manuscript is fine but reads like a bit of a grab bag of functionality. For example, from the abstract: "SeuratExtend offers a user-friendly and intuitive interface for performing a wide range of analyses, including functional enrichment, trajectory inference, gene regulatory network reconstruction, and denoising. The package integrates multiple databases, … and incorporates popular Python tools … [We] showcase its novel applications in pathway-level analysis and cluster annotation. SeuratExtend enhances data visualization …"

      How could they be more clear or specific? One example could be by categorizing what SeuratExtend can do that other packages can't. For example, I see innovations in perhaps three general areas: 1. Making single cell analyses easier/faster/prettier (i.e. visualizations, pathway enrichment) 2. Making previously published single cell tools more broadly accessible (e.g. first option to bring certain python tools to R) 3. New innovations (e.g. dimensionality reduction and clustering based on pathway enrichment scores; may not be completely new but I don't recall seeing this elsewhere) If this was added I feel the paper would more clearly communicate to readers the information necessary for them to choose whether they want to try the package.

      I have the following additional significant comments:

      * Integration of multiple databases for GSEA: these methods are good, but what about in a few years when those databases have been updated? Do the authors intend to continue updating? Could they provide a function for users to use their own database (e.g. .gaf and .obo files, for example for another model organism)? Similar comment about gene identifier conversion, which may need to be updated every few years.
      * "While the Python ecosystem has benefited greatly from the comprehensive scverse project [7], which utilizes the universal AnnData format to connect various tools and algorithms, a comparable integrated solution has been lacking in the R community. SeuratExtend addresses this gap by providing a unified framework centered around the Seurat object, effectively becoming the R counterpart to scverse." -> some might argue that SeuratWrappers is this solution. The authors should more clearly and explicitly comment on what SeuratExtend does differently/better than SeuratWrappers.
      * I'm not particularly convinced by the authors' example studies that used SeuratExtend. For example, they describe Hua-Vella et al. (2022) and Hua et al. (2023). These are very nice studies and I have no doubt they made use of SeuratExtend in their analyses. But I don't see anything these authors describe those authors doing as being uniquely possible with SeuratExtend. Perhaps SeuratExtend made their analyses easier, or faster. But it would be better if we had some further concrete details. For example, something communicating a message like one of the following: (1) the authors only tested method X on a whim because it was so easy to run in SeuratExtend, and found that it revealed unexpected biology Y; or (2) the authors were able to bring together method X which runs in R and method Y which runs in python, and the joint inference (not possible in other packages) revealed key result Z. If the authors of this manuscript can't point to those sorts of examples, then I'm not sure it adds much to include this discussion in the present paper.
      * I really liked the section "Novel Applications of SeuratExtend in Pathway-Level Analysis and Cluster Annotation", especially "Exploring and Analyzing Single-Cell Data at the Pathway Level". I thought these applications could perhaps be stressed a bit more strongly or made more prominent earlier in the paper.
      * Figures 2 and 3 are showing example plots from which we don't actually need to infer any important biology. I thought these figures could be combined and each individual plot type only shown once. (This is for clarity and I don't see anything incorrect about the authors' current plots.)
      * There may be some issues with dependencies for some users. For example, it prompted me to install viridis and loomR as I went through the Quickstart. I ended up encountering an error there is no package called 'loomR' while trying. I had to manually install with remotes::install_github(repo = "mojaveazure/loomR"). Maybe provide an explicit dependencies list/list of recommended packages to install?
      * I had an error the first time calling Palantir.RunDM(). I hadn't created a seuratextend environment. I found that I could do this manually using create_condaenv_seuratextend(), but that this wasn't supported for Apple Silicon chips. I would suggest that the authors do try to find a way to get this working on newer Apple chips, because Mac machines are very common among bioinformaticians in my experience.
      * While the writing is largely quite clear, I found it to be a bit voluminous. If the authors are able to cut down on text length that may help in emphasizing the key points that make their package valuable to users.

      I had these minor comments:

      * "Moreover, mainstream scRNA-seq analysis tools are primarily developed for either the R or Python platforms, with additional options like Nextflow and Snakemake": I suggest revising this sentence. The tools are developed in R or python languages, which I would not call platforms. I would reword that Nextflow and Snakemake are workflow management systems that provide additional options for pipeline automation.
      * "the R ecosystem surrounding Seurat appears relatively limited": I'm not sure I would agree with this. I counted wrappers for 17 methods currently. Yes it is true that there are more packages in scverse. However, I suggest moderating your claims about Seurat being limited.
      * Suggest removing snakemake from Table 1: it is really different from the other tools listed there.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Reviewer comment: “The authors did not clarify whether the observed protection to PTZ-induced convulsions after mild TBI is due to the reduced size of gap junctions and/or increased activity in hemichannels.” And “The super-resolution imaging only assesses Cx43 gap junction plaque size and density but not the non-junctional portion of Cx43.”

      Response and planned revision: To determine whether seizure protection in Cx43 S368A mice is due to reduced gap junction plaque density or reduced hemichannel function, we will conduct solubility assays to assess the ratio of insoluble (junctional) to soluble (cytoplasmic/hemichannel) Cx43 in Cx43S368A and C57BL/6 control mice after TBI/sham (as in Fig. 2A-D currently only in C57BL/6 control mice). In parallel, we will perform EtBr uptake assays in acute brain slices from Cx43S368A and C57BL/6 control animals to assess hemichannel function.

      Additionally, we will include super-resolution images without background subtraction, which show diffuse staining indicative of soluble Cx43. Of note, even at super-resolution individual gap junctions or hemichannels cannot be resolved. They appear as diffuse signal (currently not visible in our super-resolution images due to image deconvolution and background subtraction performed to isolate Cx43 plaques). Super-resolution imaging was used to count Cx43 gap junction plaque densities and size. Cx43 gap junction plaques are dense accruals of Cx43 immunostaining reminiscent of functional and closed gap junctions. Complementary experiments measured soluble (cytoplasmic Cx43 and hemichannels) and insoluble Cx43 (gap junctions) using biochemistry (Fig. 2A-D).

      Reviewer comment: “The immunofluorescent images for Fig. 2E and Fig. 5 were not counterstained for astrocytes or cell membrane. How can the authors be sure that these are expressed by astrocytes and not other cells in the brain?”

      Response and planned revision: Cx43 is predominantly expressed in astrocytes, with expression levels 10–100 times higher than in brain endothelial cells (e.g., Zhang et al., 2014; Vanlandewijck et al., Nature, 2018). As shown in Supplementary Fig. 2, our immunohistochemistry data reveal no overlap between Cx43 and endothelial cell markers, confirming that our staining protocol does not detect Cx43 in endothelial cells. Instead, the apparent localization of Cx43 along blood vessels reflects expression in astrocytic endfeet, which closely ensheath the vasculature. To further support this conclusion, we will conduct quantitative co-localization analyses of Cx43 with markers for neurons, microglia, oligodendrocytes, and NG2 glia in both Cx43S368A and C57BL/6 control mice. Additionally, we will include plots generated from publicly available single-cell RNA sequencing datasets to show that Cx43 mRNA is highly enriched in astrocytes and present at much lower levels in endothelial cells of the brain vasculature.

      Reviewer comment about developmental contributions to the phenotype of Cx43 S368A animals.

      Response: We cannot exclude a potential developmental component to the observed seizure protection in Cx43S368A mice. We included discussion of this possibility in the revised manuscript.

      Reviewer comments indicative of a lack of clarity around rationale and intent of specific experiments.

      Response: We thoroughly revised the Results section to explicitly state the rationale and purpose of each experiment. For example:

      Reviewer comment: “The immunofluorescent images for Fig. 1D and E were taken at low resolution compared to the Cx43 puncta size. This does not allow accurate quantification of the Cx43 GJs or HCs.”

      Response: The purpose of this experiment was to assess the heterogeneity of Cx43 expression (both junctional and non-junctional portions) with spatial resolution across a larger brain area. Complementary experiments here are quantification of protein amounts using western blot (Fig. 1B), quantification of junctional versus non-junctional Cx43 using the solubility assay and quantification of Cx43 plaques using super-resolution imaging (Fig. 2).

      Reviewer comment: “TBI did not change Cx43 plaque size or density (Fig. 5). What was the rationale for examining the effects in the S368A mutant?”

      Response: We found an increase in phosphorylated Cx43 at S368 after TBI, and Cx43S368A mutants are protected from seizures after administration of PTZ, suggesting an important role for this specific Cx43 phosphorylation site in pathology. We discussed in the manuscript that “in cardiovascular infection/disease has demonstrated maintenance of gap junction coupling (Gy et al., 2011; Padget et al., 2024) while reduced hemichannel opening probability was reported (Hirschhäuser et al., 2021) in Cx43S368A mice”, suggesting that the protective phenotype is likely due to modification of either Cx43 gap junctions or hemichannels. However, functional consequences on Cx43 biology upon phosphorylation at S368 or lack thereof in the Cx43S368A mutant remain unexplored in the brain. Cx43 plaque size and density are reflective of Cx43 gap junctions and were therefore examined in Cx43S368A mice to reveal a potential mechanism by which this mouse mutant is protected from seizures (even in the absence of TBI).

      Reviewer comment: “The IC50 for Tat-Gap19 for Cx43 HC is ~7 μM (Tocris). How can using it at 2 μM be effective?”

      Response: We reviewed our lab records and confirmed that 2 μM was a typographical error. The actual concentration used was 200 μM. This is consistent with the dose-response literature for astrocytes (e.g., Walrave et al., Glia 2018; Abudara et al., Front. Cell. Neurosci. 2014). We have now included these references in the manuscript.

      Reviewer comment: “Unclear whether mice in Fig. 4C received TBI.”

      Response: We clarified that these mice were naïve, i.e. not subjected to TBI or sham procedures. This is now explicitly stated in both the Methods and the Results.

      Reviewer comment: “CBX or Tat-Gap19 do not affect the phosphorylation state of Cx43.”

      Response: We clarified that we used CBX and Tat-Gap19 as established gap junction and hemichannel blockers, irrespective of phosphorylation state. We now note that Tat-Gap19 is a Cx43 mimetic peptide that specifically blocks Cx43 hemichannels.

      Reviewer comment: “It is unclear whether the EtBr quantification in Fig. 3D is for S100β+ astrocytes.”

      Response: We clarified that the quantification in Fig. 3D was performed exclusively in S100β+ astrocytes. Although neurons may take up EtBr under inflammatory conditions, they do not express Cx43 (as will be shown in Fig. 1 and Supplementary Data).

      Reviewer comment: “I believe that the 'W.' in ref 'W. Chen et al., 2018' is unnecessary.”

      Response: We will use the journal citation style implemented by a reference manager in the final version of the manuscript.

      Reviewer request to include two references related to phosphorylation and hemichannel permeability and the role of gap junctional coupling in epilepsy.

      Response: The PNAS reference was added to the manuscript.

      That reduction in gap junctional communication is a relevant factor in epilepsy is discussed in the introduction where we also cite original literature of the authors of the proposed review article: “Many pathologies (Gajardo-Gómez et al., 2017; Masaki, 2015; Orellana et al., 2011; Sarrouilhe et al., 2017; Vis et al., 1998; Wang et al., 2018), including traumatic brain injury (TBI) (B. Chen et al., 2017; W. Chen et al., 2019; Wu et al., 2013; Xia et al., 2024) and acquired epilepsy (Bedner et al., 2015; Deshpande et al., 2017; Walrave et al., 2018) present with altered Cx43 regulation, and are often equated with GJ dysfunction.”

      We feel that citing the original manuscripts more accurately reflects the current knowledge around the role of Cx43 in the context of epilepsy and other pathologies. Readers' access to the original literature also highlights more precisely the gaps in knowledge that this manuscript seeks to close.

      Reviewer comment: “I think the data of this manuscript is missing a control animal that would present all the compensation changes that occur during development that occur in mice carrying the mutated Cx43. Alternatively, a doable experiment would be the use of inducible KO/KI.”

      Response: Previous studies investigating the role of Cx43 in neuronal excitability have primarily used full or conditional knockout models, as described in our introduction. Interestingly, these studies report that global deletion of Cx43 increases seizure susceptibility. However, such models eliminate all Cx43-dependent functions—both junctional and non-junctional—making it difficult to pinpoint the specific mechanisms underlying the observed effects. They do not distinguish whether increased excitability results from loss of gap junction coupling, disruption of hemichannel function, or depletion of cytoplasmic Cx43 signaling. In contrast, our current study does not aim to eliminate Cx43, but instead employs a targeted approach to interrogate the functional significance of a regulatory phosphorylation site, S368. This site is dynamically phosphorylated following TBI and has been previously associated—albeit only through correlative data—with seizure activity and other neuropathologies. By isolating the contribution of this post-translational modification while preserving overall Cx43 expression, our study provides novel mechanistic insight into how phosphorylation modulates Cx43 function and astrocyte-mediated regulation of brain excitability.

      We appreciate the thoughtful suggestion to generate a conditional knock-in model to isolate developmental from acute effects of the Cx43 S368A mutation. However, the GJA1 gene locus is not amenable to this type of targeting (we explored this possibility with a . We also considered AAV-mediated CRISPR/dCas9 editing as an alternative, but current limitations in CNS transduction efficiency, promoter specificity, and guide RNA availability for precise point mutation insertion make this approach similarly unfeasible at this stage. Thus, while we acknowledge the developmental caveat (which we now discuss in the manuscript), the current manuscript provides novel and meaningful insight into the role of the Cx43S368 regulatory phosphorylation site in the context of astrocyte biology and seizure susceptibility and forms a strong foundation for future studies.

      Thank you again for the opportunity to revise and strengthen our manuscript. We believe these planned experiments and clarifications address the reviewers' concerns in a thorough and scientifically rigorous manner.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      Summary: The authors focused on medaka retinal organoids to investigate the mechanism underlying eye cup morphogenesis. The authors succeeded in inducing lens formation in fish retinal organoids using 3D suspension culture with minimal growth factor-containing media containing HEPES. At day 1, Rx3:H2B-GFP+ cells appear in the surface region of organoids. At day 1.5, Prox1+ cells appear in the interface area between the organoid surface and the core of the central cell mass, which later develops into a spherical lens. Thus, Prox1+ cells cover the surface of the internal lens cell core. At day 2, foxe3:GFP+ cells appear in the Prox1+ area, where the early lens fiber marker, LFC, starts to be expressed. In addition, foxe3:GFP+ cells incorporate EdU, indicating that foxe3:GFP+ cells have lens epithelial cell characteristics. At day 4, cry:EGFP+ cells differentiate inside the spherical lens core, whose surface area consists of LFC+ and Prox1+ cells. Furthermore, at day 4, the lens core moves towards the surface of retinal organoids to form an eye cup-like structure, although this "inside-out" morphogenetic mechanism differs from the in vivo cellular "outside-in" mechanism of eye cup formation. From these data, the authors conclude that optic cup formation, especially the positioning of the lens, is established in retinal organoids through a mechanism different from that of in vivo morphogenesis.

      Overall, the manuscript presentation is nice. However, some points regarding the underlying mechanism remain unclear. My comments are shown below.

      Major comments

      1) At the initial stage of retinal organoid morphogenesis, a spherical lens is centrally positioned inside the retinal organoids, with the central lens core covered by the outer cell sheet of retinal precursor cells. I wonder if the formation of this structure may be understood in terms of differential cell adhesive activity or mechanical tension between lens core cells and the retinal cell sheet, just like the previous study by the Heisenberg lab on the spatial patterning of endoderm, mesoderm and ectoderm (Nat. Cell Biol. 10, 429 - 436 (2008)). Lens core cells may be integrated inside the retinal cell mass by cell sorting through direct interaction between retinal cells and lens cells, or between lens cells and the culture media. After day 1, it is also possible that the lens core moves towards the surface of retinal organoids if the adhesive/tensile force state of lens core cells is changed by secretion of extracellular matrix. I wonder if the authors could measure the physical properties, such as adhesive activity and solidness, of retinal precursor cells and lens core cells. If retinal organoids at day 1 are dissociated and cultured again, do they show the same patterning, with an internal lens core covered by the outer retinal cell sheet?

      Response: The question of whether differential adhesive activity is involved in cell sorting and lens formation is indeed very intriguing. To address this point, we will include an additional experiment (see Revision Plan, experiment 1). This experiment will be based on the dissociation and re-aggregation of lens-forming organoids as suggested by the reviewer. To monitor cell type-specific sorting, we will employ the lens progenitor reporter line Foxe3::GFP and the retina-specific Rx2::H2B-RFP. If different adhesive activities of lens and retinal progenitor cells are involved and drive the process of cell sorting, dissociation and re-aggregation will result in cell sorting based on their identity.

      2) The optic cup is evaginated from the lateral wall of the neuroepithelium of the diencephalon. In zebrafish, cell movement occurs from the pigment epithelium to the neural retina during eye morphogenesis in an FGF-dependent manner. How is medaka optic cup morphogenesis coordinated? I also wonder if the authors could track cell migration during optic cup morphogenesis to reveal how cell migration and cell division are regulated in the lens of the medaka retinal organoids. It would also be interesting to examine how retinal cell movement is coordinated in medaka retinal organoids.

      Response: Looking into the detail of how the optic cup-looking tissue arrangement of ocular organoids is achieved at the cellular level is of course interesting. Our previous study showed that optic vesicles of medaka retinal organoids do not form optic cups (for details please see Zilova et al., 2021, eLIFE). We assume that the formation of the cup-looking structure of the ocular organoids is mediated by the following processes: establishment of retina and lens domains at specific regions of the organoid – retina on the surface and lens in the center (see Figure S2 d and Figure 3e, and Figure 4). Further dislocation of the centrally formed lens towards the organoid periphery through the retina layer places the lens at the periphery while retinal cells stay static. We assume that the “cup-like” shape is acquired by extrusion of the lens from the center of the organoid. To clarify this process with respect to tissue rearrangements and cell movements, we will include additional experiments (see Revision Plan, experiment 2) and follow lens- and retina-fated cells (by employing lens-specific Foxe3::GFP and retina-specific Rx2::H2B-RFP reporter lines) through the process of lens extrusion to dissect the individual contribution of retinal/lens cells to this process (cross-reference with Reviewer #2).

      3) The authors showed that blockade of FGF signaling affects lens fiber differentiation in day 1-2, whereas lens formation seems to be intact in the presence of FGF receptor inhibitor in day 0-1. I suggest the authors to examine which tissue is a target of FGF signaling in retinal organoids, using markers such as pea3, which is a downstream target of ERK branch of FGF signaling. Since FGF signaling promotes cell proliferation, is the lens core size normal in SU5402-treated organoids from day 0 to day 1?

      Response: Assessing the activity of FGF signaling (cross-reference to Reviewer #3) in the organoids is indeed an important point. To address which tissue is the target of FGF signaling we will include additional experiments and assess the phosphorylation status of ERK (pERK) and expression of the ERK downstream target pea3, as suggested by the reviewer (see Revision Plan, experiment 3). That will allow us to identify the tissue within the organoid responding to FGF signaling.

      Lens core size of organoids treated with SU5402 from day 0 to day 1 is fully comparable to the control (please see Figure 6b).

      4) Fig. 3f and 3g indicate that there is some cell population located between foxe3:GFP+ cells and rx2:H2B-RFP+ cells. What cell type occupies the interface area between foxe3:GFP+ cells and rx2:H2B-RFP+ cells?

      Response: That is for sure an interesting question. We are aware of this population of cells. We currently do not have data that would clarify the fate of those cells with certainty. We are following up on that question using scRNA sequencing; however, we will not be able to address it in the current manuscript.

      5) Fig. 5e indicates the depth of Rx3 expression at day 1. Is the depth the thickness of the Rx3-expressing cell sheet, which covers the central lens core in the organoids? If so, I wonder if the total cell number of the Rx3-expressing cell sheet may differ for each seeded-cell number, because the thickness is the same across seeded-cell numbers, but the surface area may differ depending on the size of the underlying lens core. Please clarify this point.

      Response: Yes. Figure 5e indicates the thickness of the cell sheet expressing Rx3 that lies on the surface of the organoid. Indeed, the number of Rx3-expressing cells (and lens cells) scales with the size of the organoid as stated in the submitted manuscript.

      6) Noggin application inhibits lens formation at day 0-1. BMP signaling regulates formation of lens placode and olfactory placode at the early stage of development. It is interesting to examine whether Noggin-treated organoid expands olfactory placode area. Please check forebrain territory markers.

      Response: What tissue differentiates at the expense of the lens in BMP inhibitor-treated organoids is of course an intriguing question. To address the identity of cells differentiated under this condition we will include an additional experiment (see Revision Plan, experiment 4 as suggested by the reviewer). We will check for the expression of Lhx2, Otx2 and Huc/D to address this point.

      I have no minor comments

      **Referees cross-commenting**

      I agree that all reviewers have similar suggestions, which are reasonable and provided the same estimated time for revision.

      Reviewer #1 (Significance (Required)):

      Strength: This study is unique. The authors examined eye cup morphogenesis using fish retinal organoids. The eye cup normally consists of the lens, the neural retina, pigment epithelium and optic stalk. However, retinal organoids seem to be simple and consist of two cell types, lens and retina. Interestingly, a similar optic cup-like structure is achieved in both cases; however, the underlying mechanism is different. It is interesting to investigate how eye morphogenesis is regulated in retinal organoids, under the unconstrained embryo-free environment.

      Limitation: The description is OK, but the analysis is not very profound. It is necessary to apply a bit more molecular and cellular level analysis, such as tracking of cell movement and visualization of FGF signaling in organoid tissues.

      Advancement: The current study is descriptive. It needs some conceptual advance that would impact the cell biology field or medical science.

      Audience: The target audience of the current study is still within the ophthalmology and neuroscience community, maybe translational/clinical rather than basic biology. To reach beyond specific fields, the authors need to formulate a general principle for cell and developmental biology.



      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      In this study from Stahl et al., the authors demonstrate that medaka pluripotent embryonic cells can self-organise into eye organoids containing both retina and lens tissues. While these organoids can self-organize into an eye structure that resembles the vertebrate eye, they are built from a fundamentally different morphogenetic process - an "inside-out" mechanism where the lens forms centrally and moves outward, rather than the normal "outside-in" embryonic process. This is a very interesting discovery, both for our understanding of developmental biology and the potential for tissue engineering applications. The study would benefit from some additional experiments and a few clarifications.

      The authors suggest that the lens cells are the ones that move from the central to a more superficial position. Is this an active movement of lens cells or just the passive consequence of the retina cells acquiring a cup shape? Are the retina cells migrating behind the lens or the lens cells pushing outwards? High-resolution imaging of organoid cup formation, tracking retina cells in combination with membrane labeling of all cells would help elucidate the morphogenetic processes occurring in the organoids. Membrane labeling would also be useful as Prox1 positive lens cells appear elongated in embryos while in the organoids, cell shapes seem less organised, less compact and not elongated (for example as shown in Fig 3f,g).

      Response: Looking into the detail of how the optic cup-looking tissue arrangement of ocular organoids is achieved at the cellular level is of course interesting. We assume that the formation of cup-looking structures of the ocular organoids is mediated by the following processes: establishment of retina and lens domains at specific regions of the organoid – retina on the surface and lens in the center (see Figure S2 d and Figure 3e, and Figure 4). Further dislocation of centrally formed lenses towards the organoid periphery through the retina layer places the lens at the periphery while retinal cells stay static. We assume that the “cup-like” shape is acquired by extrusion of the lens. To clarify this process with respect to tissue rearrangements and cell movements, we will include additional experiments (see Revision Plan, experiment 2). We will follow lens- and retina-fated cells (by employing lens-specific Foxe3::GFP and retina-specific Rx2::H2B-RFP reporter lines) through the process of lens extrusion to dissect the individual contribution of retinal/lens cells to this process (cross-reference with Reviewer #1).

      The organoids could be a useful tool to address how cell fate is linked to cell shape acquisition. In the forming organoids, retinal tissue initially forms on the outside, while non-retinal tissue is located in the centre; this central tissue later expresses lens markers. Do the authors have any insights into why fate acquisition occurs in this pattern? Is there a difference in proliferation rates between the centrally located cells and the external ones? Could it be that highly proliferative cells give rise to neural retina (NR), while lower proliferating cells become lens?

      Response: The question of how the retinal and lens domains are established in this specific manner is indeed intriguing and very interesting. We dedicated a part of the discussion to this topic. We discuss the role of the diffusion limit and the potential contribution of BMP and FGF signaling to this arrangement. Additional experiments (see Revision Plan, experiment 3) addressing the source and target tissues of FGF and BMP signaling in the organoid will ultimately bring more clarity to our understanding of the tissue arrangements in the organoid.

      Although analysis of the proliferation rate of the cells at the surface and in the central region of the organoid might possibly show some differences in the proliferation rates between lens and retinal cells, we do not have any indications that the proliferation rate itself would be instructive or superior to the cell fate decisions.

      What happens in organoids that do not form lenses? Do these organoids still generate foxe3 positive cells that fail to develop into a proper lens structure? And in the absence of lens formation, does the retina still acquire a cup shape?

      Response: Lens formation is primarily dependent on acquisition/specification of Foxe3-expressing lens placode progenitors. If those are not present, a lens does not develop. Once Foxe3-expressing progenitors are established, a lens is formed in unperturbed conditions (measured by the presence of expression of crystallin proteins). In such conditions, organoids that do not have a lens do not carry Foxe3-expressing cells.

      In the absence of the lens, the organoid is composed of retinal neuroepithelium that does not form an optic cup (for details of such phenotypes please see Zilova et al., 2021, eLIFE).

      The authors suggest that lens formation occurs even in the absence of Matrigel. Is the process slower in these conditions? Are the resulting organoids smaller? While there are indeed some LFC expressing cells by day 2, these cells are not very well organised and the pattern of expression seems dotty. Moreover, LFC staining seems to localise posterior to the LFC negative, lens-like structure (e.g. Fig. S1, 3 o'clock). How do these organoids develop beyond day 4? Do they maintain their structural integrity at later stages? The role of HEPES in promoting organoid formation is intriguing. Do the authors have any insights into why it is important in this context? Have the authors tried other culture conditions and does culture condition influence the morphogenetic pathways occurring within the organoids?

      Response: We thank the reviewer for pointing this out. We were not clear in the wording and description of our observation. Indeed, Matrigel is not required for acquisition of lens fate, which can be demonstrated with the expression of lens-specific markers. However, the presence of Matrigel has a profound impact on the structural aspects of organoid formation. Matrigel is essential for organization of retinal-committed cells into the retinal epithelium (Zilova et al., 2021, eLIFE). The absence of the structure of the retinal epithelium can indeed negatively impact the cellular organization and the overall lens structure. To clarify the contribution of Matrigel to the speed of organoid lens development and to the overall structure of the organoid lens, we will perform additional experiments (see Revision Plan, experiment 5). Using the Foxe3::GFP reporter line, we will measure the onset of lens-specific gene expression. In addition, we will use immunohistochemistry to assess the gross morphology and size of the organoids grown without Matrigel (cross-reference with Reviewer #3).

      The role of HEPES in lens formation is indeed very intriguing and currently under investigation. As HEPES is mainly used to regulate the pH of the culture media and pH might have an impact on multiple cellular processes, it will require a significant time investment to dissect the molecular mechanism underlying the effect of HEPES on the process of lens formation (cross-reference with Reviewer #3) and therefore cannot be addressed in the current manuscript.

      **Referees cross-commenting**

      Pleased to see that all the other reviewers are positive about the study and raise similar concerns and comments.

      Reviewer #2 (Significance (Required)):

      This is a very interesting paper, and it will be important to determine whether this alternative morphogenetic process is specific to medaka or if similar developmental routes can be recapitulated in organoid cultures from other vertebrate species.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      Summary: The manuscript by Stahl and colleagues reports an approach to generate ocular organoids composed of retinal and lens structures, derived from Medaka blastula cells. The authors present a comprehensive characterisation of the timeline followed by lens and retinal progenitors, showing these have distinct origins, and that they recapitulate the expression of differentiation markers found in vivo. Despite this molecular recapitulation, morphogenesis is strikingly different, with lens progenitors arising at the centre of the organoid, and subsequently translocating to the outside.

      Comments:

      -The manuscript presents a beautiful set of high quality images showing expression of lens differentiation markers over time in the organoids. The set of experiments is very robust, with high numbers of organoids analysed and reproducible data. The mechanism by which lens specification is promoted in these organoids is, however, poorly analysed, and the reader does not get a clear understanding of what is different in these experiments, as compared to previous attempts, to support lens differentiation. There is a mention to HEPES supplementation, but no further analysis is provided, and the fact that the process is independent of ECM contradicts, as the authors point out, previous reports. The manuscript would benefit from a more detailed analysis of the mechanisms that lead to lens differentiation in this setting.

      Response: The role of HEPES in lens formation is indeed very intriguing and under current investigation. As HEPES is mainly used to regulate the pH of the culture media and pH might have an impact on multiple cellular processes, it will require a significant time investment to dissect the molecular mechanism underlying the effect of HEPES on the process of lens formation (cross-reference with Reviewer #2) and therefore unfortunately cannot be addressed in the current manuscript.

      To clarify the contribution of Matrigel to organoid lens development, we will perform additional experiments (see Revision Plan, experiment 5). Using the Foxe3::GFP reporter line, we will measure the onset of lens-specific gene expression. In addition, we will use immunohistochemistry to assess the gross morphology and size of the organoids grown without Matrigel (cross-reference with Reviewer #2).

      -The markers analysed to show onset of lens differentiation in the organoids seem to start being expressed, in vivo, when the lens placode starts invaginating. An analysis of earlier stages is not presented. This would be very informative, allowing to determine whether progenitors differentiate as placode and neuroepithelium first, to subsequently continue differentiating into lens and retina, respectively. Could early placodal and anterior neural plate markers be analysed in the organoids? This would provide a more complete sequence of lens vs retina differentiation in this model.

      Response: Yes. The figures show the expression of lens and retinal markers in the embryo in later developmental stages and the timing of their expression can be documented with higher temporal resolution. In the revised version of the manuscript, we will provide the information about the onset of expression of Rx3::H2B-GFP (retina) and Foxe3::GFP (lens) (see attached figure). Rx3 represents one of the earliest markers labeling the presumptive eye field within the region of the anterior neural plate (S16, late gastrula). Foxe3::GFP expression can be detected within the head surface ectoderm before the lens placode is formed, showing that Foxe3 is a suitable marker of placodal progenitors in medaka.

      We are convinced that the onset of Rx3 and Foxe3-driven reporters is early enough to make the claim about the separate origin of the lens (placodal) and retinal (anterior neuroectoderm) tissues within the ocular organoids.

      -The analysis of BMP and Fgf requirement for lens formation and differentiation is suggestive, but the source of these signals is not resolved or mentioned in the manuscript. Are BMP4 and Fgf8 expressed by the organoids? Where are they coming from?

      Response: Indeed, addressing the source of BMP and FGF activation would bring more clarity in understanding the mechanism of retina/lens specification within the ocular organoids (cross reference with Reviewer #1). To address this point, we will include additional experiments (see Revision Plan, experiment 3). We will analyze the expression of respective ligands (Bmp4 and Fgf8) and activation of downstream effectors of BMP and FGF signaling pathways within the ocular organoids as suggested by Reviewer #1 and Reviewer #3.

      -The fact that the lens becomes specified in the centre of the organoid is striking, but it is for me difficult to visualise how it ends up being extruded from the organoid. Did the authors try to follow this process in movies? I understand that this may be technically challenging, but it would certainly help to understand the process that leads to the final organisation of retinal and lens tissues in the organoid. There is no discussion of why the morphogenetic mechanism is so different from the in vivo situation. The manuscript would benefit from explicitly discussing this.

      Response: Following the extruding lens in vivo is indeed a very relevant suggestion. To clarify the process of ocular organoid formation with respect to tissue rearrangements and cell movements, we will include an additional experiment (see Revision Plan, experiment 2). We will follow lens- and retina-fated cells (by employing lens-specific Foxe3::GFP and retina-specific Rx2::H2B-RFP reporter lines) through the process of lens extrusion (cross-reference with Reviewer #1 and Reviewer #2).

      **Referees cross-commenting**

      We all seem to have similar comments and concerns. I think overall the suggestions are feasible and realistic for the timeframe provided.

      Reviewer #3 (Significance (Required)):

      This study describes a reproducible approach to differentiate ocular organoids composed of lens and retinal tissues. The characterisation of lens differentiation in this model is very detailed, and despite the morphogenetic differences, the molecular mechanisms show many similarities to the in vivo situation. The manuscript however does not highlight, in my opinion, why this model may be relevant. Clearly articulating this relevance, particularly in the discussion, will enhance the study and provide more clarity to the readers regarding the significance of the study for the field of organoid research, ocular research and regenerative studies.

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


      Referee #3

      Evidence, reproducibility and clarity

      Summary: The manuscript by Stahl and colleagues reports an approach to generate ocular organoids composed of retinal and lens structures, derived from Medaka blastula cells. The authors present a comprehensive characterisation of the timeline followed by lens and retinal progenitors, showing these have distinct origins, and that they recapitulate the expression of differentiation markers found in vivo. Despite this molecular recapitulation, morphogenesis is strikingly different, with lens progenitors arising at the centre of the organoid, and subsequently translocating to the outside.

      Comments:

      • The manuscript presents a beautiful set of high quality images showing expression of lens differentiation markers over time in the organoids. The set of experiments is very robust, with high numbers of organoids analysed and reproducible data. The mechanism by which lens specification is promoted in these organoids is, however, poorly analysed, and the reader does not get a clear understanding of what is different in these experiments, as compared to previous attempts, to support lens differentiation. There is a mention to HEPES supplementation, but no further analysis is provided, and the fact that the process is independent of ECM contradicts, as the authors point out, previous reports. The manuscript would benefit from a more detailed analysis of the mechanisms that lead to lens differentiation in this setting.
      • The markers analysed to show onset of lens differentiation in the organoids seem to start being expressed, in vivo, when the lens placode starts invaginating. An analysis of earlier stages is not presented. This would be very informative, allowing to determine whether progenitors differentiate as placode and neuroepithelium first, to subsequently continue differentiating into lens and retina, respectively. Could early placodal and anterior neural plate markers be analysed in the organoids? This would provide a more complete sequence of lens vs retina differentiation in this model.
      • The analysis of BMP and Fgf requirement for lens formation and differentiation is suggestive, but the source of these signals is not resolved or mentioned in the manuscript. Are BMP4 and Fgf8 expressed by the organoids? Where are they coming from?
      • The fact that the lens becomes specified in the centre of the organoid is striking, but it is for me difficult to visualise how it ends up being extruded from the organoid. Did the authors try to follow this process in movies? I understand that this may be technically challenging, but it would certainly help to understand the process that leads to the final organisation of retinal and lens tissues in the organoid. There is no discussion of why the morphogenetic mechanism is so different from the in vivo situation. The manuscript would benefit from explicitly discussing this.

      Referees cross-commenting

      We all seem to have similar comments and concerns. I think overall the suggestions are feasible and realistic for the timeframe provided.

      Significance

      This study describes a reproducible approach to differentiate ocular organoids composed of lens and retinal tissues. The characterisation of lens differentiation in this model is very detailed, and despite the morphogenetic differences, the molecular mechanisms show many similarities to the in vivo situation. The manuscript however does not highlight, in my opinion, why this model may be relevant. Clearly articulating this relevance, particularly in the discussion, will enhance the study and provide more clarity to the readers regarding the significance of the study for the field of organoid research, ocular research and regenerative studies.

    1. In this article, the authors present a study using different networks from various data sources to measure differences in gathering scholarly document topics and to show which networks provide the best information to represent the scientific topics considered appropriately. The work is built on a previous contribution and analyses networks obtained from six sources: scholarly document authors, Facebook users, Twitter users and conversations, patents, and policy documents. These networks are also accompanied by other networks, i.e. the text similarity network and the citation network, that are mainly used for comparison purposes.

      The work is of particular interest to the part of the scholarly community working on science map generation. However, some passages need further explanation to be clear to the reader.

      1. In the abstract, there is a mention of traditional and non-traditional data sources. While in the text of the article there are, indeed, some clarifications, it would be ideal to briefly explain in the abstract what the authors mean by these terms, since it is not immediately clear what a traditional data source is in the context of topic identification.

      2. In the introduction, the authors anticipate the outcomes of a previous work they have conducted on a similar topic. They claim that some topics are well-represented in maps based on citation links and text similarity, while others are not. However, it is not clear which sources they have used to get to this claim, and it is also not evident what the main difference is that characterises the current work compared to the previous one.

      3. In section 3, the authors introduce all the methods and materials used for their analysis. Despite the fact that some of the material cannot be shared since it is behind a paywall (e.g. the Web of Science data), by reading the section, it is not clear that all the code developed and the data obtained from the analysis have been published on Zenodo. While it is okay to address this aspect in the appropriate section at the end of the article, I would suggest anticipating this information at the beginning of section 3, citing the Zenodo record appropriately and clarifying which material is not included in that record, thus explaining why the experiment cannot be fully reproduced.

      4. Considering all the external sources of networks, it is not clear what the datetime window of each source is - are all these sources containing information from the year of publication of the oldest article in the document set considered to 2024?

      5. As far as I understood from the formula in section 3.7.1, the Purity is always calculated against a particular topic M. Thus, why not refer to such "M" in the formula definition, defining it in a function-like way Purity(N, M)? In addition, still in this section, it is not clear how the N clusters considered are selected. A running example of Purity calculation would probably help the reader here.

      6. In section 3.7.2, the denominator of the formula is set to 5. However, it is unclear why such a number is sensible for the calculation presented. Why not 6 or 7? Why not 3? I think the authors should clearly justify the choice of such a denominator by bringing in explicit evidence.

      7. In section 3.7.3, it is not entirely clear what the difference is between topics and topic categories.

      8. In the discussion, it would be good to expand a bit on the work's limitations and envision possible paths for future work in the area. A few points that I would love to see discussed in detail:

        • The analysis has been done by using sources that may have changed drastically in the past months/years - e.g. Twitter, which, after becoming X, has seen many academics leave for more open (in a broad sense) platforms and networks (e.g. Mastodon and, more recently, BlueSky). Would it be possible to gather the necessary data from these platforms to run the study again? If yes, would it be possible to download them? If not, should we consider these sources unreliable for scientific purposes and, if so, what preconditions should be in place for their reliability? Considering the present situation, what is the relevance of the results obtained with the data gathered from Twitter (now X)?

        • The authors transparently state that some of the data used (e.g. Web of Science data) are not freely available to the reader, thus preventing the full replication of the study. Is it possible to substitute these closed sources with others offering open research information? For instance, OpenCitations for gathering the citation network (full disclosure: I'm director of OpenCitations), PubMed and PubMed Central for gathering titles and abstracts of the articles considered, etc.?

        • The core set of scholarly documents considered are primarily from the biomedical domain since the authors considered only those with a PubMed identifier specified. While the results shown are sensible for this domain, how well does the approach the authors presented scale to other scholarly areas, e.g. Social Sciences and Humanities? Is it possible to speculate that the approach presented is discipline-agnostic? Is there any evidence for such a claim?

      Some final remarks:

      A. The figures should be closer (i.e. maximum on the next page) to the place they are mentioned the very first time.

      B. The research question is introduced in section 1, and then it is not explicitly mentioned anymore in the text. It would be ideal to add an explicit reference to that question when the authors present appropriate evidence to answer it (e.g. in section 4) and to recall the answer to that question in the conclusion of the paper.

  4. Jul 2025
    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      __We thank the reviewers for the supportive suggestions and comments. We have addressed all comments underneath the original text in red. As suggested, we added line numbers to the text and use these numbers to refer to the changes made.__

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      The manuscript is well written and presents solid data, most of which is statistically analyzed and sound. Given the authors' previous comprehensive publications on seipin organization and interactions, it might be beneficial (particularly in the title and abstract) to emphasize that this manuscript focuses on the metabolic regulation of lipid droplet assembly by Ldb16, to distinguish it from previous work. Perhaps one consideration, potentially interesting, involves changes in lipid droplet formation under the growth conditions used for galactose-mediated gene induction.

      We thank the reviewer for the supportive comments and suggestions.

      Comments: (1) Fig. 3 and 4. The galactose induction of lipid droplet biogenesis in are1∆/2∆ dga1∆ lro1∆ cells through activation of a GAL1 promoter fusion to DGA1 is a sound approach for regulating lipid droplet formation. Although unlikely, carbon sources can impact lipid droplet proliferation and (potentially interesting) metabolic changes under growth in non-fermentable carbon sources may impact lipid droplet biogenesis; in fact, oleate has significant effects (e.g. PMID: 21422231; PMID: 21820081). The GAL1 promoter is a very strong promoter and the overexpression of DGA1 via this heterologous promoter might itself cause unforeseen changes. Affirmation of the results using another induction system might be beneficial.

      We thank the reviewer for these suggestions. In this study we focused on the organisation of the yeast seipin complex during the process of LD formation. We chose to use galactose-based induction of Dga1 because this is a well-established and widely used assay in the field, extensively characterized by many groups over the years. The tight control it provides, enabling synchronous and rapid LD induction, makes it the method of choice for many researchers. Importantly, the LDs formed using this assay are morphologically normal and involve the same components as LDs formed under other conditions.

      Regarding the role of metabolism in LD formation, it is worth noting that galactose is metabolized by yeast primarily through fermentation, following its conversion to UDP-glucose. Therefore, its use does not involve drastic metabolic changes. The impact of metabolism on LD biogenesis is an interesting question, but it falls beyond the scope of the current study.

      (2) Fig. 3B. Although only representative images are shown, the panel convincingly shows that lipid droplets do form upon galactose induction of DGA1 in are1∆/2∆ dga1∆ lro1∆ cells. However, it does not show to what extent. Are lipid droplets synthesized at WT levels? How many cells were counted? How many lipid droplets per cell? Is there a statistical difference with respect to WT cells?

      We did not assess these parameters in this study. The aim of the study was to assess the relations between components of the seipin complex with and without lipid droplets. For this purpose, inducing lipid droplet formation over a 4-hour period was sufficient to address that specific question. As mentioned above, LDs formed using this assay are morphologically normal and involve the same components as LDs formed under other conditions. That said, it is known that prolonged overexpression of Dga1 (> 12 hours) can lead to enlarged LDs.

      (3) Fig. 2D. It is not clear how standard deviation can be meaningfully applied to two data points, let alone providing a p-value. For some of these experiments, triplicate trials might provide a more robust statistical sampling.

      We thank the reviewer for this suggestion. We have added 2 more repeats to the Co-IP in figure 2.
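
      As a rough illustration of the reviewer's point about sample size (not part of the manuscript or of the authors' response), the sketch below computes an equal-variance two-sample t statistic with Python's standard library. The intensity values are hypothetical, the function name `pooled_t` is an assumption made for this example, and the choice of a pooled-variance test is likewise assumed rather than taken from the manuscript. With two replicates per group the test has only 2 degrees of freedom, so the two-sided 5% critical value is about 4.30; with four replicates per group (6 degrees of freedom) it drops to about 2.45.

```python
from math import sqrt
from statistics import mean, variance

# Two-sided 5% critical values of Student's t for the degrees of freedom used below.
T_CRIT_95 = {2: 4.303, 6: 2.447}


def pooled_t(a, b):
    """Return (t statistic, degrees of freedom) for an equal-variance two-sample t-test."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled variance weights each sample variance by its own degrees of freedom.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    se = sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / se, df


# Hypothetical normalised band intensities; these are NOT data from the manuscript.
control_n2, mutant_n2 = [1.00, 0.80], [0.50, 0.40]
control_n4, mutant_n4 = [1.00, 0.80, 0.95, 0.85], [0.50, 0.40, 0.45, 0.55]

for control, mutant in [(control_n2, mutant_n2), (control_n4, mutant_n4)]:
    t, df = pooled_t(control, mutant)
    verdict = "exceeds" if abs(t) > T_CRIT_95[df] else "does not exceed"
    print(f"n={len(control)} per group: t={t:.2f}, df={df}, {verdict} the 5% critical value")
```

      In this made-up example a twofold difference between groups does not reach the 5% threshold with two replicates, while a similar difference does with four, which is the practical reason for requesting at least three repeats.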

      Reviewer #1 (Significance (Required)):

      Klug and Carvalho report on the lipid droplet architecture of the yeast seipin complex. Specifically, the mechanism of yeast seipin Sei1 binding to Ldb16 and the subsequent recruitment of Ldo45 is analyzed. These results follow from a recent publication (PMID: 34625558) from the same authors and aim to define a more precise role for the components of the seipin complex. Using photo-crosslinking, Ldo45 and Ldo16 interactions are analyzed in the context of lipid droplet assembly.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Summary:

      Klug and Carvalho apply a photo-crosslinking approach, which has been extensively used in the Carvalho group, to investigate the subunit interactions of the seipin complex in yeast. The authors apply this approach to further study possible changes within the seipin complex following induction of neutral lipid synthesis and lipid droplet (LD) formation. The authors propose that Ldo45 makes contact with Ldb16 and that the seipin complex subunits assemble even in the absence of LDs.

      Major comments:

      Overall, this is a focused and well-executed study on one of the fundamental structural components of LDs. The study addresses the subunit interactions of the seipin complex but does not look into their functional consequences, for example how the mutations on Ldb16 that affect its interaction with Ldo45, influence LD formation; similarly, the authors make the interesting observation that Ldo16 may be differentially affected by the lack of neutral lipids (Fig. 3A) but this observation is not explored.

      We thank the reviewer for this comment. The Ldb16 mutations analyzed in this study have been previously characterized by us (see Klug et al., 2021 – Figure 3) and exhibit a mild defect in lipid droplet (LD) formation. This phenotype is unlikely to result from impaired Ldo16/45 recruitment, as deletion of Ldo proteins causes only a very mild effect on LD formation (as shown in Teixeira et al., 2018 and Eisenberg-Bord et al., 2018).

      We agree that the differential effect on Ldo proteins by the absence of neutral lipids is particularly interesting. However, its exploration falls outside of the scope of the current study and should be thoroughly investigated in the future.

      1. For the crosslinking pull-downs (Fig. 1), it seems that the authors significantly overexpress (ADH1 promoter) the Ldb16 subunit that carries the various photoreactive amino acid residues, while keeping the other (tagged) seipin complex members at endogenous levels. Would not this imbalance affect the assembly of the complex and therefore the association of the different subunits with each other?

      We thank the reviewer for this comment. In vivo site-specific crosslinking is a highly sensitive methodology to detect protein-protein interactions in a position-dependent manner. However, one of the caveats of the approach is the low efficiency of amber stop codon suppression and BPA incorporation. To mitigate this limitation, we (and others) induce the expression of the amber-containing protein (in this case Ldb16) from a strong constitutive promoter such as ADH1. Therefore, despite using a strong promoter, the overall levels of Ldb16 remain comparable to endogenous levels due to the inherently low efficiency of amber suppression. Moreover, it is known that when not bound to Sei1, Ldb16 is rapidly degraded in a proteasome-dependent manner (Wang, C.W. 2014), further preventing its accumulation.

      Although the authors do show delta4 cells with no LDs (Fig. 3B, 0h), galactose-inducible systems in yeast are known to be leaky. Given the authors' conclusion that the complex is "pre-assembled" irrespective of the addition of galactose, I think it would be important to confirm biochemically that there is no neutral lipid at time point 0. Alternatively, it may be better to simply compare wt vs dga1 lro1 or are1are2 mutants - there is no need for GAL induction since the authors look at one time point only.

      Among the various regulable promoters, GAL1 shows a superior level of control. For example, expression of essential genes from the GAL1 promoter frequently leads to cell death in glucose-containing media, a condition that represses the GAL1 promoter. Having said this, we cannot exclude that minute amounts of DGA1 are expressed prior to galactose induction. However, if this is the case, the resulting levels of TAG are insufficient to be detected by sensitive lipid dyes and to induce LDs, as noted by the reviewer. Therefore, we believe our conclusions remain valid. This is consistent with the wording we use in the text, where we refer to LD formation rather than complete loss of neutral lipids. To make this absolutely clear, we replaced the word “presence” with “abundance” in line 236.

      Lastly, we do not agree with the reviewer that using double mutants (are1/2 or dga1/lro1 mutants) would be sufficient, since these mutations are not sufficient to abolish LD formation, a key aspect of this study. The GAL1 system allows us to monitor 2 time points in the same cells: no LDs (time 0h) and with LDs (time 4h). The system proposed by the reviewer would only allow a snapshot of steady-state levels in different cells rather than within the same cell culture.

      Some methodological issues could be better detailed. For example, which of the three delta4 strains was used to induce neutral lipid in Fig. 4B? How exactly were the quantifications in Fig. 4D performed? (I assume they were done under non-saturating band intensity conditions, as for some residues it is difficult to conclude whether the blot aligns with the quantification results.)

      We thank the reviewer for these comments. We have clarified the strain number in the figure legend of figure 4B (strain yPC12630).

      We have also added the following text in rows 437-441 in the methods section: “Reactive bands were detected by ECL (Western Lightning ECL Pro, Perkin Elmer #NEL121001EA), and visualized using an Amersham Imager 600 (GE Healthcare Life Sciences). Data quantification was performed using Image Studio software (Li-Cor) to measure line intensity under non saturating conditions.”

      "our findings support the notion that Ldo45 is important for early steps of LD formation as previously proposed" I find this statement confusing given that the authors claim that Ldo45 is already bound to the complex before LD formation.

      We thank the reviewer for raising this important point. We believe that our findings support previous hypotheses on the role of Ldo45. It has been suggested that Ldo45 is important for the early stages of lipid droplet (LD) formation (Teixeira et al., 2018; Eisenberg-Bord et al., 2018). As such, Ldo45 would need to be recruited to the seipin complex before or at the onset of LD formation. The observation that Ldo45 is present at the complex prior to LD formation provides strong support for its role in the initial steps of this process.

      To clarify this idea in the manuscript, we have revised the sentence on line 310 as follows:

      “Irrespective of the mechanism, our findings support the notion that Ldo45 plays a role in the early steps of LD formation, as previously proposed…”

      The model in Fig. 5 is essentially the same as the one shown in Fig. 1G.

      To aid the reader and avoid confusion, we intentionally used a similar color scheme throughout the manuscript. This may contribute to the perception that the figures are very similar. However, there are clear distinctions between them. In Figure 1G, we summarize our findings regarding the positioning of Ldo45 within the complex and note that we do not yet have data on Ldo16. Building upon these findings, in Figure 5 we speculate where Ldo16 might interact with Ldb16 and highlight that recruitment of both Ldo16 and Ldo45 increases with neutral lipid availability.

      Therefore, we believe that both figures serve distinct and complementary purposes, and that each is useful for communicating our overall message.

      Minor comments

      In the pull-downs in Fig. 2C, it seems that full-length Ldb16 is not enriched after the FLAG IP. What is the reason for this?

      We thank the reviewer for raising this interesting aspect. We do not know why this occurs, but it is clear that full length Ldb16 is not efficiently pulled down. We could speculate that this has to do with access to the FLAG moiety at the C terminus that may become inaccessible due to interactions or folding in the long unstructured C-terminus of Ldb16. This might explain why when we truncate the C terminus in the 1-133 mutant we achieve a more efficient IP.

      In the blots in Fig. 2C and 3A, the anti-Dpm1 Ab seems to recognize a band in the IP fractions that is labelled as non-specific; however, this band is absent from the input.

      We thank the reviewer for raising this. This non-specific band is the light chain of the antibody used in the pull down that detaches from the matrix during elution – thus not found in the input. This is a common non-specific band that appears in Co-IP blots.

      Reviewer #2 (Significance (Required)):

      Regulation of seipin function is essential for proper LD biogenesis in eukaryotes, so this study addresses a fundamental question in the field. As stated above some functional analysis that goes beyond the biochemistry would be beneficial. There is some overlap with a recently published paper from the Wang group that also examines the assembly of seipin in yeast.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      The manuscript by Klug and Carvalho investigates the interaction of the yeast seipin complex (Sei1 and Ldb16) with Ldo45 and Ldo16. Using a site-specific photocrosslinking approach, the authors map some residues of the seipin complex in contact with Ldo45, demonstrating that Ldo45 likely binds to Ldb16 in the center of the Sei1-Ldb16 complex. They find that both Ldo45 and Ldo16 copurify with Ldb16. Complex assembly is demonstrated to occur independently of the presence of neutral lipids. An Ldb16 mutant, harbouring the transmembrane domain (1-133) but lacking the cytosolic region (previously shown to allow normal LD formation and still bind to Sei1) showed photocrosslinks with Ldo45, but not Ldo16. No crosslinks between Sei1 and either Ldo45 or Ldo16 were detected.

      Major: 1. Figure 2 shows CoIPs using different Ldb16 mutants/truncations to test for binding of Ldo45 and Ldo16. Both Ldo16 and Ldo45 copurify with full length Ldb16. Loss of the cytosolic part of Ldb16 strongly reduced binding of both Ldo45 and Ldo16, indicating that the TM-Helix-TM domain of Ldb16 (1-133) alone is not sufficient for proper binding of Ldo45 or Ldo16. The quantifications (2D and 2E) presented for this CoIP represent an n=2 with mean, standard deviation and statistics. For the statistical analysis to be meaningful, the authors need to increase their n to at least n=3. In addition, they refer to the statistics they use here as "two-sided Fischer's T-test" in the respective Figure legend. To my knowledge, there is no such test; is it Student's t-test or Fisher's exact test? Can the authors please clarify?

      We thank the reviewer for this comment and suggestions. We have now included 2 additional repeats for this experiment and the results essentially support our conclusion.

      The two-sided Fischer’s T-test is the name of the test in GraphPad Prism. We wanted to acknowledge the test name so that the reader can trace the exact test we used in the program.

      1. Figure 2E shows the same data as 2D with different normalization to highlight the differences between binding to the domain 1-133 per se and binding to this domain when the linker helix is mutated. These mutations seem to cause a further decrease in binding of both Ldo45 and Ldo16. Still, effects are rather small, and the n=2 does not allow any meaningful statistical tests. To make this point, the authors should increase their sample number (at least n=3) to show that this difference is indeed meaningful and to allow statistical analysis.

      We thank the reviewer for this comment and suggestions. We have now included 2 additional repeats for this experiment and the results essentially support our conclusion.

      For Ldo16, no crosslinks were detected with the Ldb16 TM-Helix-TM domain (Figure 1). In line with this, CoIP demonstrated that the interaction between Ldo16 and Ldb16 was strongly reduced when the Ldb16 domain 1-133 was used for IP. Still, additional mutation of the linker helix in this 1-133 domain further reduced this interaction (to a similar extent as for Ldo45). Could the authors please clarify why the additional mutations in the linker helix region also decreased the binding of Ldo16, though the authors conclude from their crosslinking approach in Fig. 1 that Ldo16 does not interact with this region?

      We thank the reviewer for raising this point. Our negative crosslinking results for Ldo16 do not exclude the possibility of binding to that region; rather, they indicate that we were unable to detect Ldo16 there. Additionally, mutations in the linker helix may influence how Ldb16 interacts with seipin, including its positioning within the seipin ring and the membrane bilayer. These structural changes could, in turn, affect Ldo16 recruitment in ways that we do not fully understand.

      Similarly, also in 4D, a quantification with n=2 is presented, showing that some of the crosslinks are more prominently detectable when LD biogenesis is induced. The findings of this manuscript are completely based on results obtained with CoIP and photocrosslinking, and quantification of a sufficient n to allow statistical analysis will be essential.

      While we agree that additional experiments are useful for the Co-IP because of variability between experiments, this is less of a concern for the photocrosslinking experiments. In the case of photocrosslinking, we typically see much less variability and normally, for a given position, the effects are much more “black and white”: either there is a crosslink or not.

      Why is no blot with crosslinked Ldb16 bands shown anywhere (but only non-crosslinked Ldb16, e.g. Fig. 1C)?

      We thank the reviewer for this comment. In all cases the amount of crosslinked product is very minor. This is particularly obvious in the case of Ldb16, where the non-crosslinked species dominates in the blots (as can be observed in figure S1B).

      Figure 3: The authors conclude that galactose-induced expression of either Dga1, Lro1 or Are1 in cells lacking all four enzymes for neutral lipid synthesis (quadruple deletion mutant) increases the levels of Ldb16. However, I do not see any difference on the FLAG-Ldb16 blot when comparing Ldb16 levels in the quadruple deletion mutant with or without Dga1, Lro1 or Are1, and no quantification is presented that might reveal very subtle differences not visible on the blot.

      We agree with the reviewer and modified the text to more accurately describe our results.

      OPTIONAL: Have the authors considered assessing which sites/domains of Ldo45 and Ldo16 are employed to bind to Ldb16?

      This is a logical next step that will be undertaken in a future study.

      Minor: 1. Page numbers would have been helpful to refer to specific text sections.

      Page numbers have been added.

      1. Figure 3C: Unclear to me why the authors label a part of their immunoblot where they detected HA with OSW5?

      This was a mistake and has been corrected.

      1. Figure 4D and the corresponding figure legend could be improved with respect to labeling, for clarity.

      We have added an X axis label and made extra clarifications in the legend.

      1. Please correct this sentence: "These variants we expressed in cells where the other subunits of the Sei1 complex were epitope tagged to facilitate detection and expressed their endogenous loci."

      This sentence has been corrected.

      Reviewer #3 (Significance (Required)):

      This is a short and interesting study completely based on UV-induced site-specific photocrosslinking and CoIPs that provides some new insights into the interaction surface between the Seipin complex and Ldo45 and the interaction between Ldo16 and Ldb16. Though in parts still premature, these findings will likely be of interest to the large community interested in lipid metabolism, expanding the role of Ldb16 from neutral lipid binding to regulator recruitment.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations for the authors):

      Thank you for your thorough review of our manuscript and your valuable suggestions. Here are our responses to each point you raised:

      (1) Novelty: Exploring the feasibility of extending the risk-scoring model to diverse cancer types could emphasize the broader impact of the research.

      Thank you so much for your thoughtful and insightful feedback. Your suggestion to explore extending the risk-scoring model to diverse cancer types is truly valuable and demonstrates your broad vision in this field. We deeply appreciate your interest in our research and the effort you put into providing such constructive input.

      After careful consideration, we have decided to focus our current study on the specific cancer type(s) we initially set out to explore. This decision was made to ensure that we can thoroughly address the research questions at hand, given our current resources, time constraints, and the complexity of the topic. By maintaining this focused approach, we aim to achieve more in-depth and reliable results that can contribute meaningfully to the understanding of this particular area.

      However, we fully recognize the potential significance of your proposed direction and firmly believe that it could be an excellent avenue for future research. We will definitely keep your suggestion in mind and may explore it in subsequent studies as our research progresses and evolves.

      (2) Improvement in Figure Presentation: The inconsistency in font formatting across figures, particularly in Figure 2 (A-D, E, F-H, I), Figure 3 (A-C, D-J, H, K), and the distinct style change in Figure 5, raises concerns about the professionalism of the visual presentation. It is recommended to standardize font sizes and styles for a more cohesive and visually appealing layout. This ensures that readers can easily follow and comprehend the graphical data presented in the article.

      The text in the picture has been revised as requested.

      (3) Enhancing Reliability of Immune Cell Infiltration Data: Address the potential limitations associated with relying solely on RNASeq data for immune cell infiltration analysis between ICD low and ICD high groups in Figure 2. It is advisable to discuss the inherent challenges and potential biases in this methodology. To strengthen the evidence, consider incorporating bladder cancer single-cell sequencing data, which could provide a more comprehensive and reliable understanding of immune cell dynamics within the tumor microenvironment.

      Thank you very much for your meticulous review and the highly constructive suggestions. Your insight regarding the limitations of relying on RNASeq data for immune cell infiltration analysis and the proposal to incorporate bladder cancer single-cell sequencing data truly reflect your profound understanding of the field. We deeply appreciate your efforts in guiding our research and the valuable perspectives you've offered.

      After careful deliberation, given our current research scope, timeline, and available resources, we've decided to focus on further discussing and addressing the challenges and biases inherent in RNASeq-based immune cell infiltration analysis. By delving deeper into the methodological limitations and conducting more in-depth statistical validations, we aim to provide a comprehensive and reliable interpretation of the data within our study framework. This focused approach allows us to maintain the integrity of our original research design and deliver robust findings on the relationship between immune cell infiltration and ICD in the current context.

      However, we fully acknowledge the significant value of your proposed single-cell sequencing approach. It is indeed a powerful method that could offer more detailed insights into immune cell dynamics, and we believe it holds great promise for future research in this area. We will keep your suggestion in mind as an important direction for potential future studies, especially when we plan to expand and deepen our exploration of the tumor microenvironment.

      (4) Clarity in Data Sources and Interpretation of Figure 5: In the results section, provide a detailed and transparent explanation of the sources of data used in Figure 5. This includes specifying the databases or platforms from which the chemotherapy, targeted therapy, and immunotherapy data were obtained. Additionally, elucidate the rationale behind the chosen data sources and how they contribute to the overall interpretation of the study's findings. And, strangely, these immune-related genes are associated with cancer sensitivities to different targeted therapies.

      Thank you very much for your detailed and valuable feedback on Figure 5. We sincerely appreciate your careful review and insightful suggestions, which have provided us with important directions for improvement.

      Regarding the data sources in Figure 5, we used the pRRophetic algorithm to conduct a drug sensitivity analysis on the TCGA database. The reason for choosing these data sources is multi-faceted. Firstly, these databases and platforms are well-established and widely recognized in the field. They have strict data collection and verification processes, ensuring the accuracy and reliability of the data. For example, TCGA has a large-scale, long-term-accumulated chemotherapy case database, which can comprehensively reflect the clinical application and treatment effects of various chemotherapeutic drugs.

      Secondly, these data sources cover a wide range of cancer types and patient information, which can meet the requirements of our study's diverse sample size and variety. This comprehensiveness enables us to conduct a more in-depth and representative analysis of the relationships between different therapies and immune-related genes.

      In terms of the overall interpretation of the study's findings, the use of these data sources provides a solid foundation. The accurate chemotherapy, targeted therapy, and immunotherapy data help us clearly demonstrate the associations between immune-related genes and cancer sensitivities to different treatments. This allows us to draw more reliable conclusions and provides a scientific basis for understanding the complex mechanisms of cancer treatment from the perspective of immune-gene-therapy interactions.

      As for the unexpected association between immune-related genes and cancer sensitivities to different targeted therapies, this is indeed a fascinating discovery. In our analysis, we hypothesized that immune-related genes may affect the tumor microenvironment, thereby influencing the response of cancer cells to targeted therapies. Although this finding is currently beyond our initial expectations, it has opened up a new research direction for us. We will further explore and verify the underlying mechanisms in future research.

      Once again, thank you for your guidance. We will make corresponding revisions and improvements according to your suggestions to make our research more rigorous and complete.

      (5) Legends and Methods: Address the brevity and lack of crucial details in the figure legends and methods section. Expand the figure legends to include essential information, such as the number of samples represented in each figure. In the methods section, provide comprehensive details, including the release dates of databases used, versions of coding packages, and any other pertinent information that is crucial for the reproducibility and reliability of the study.

      We would like to express our sincere gratitude for your valuable feedback on the figure legends and methods section of our study. We highly appreciate your sharp observation of the issues regarding the brevity and lack of key details, which are crucial for further improving our research.

      We have supplemented the methods section with data including the number of samples, the release dates of the databases used, and the versions of the coding packages, etc. For TCGA samples: 421 tumor samples and 19 normal samples. Database release date: March 29, 2022, version v36. Coding package version: R version 4.1.1. We will immediately proceed to supplement these key details, making the research process and methods transparent. This will allow other researchers to reproduce our study more accurately and enhance the persuasiveness of our research conclusions.

      (6) Evidence Supporting Immunotherapy Response Rates: The importance of providing a robust foundation for the conclusion regarding lower immunotherapy response rates. Strengthen this section by offering a more detailed description of sample parameters, specifying patient demographics, and presenting any statistical measures that validate the observed trends in Figure 5Q-T. More survival data are required to conclude. Avoid overinterpretation of the results and emphasize the need for further investigation to solidify this aspect of the study.

      Thank you very much for your professional and meticulous feedback on the content related to immunotherapy response rates in our study! Your suggestions, such as providing a solid foundation for the conclusions and supplementing key information, are of great value in enhancing the quality of our research, and we sincerely appreciate them.

      The data in Figures 5Q to T are from the TCGA database, which has already been provided. The statistical measure used for Figures 5Q to T is the P-value, which has been marked in the figures. The survival data have been provided in Figure 3D.

      Reviewer #2 (Recommendations for the authors):

      Thank you for your thorough review of our manuscript and your valuable suggestions. Here are our responses to each point you raised:

      (1) There is no information on the samples studied. Are all TCGA bladder cancer samples studied? Are these samples all treatment naïve? Were any excluded? Even simply, how many samples were studied?

      Thank you so much for pointing out the lack of sample-related information. Your attention to these details has been extremely helpful in identifying areas for improvement in our study.

      All the samples in our study were sourced from the TCGA (The Cancer Genome Atlas) and TCIA (The Cancer Immunome Atlas) databases. It should be noted that the patient data in the TCIA database are originally from the TCGA database. Regarding whether the patients received prior treatment, this information was not specifically mentioned in our current report. Instead, we mainly relied on the scores of the prediction model for evaluation. Since all samples were obtained from publicly available databases, we understand the importance of clarifying their origin and characteristics.

      We sincerely apologize for the omission of the sample size and other relevant details. We will promptly supplement this crucial information in the revised version, including a detailed description of the sample sources and any relevant characteristics. This will ensure greater transparency and help readers better understand the basis of our research.

      For TCGA samples: 421 tumor samples and 19 normal samples. Database release date: March 29, 2022, version v36. Coding package version: R version 4.1.1.

      (2) What clustering method was used to divide patients into ICD high/low? The authors selected two clusters from their "unsupervised" clustering of samples with respect to the 34 gene signatures. A Delta area curve showing the relative change in area under the cumulative distribution function (CDF) for k clusters is omitted, but looking at the heatmap one could argue there are more than k=2 groups in that data. Why was k=2 chosen? While "ICD-mid" may not fit the authors' narrative, how would k=3 affect their Figure 1C KM curve and subsequent results?

      Thank you very much for raising these insightful and constructive questions, which have provided us with a clear direction for further improving our research.

      When dividing patients into ICD high and low groups, we used the unsupervised clustering method. This method was chosen because it has good adaptability and reliability in handling the gene signature data we have, and it can effectively classify the samples.

      Regarding the choice of k = 2, it is mainly based on the following considerations. Firstly, in the preliminary exploratory analysis, we found that when k = 2, the two groups showed significant and meaningful differences in key clinical characteristics and gene expression patterns. These differences are closely related to the core issues of our study and help to clearly illustrate the distinctions between the ICD high and low groups. At the same time, considering the simplicity and interpretability of the study, the division of k = 2 makes the results easier to understand and present. Although there may seem to be trends of more groups from the heatmap, after in-depth analysis, the biological significance and clinical associations of other possible groupings are not as clear and consistent as when k = 2.

      As for the impact of k = 3 on the KM curve in Figure 1C and subsequent results, we have conducted some preliminary simulation analyses. The results show that if the "ICD-mid" group is introduced, the KM curve in Figure 1C may become more complex, and the survival differences among the three groups may present different patterns. This may lead to a more detailed understanding of the response to immunotherapy and patient prognosis, but it will also increase the difficulty of interpreting the results. Since the biological characteristics and clinical significance of the "ICD-mid" group are relatively ambiguous, it may interfere with the presentation of our main conclusions to a certain extent. Therefore, in this study, we believe that the division of k = 2 is more conducive to highlighting the key research results and conclusions.
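
      The Delta area curve the reviewer asks about can be illustrated with a minimal sketch. It assumes that a consensus matrix (pairwise co-clustering frequencies between 0 and 1) has already been computed for each candidate k. The function names, the grid-based CDF approximation, and the convention that the first entry of the curve is the area itself are assumptions made for illustration, not the authors' pipeline.

```python
import numpy as np

def cdf_area(consensus, n_grid=101):
    """Approximate area under the empirical CDF of pairwise consensus values."""
    consensus = np.asarray(consensus, dtype=float)
    # Use each sample pair once (upper triangle, diagonal excluded).
    iu = np.triu_indices_from(consensus, k=1)
    values = consensus[iu]
    grid = np.linspace(0.0, 1.0, n_grid)
    # CDF(c) = fraction of pairs whose consensus value is <= c.
    cdf = np.array([(values <= c).mean() for c in grid])
    # Trapezoidal rule written out explicitly.
    return float(np.sum((cdf[1:] + cdf[:-1]) / 2.0 * np.diff(grid)))

def delta_area(areas):
    """Relative change in CDF area between consecutive k.

    `areas` is ordered by increasing k; by the convention assumed here,
    the first entry of the result is the first area itself.
    """
    areas = np.asarray(areas, dtype=float)
    delta = np.empty_like(areas)
    delta[0] = areas[0]
    delta[1:] = (areas[1:] - areas[:-1]) / areas[:-1]
    return delta
```

      A small relative gain in area when moving from k = 2 to k = 3 would argue for k = 2, while a large gain would suggest that a third group captures real structure.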

      Thank you again for your valuable comments. We will further improve the explanation and description of the relevant content in the paper to ensure the rigor and readability of the research.

      (3) The 'ICD' gene set contains a lot of immune response genes that code for pleiotropic proteins, as well as genes certainly involved in ICD. It is not convincing that the gene expression differences, and thus the DEGs, between the two groups are not simply "immune-response high" vs "immune-response low". For the DEG analysis, how many of the 34 genes in the ICD gene set are DEGs between the two groups? Of those, which markers of ICD are DEGs vs. those that are related to immune activation?

      a. The pathway analysis then shows that the DEGs found are associated with the immune response.

      b. Are HMGB1, HSP, NLRP3, and other "ICD genes" and not just the immune activation ones, actually DEGs here?

      c. Figures D, I-J are not legible in the manuscript.

      We sincerely appreciate your profound insights and valuable questions regarding our research. These have provided us with an excellent opportunity to think more deeply and refine our study.

      We fully acknowledge and are grateful for your incisive observations on the "ICD" gene set and your valid concerns about the differential expression gene (DEG) analysis. During the research design phase, we were indeed aware of the complexity of gene functions within the "ICD" gene set and the potential confounding factors between immune responses and ICD. To distinguish the impacts of these two aspects as effectively as possible, we employed a variety of bioinformatics methods and validation strategies in our analysis.

      Regarding the DEG analysis, among the 34 genes in the ICD gene set, 30 showed significant differential expression between the groups, excluding HMGB1, HSP90AA1, ATG5, and PIK3CA. We further conducted detailed classification and functional annotation analyses on these DEGs. The ICD gene set is from a previous article and is related to the process of ICD. Relevant literature is in the materials section.

      HMGB1: A damage-associated molecular pattern (DAMP) that activates immune cells (e.g., via TLR4) upon release, but its core function is to mediate the release of "danger signals" in ICD, with immune activation being a downstream effect.

      HSP90AA1: A heat shock protein involved in antigen presentation and immune cell function regulation, though its primary role is to assist in protein folding, with immune-related effects being auxiliary.

      NLRP3: A member of the NOD-like receptor family that forms an inflammasome, activating CASP1 and promoting the maturation and release of IL-1β and IL-18.

      Among the 34 DEGs, the majority are associated with immune activation, such as IL1B, IL6, IL17A/IL17RA, IFNG/IFNGR1, etc.

      (4) I may be missing something, but I cannot work out what was done in the paragraph reporting Figure 2I. Where is the ICB data from? How has this been analysed? What is the cohort? Where are the methods?

      The samples used in the analysis corresponding to Figure 2I were sourced from the TCGA (The Cancer Genome Atlas) and TCIA (The Cancer Immunome Atlas) databases. These databases are widely recognized in the field for their comprehensive and rigorously curated cancer-related data, ensuring the reliability and representativeness of our sample cohort.

      Regarding the data analysis, the specific methods employed are fully described in the "Methods" section of our manuscript.

      (5) How were the four genes for your risk model selected? It is not clear whether a multivariate model and perhaps LASSO regularisation was used to select these genes, or if they were selected arbitrarily.

      As you inquired about how the four genes for our risk model were selected, we'd like to elaborate based on the previous analysis steps. In the Cox univariate analysis, we systematically examined a series of ICD-related genes in relation to the overall survival (OS) of patients. Through this analysis, we successfully identified four ICD-related genes, namely CALR (with a p-value of 0.003), IFNB1 (p = 0.037), IFNG (p = 0.022), and IL1R1 (p = 0.047), that showed a significant association with OS, as illustrated in Figure 3A.

      Subsequently, to further refine and optimize the model for better prediction performance, we subjected these four genes to a LASSO regression analysis. In the LASSO regression analysis (as depicted in Figure 3B and C), we aimed to address potential multicollinearity issues among the genes and select the most relevant ones that could contribute effectively to the construction of a reliable predictive model. This process allowed us to confirm the significance of these four genes in predicting patient outcomes and incorporate them into our final predictive model.

      (6) How related are the high-risk and ICD-high groups? It is not clear. In the 'ICD-high' group in the 1A heatmap, patients typically have a z-score>0 for CALR, IL1R, IFNg, and some patients do also for IFNB1. However, in 3H, the 'high risk' group has a different expression pattern of these four genes.

      Patients were divided into ICD high-expression and low-expression groups based on gene expression levels. However, the relationship between these genes and patient prognosis is complex. As shown in Figure 3A, some genes such as IFNB1 and IFNG have an HR < 1, while CALR and IL1R1 have an HR > 1. Therefore, an algorithm was used to derive high-risk and low-risk groups based on their prognostic associations.
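
      The response does not name the algorithm used to derive the risk groups. One common construction, shown here purely as an assumption, is a linear risk score (a weighted sum of expression values with Cox-type coefficients) followed by a median split. The coefficient values below are hypothetical; only their signs follow the text (HR > 1 for CALR and IL1R1, HR < 1 for IFNB1 and IFNG).

```python
import numpy as np

# Hypothetical coefficients (log hazard ratios), chosen for illustration only.
# Signs follow the text: positive for HR > 1, negative for HR < 1.
COEFS = {"CALR": 0.30, "IL1R1": 0.20, "IFNB1": -0.25, "IFNG": -0.15}

def risk_scores(expr, genes):
    """Weighted sum of expression values.

    expr: (n_patients, n_genes) matrix whose columns are ordered as `genes`.
    """
    weights = np.array([COEFS[g] for g in genes])
    return np.asarray(expr, dtype=float) @ weights

def split_by_median(scores):
    """Label a patient 'high' if the score exceeds the cohort median, else 'low'."""
    scores = np.asarray(scores, dtype=float)
    return np.where(scores > np.median(scores), "high", "low")
```

      Under such a score, a patient with high CALR but also high IFNG can land in the low-risk group, which is one way the high-risk group could show an expression pattern that differs from the ICD-high cluster.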

      (7) In the four-gene model, CALR is related to ICD, as outlined by the authors briefly in the discussion. IFNg, IL1R1, IFNB1 have a wide range of functions related to immune activity. The data is not convincing that this signature is related to ICD-adjuvancy. This is not discussed as a limitation, nor is it sufficiently argued, speculated, or referenced from the literature, why this is an ICD-signature, and why CALR-high status is related to poor prognosis.

      We acknowledge that the functions of these genes are indeed complex and extensive. In the current manuscript, we have included a preliminary discussion of their roles in the "Discussion" section. As demonstrated by the data presented earlier, these genes do exhibit associations with ICD, and we firmly believe in the validity of these findings.

      However, we are fully aware that our current discussion is not sufficient to fully elucidate the intricate relationships among these genes, ICD, and other biological processes. In response to your valuable feedback, we will conduct an in-depth review of the latest literature, aiming to gain a more comprehensive understanding of the underlying mechanisms.

      (8) Score is spelt incorrectly in Figures 3F-J.

      Figures 3F-J have been revised as requested.

      (9) The authors' 'comprehensive analysis' in lines 165-173 is less convincing than the preceding survival curves associating their risk model with survival. Their 'correlations' have no statistics.

      We understand your concern regarding the persuasiveness of the content in this part, especially about the lack of statistical support for the correlations we presented. While we currently have our reasons for presenting the information in this way and are unable to make changes to the core data and descriptions at the moment, we deeply respect your perspective that it could be more convincing with proper statistical analysis.

      (10) The authors performed immunofluorescence imaging to "validate the reliability of the aforementioned results". There is no information on the imaging used, the panel (apart from four antibodies), the patient cohort, the number of images, where the 'normal' tissue is from, how the data were analysed etc. This data is not interpretable without this information.

      a. Is CD39 in the panel? CD8, LAG3? It's not clear what this analysis is.

      The color of each antibody has been marked in Fig 2B. The cohort information and its source have been supplemented. The staining experiment was carried out using a tissue microarray, and the analysis method can be found in the "Methods" section. Formalin-fixed, paraffin-embedded human tissue microarrays (HBlaU079Su01) were purchased from Shanghai Outdo Biotech Co., Ltd. (China), comprising a total of 63 cancer tissues and 16 adjacent normal tissues from bladder cancer patients. Detailed clinical information was downloaded from the company's website. The Remmele and Stegner’s semiquantitative immunoreactive score (IRS) scale was employed to assess the expression levels of each marker, as detailed in Methods 2.5. CD39, CD8, and LAG3 were also stained, but the results were not presented.
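
      For readers unfamiliar with the IRS, the sketch below follows its commonly cited form: a staining-intensity grade (0 to 3) multiplied by a grade for the percentage of positive cells (0 to 4), giving a score from 0 to 12. The percentage cut-offs and the function names are assumptions based on that common form; the cut-offs the authors actually applied are those in their Methods 2.5 and may differ.

```python
def percent_positive_grade(pct):
    """Grade the percentage of positive cells on a 0-4 scale (assumed cut-offs)."""
    if not 0 <= pct <= 100:
        raise ValueError("pct must be between 0 and 100")
    if pct == 0:
        return 0
    if pct < 10:
        return 1
    if pct <= 50:
        return 2
    if pct <= 80:
        return 3
    return 4

def immunoreactive_score(intensity, pct):
    """IRS = staining intensity grade (0-3) x percent-positive grade (0-4)."""
    if intensity not in (0, 1, 2, 3):
        raise ValueError("intensity must be 0, 1, 2, or 3")
    return intensity * percent_positive_grade(pct)
```

      For example, moderate staining (grade 2) in 30% of cells gives `immunoreactive_score(2, 30)`, a score of 4 under these assumed cut-offs.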

      (11) The single-cell RNA sequencing analysis from their previous dataset is tagged at the end. CALR expression in most identified cells is interesting. Not clear what this adds to the work beyond 'we did scRNA-seq'. How were these data analysed? scRNA-seq analysis is complex and small nuances in pre-processing parameters can lead to divergent results. The details of such analysis are required!

      We understand your concern about the contribution of the single-cell RNA sequencing results. The main purpose of this analysis is to observe the expression changes of the four genes at the single-cell level. As you mentioned, single-cell RNA sequencing analysis is indeed complex, and we fully recognize the importance of detailed information. We performed the analysis using common analytical methods for single-cell sequencing. It has been supplemented in the Methods section.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Below is a point-by-point response to the reviewers' concerns.

      Main changes are colored in red in the revised manuscript.

      Reviewer #1 (Significance (Required)):

      General assessment:

      This study provides a valuable computational framework for investigating the dynamic interplay between DNA replication and 3D genome architecture. The current implementation, however, focuses on Saccharomyces cerevisiae, whose genome organization differs significantly from that of mammalian systems.

      Advance: providing the first in vivo experimental evidence in investigating the role(s) of Cohesin and Ctf4 in the coupling of sister replication forks.

      Audience: broad interests; including DNA replication, 3D genome structure, and basic research

      Expertise: DNA replication and DNA damage repair within the chromatin environment.

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      By developing a new genome-wide 3D polymer simulation framework, D'Asaro et al. investigated the spatiotemporal interplay between DNA replication and chromatin organization in budding yeast: (1) The simulations recapitulate fountain-like chromatin patterns around early replication origins, driven by colocalized sister replication forks. These findings align with Repli-HiC observations in human and mouse cells, yet the authors advance the field by demonstrating that these patterns are independent of Cohesin and Ctf4, underscoring replication itself as the primary driver. (2) Simulations reveal a replication "wave" where forks initially cluster near the spindle pole body (SPB) and redistribute during S-phase. While this spatial reorganization mirrors microscopy-derived replication foci (RFis), discrepancies in cluster sizes compared to super-resolution data suggest unresolved mechanistic nuances. (3) Replication transiently reduces chromatin mobility, attributed to sister chromatid intertwining rather than active forks.

      This work bridges replication timing, 3D genome architecture, and chromatin dynamics, offering a quantitative framework to dissect replication-driven structural changes. This work provides additional insights into how replication shapes nuclear organization and vice versa, with implications for genome stability and regulation.

      We thank Reviewer 1 for her/his enthusiasm and her/his comments, which helped us greatly improve the manuscript.

      However, the following revisions could strengthen the manuscript:

      Major:

      Generalizability to Other Species

      While the model successfully recapitulates yeast replication, its applicability to larger genomes (e.g., mammals) remains unclear. Testing the model against data (Repli-HiC/in situ Hi-C and Repli-seq) from other eukaryotes (particularly mammalian cells) could enhance its broader relevance.

      We agree with the reviewer that testing the model in higher eukaryotes would be highly informative. The availability of Repli-HiC on one hand and higher resolution microscopy on the other could enable insightful quantitative analyses. With our formalism, it is in principle already possible to capture realistic 1D replication dynamics as the integrated mathematical formalism (by Arbona et al. ref. [63]) was already used to model human genome S-phase. In addition, the formalism developed for chain duplication is generic and can be contextualized to any species. However, when addressing the problem in 3D, we would likely require including other crucial structural features such as TADs or compartments. Such a model would require an extensive characterization worthy of its own publication. These considerations are now mentioned in the Discussion as exciting future perspectives (Page 17).

      On the other hand, we would like to highlight that, while very minimal in many aspects, our model includes many layers of complexity (explicit replication, different fork interactions, stochastic 1D replication dynamics, physical constraints at the nuclear level). In addition, addressing this problem in budding yeast offers the great advantage of simultaneously capturing both the local and global spatio-temporal properties of DNA replication, and of focusing first on those aspects alone rather than on the interplay with other mechanisms, such as A/B compartmentalization (absent in yeast), that may confound the data analysis and the comparison with experimental data. Studying such an interplay is a very important and challenging question that, we believe, goes beyond the scope of the present work.

      Validation with Repli-HiC or Time-Resolved Techniques

      The Hi-C data in early S-phase supports the model, but the intensity of replication-specific chromatin interactions is faint, which could be further validated using Repli-HiC, which captures interactions around replication forks. Alternatively, ChIA-PET or HiChIP targeting core component(s) (eg. PCNA or GINS) of replisomes may also solidify the coupling of sister replication forks.

      We thank the reviewer for the suggestion. Unfortunately, corroborating our HiC results using Repli-HiC or HiChIP would require developing and adapting the protocols to budding yeast which is well beyond the scope of this work mainly focused on computational modelling. In addition, we believe that the signature found in our Hi-C data is clear and significant enough to demonstrate the effect.

      However, we included in the Discussion (Page 15) a more detailed description on how our work compares with the Repli-HiC study in mammals. In particular, we added a new supplementary figure (new Fig. S23) where we discuss our prediction on how Repli-HiC maps would appear in yeast in both scenarios of sister-forks interaction. Interestingly, we find that:

      1) Fountain signals are strongly enhanced when sister forks interact.

      2) Only mild replication dependent enrichment is detected when diverging forks do not interact.

      These two results imply that disrupting the putative sister-fork interaction would have a much more drastic effect on Repli-HiC than on Hi-C.

      Interactions Between Convergent Forks

      The study focuses on sister-forks but overlooks convergent forks (forks moving toward each other from adjacent origins), whose coupling has been observed in Repli-HiC. Could the simulation detect the coupling of convergent fork dynamics?

      We thank the reviewer for this suggestion. We included in our Hi-C analysis aggregate plots around termination sites. Interestingly, no clear signature of coupling between convergent forks (such as type II fountains in mammals) was detected in vivo or in silico. Similarly, from visual inspection of individual termination sites, no fountains were clearly observed. These results can be found in the new Fig. S24 and possible mechanistic explanations are described in more detail in the Discussion (Page 15).

      Unexpected Increase in Fountain Intensity in Cohesin/Ctf4 Knockouts.

      In Fig. 3A, a schematic illustrating the cell treatment would improve clarity. In Scc1- and Ctf4-depleted cells, fountain signals persist or even intensify (Fig. 3A). This counterintuitive result warrants deeper investigation. Could the authors provide any suggestions or discussions? Potential explanations may include:

      Compensatory mechanisms (e.g., other replisome proteins stabilizing sister-forks).

      Altered chromatin mobility in mutants, enhancing Hi-C signal resolution.

      Artifacts from incomplete depletion (western blots for Scc1/Ctf4 levels should be included).

      A scheme illustrating the experimental protocol for degron systems (CDC45-miniAID & SCC1-V5-AID) with the corresponding western blots and cell-cycle progression are shown in Fig. S26. Note that for Ctf4, we are using a KO cell line where the gene was deleted.

      We do agree with the reviewer that there are several possible explanations for the differences between WT fountains and those observed in mutants. In the revised manuscript, we discussed some of them in Section 2 II B (Page 8):

      (1) As already suggested in the paper, asynchronization of cells may impact the intensity of the fountains due to a dilution effect mediated by the cells still in G1. Therefore, possible differences in the fractions of replicating/non-replicating cells between the different experiments (new Fig. S7C) would also result in differences in the signal. Moreover, it is important to highlight that aggregate plots are normalized (Observed/Expected) by the average signal (P(s)). Therefore, as Scc1-depleted cells do not exhibit cohesin-mediated loop-extrusion (see aggregate plots around CARs in new Fig. S7B), we may expect an enhancement of signal at origins due to dividing each pixel by a lower contact frequency with respect to the one found in WT.

      (2) In the new Fig. S10, we plotted the relative enrichment of Hi-C reads around origins. While we already used the same approach to compare replicon sizes between simulations and experiments (see Fig S7A and response to comment n°9 of Reviewer 3), this analysis is also instructive when comparing different experimental conditions. While we find that the experiments in WT and Scc1-depleted cells show very similar replicon sizes, we do observe a small increase in the peak height for the cohesin mutant. This may also partially explain differences in the intensity of the fountain. For ctf4Δ, we observe significantly smaller replicons. We speculate that such a mutant might exhibit slower replication and consequently might be enriched in sister-fork contacts.

      (3) Compensatory mechanisms: we now briefly discuss this in the Discussion (Page 15).
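
      The Observed/Expected normalization invoked in point (1) can be sketched as follows. This is a minimal illustration that assumes a square, single-chromosome contact matrix and estimates the expected signal as the mean of each diagonal; the function names are hypothetical and this is not the authors' analysis code.

```python
import numpy as np

def expected_by_distance(contacts):
    """Mean contact frequency at each separation s = |i - j| (an estimate of P(s))."""
    contacts = np.asarray(contacts, dtype=float)
    n = contacts.shape[0]
    return np.array([np.nanmean(np.diagonal(contacts, offset=s)) for s in range(n)])

def observed_over_expected(contacts):
    """Divide every pixel (i, j) by the average signal at separation |i - j|."""
    contacts = np.asarray(contacts, dtype=float)
    n = contacts.shape[0]
    p_s = expected_by_distance(contacts)
    idx = np.arange(n)
    # expected[i, j] = P(|i - j|)
    expected = p_s[np.abs(idx[:, None] - idx[None, :])]
    with np.errstate(divide="ignore", invalid="ignore"):
        return contacts / expected
```

      Because each pixel is divided by the average signal at the same separation, a fixed number of origin-specific contacts yields a larger ratio when that average is lower, which is the enhancement the authors expect in Scc1-depleted cells.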

      Inconsistent Figure References

      Several figure citations are mismatched. For instance, Fig. S1A has not been cited in the manuscript. Moreover, there is no Fig.1E in figure 1, while it has been cited in the text. All figure/panel references must be cross-checked and corrected.

      We thank the reviewer for this observation. We have now corrected the mismatches.

      Minor:

      Page 2: "While G1 chromosomes lack of structural features such as TADs or loops [3]" However, Micro-C captures chromatin loops, although much smaller than those in mammalian cells, within budding yeast.

      Loops of approximately 20-40 kb are found in interphase in budding yeast, but only after the onset of S-phase (refs. [52-61]). For this reason, our G1 model of yeast without loops captures the experimental P(s) curves well (Fig. S2). See also the answer to point 12 of Reviewer 2.

      In figure 2E, chromatin fountain signals can be readily observed in the fork coupling situation and movement can also be observed. However, the authors should indicate the location of DNA replication termination sites and show some examples at certain loci but not only the aggregated analysis.

      The initial use of aggregate plots was motivated by the fact that fountains are quite difficult to observe at the single origin level in the experimental Hi-C due to the strong intensity of surrounding contacts (along the diagonal). However, when dividing early-S phase maps by the corresponding G1 map, we can now observe a clear correlation between origin and fountain positions on such normalized maps. We now added an example for chromosome 7 in Fig. 3 indicating early/late origins.

      In Fig. S8 and S9 (where we also included termination sites), we show that fountains are prominently found at origins during S-phase and are lost in G2/M.

      Reviewer #2 (Significance (Required)):

      The topic is relevant and the problem being addressed is very interesting. While there has been some earlier work in this area, the polymer simulation approach used here is novel. The simulation methodology is technically sound and appropriate for the problem. Results are novel. The authors compare their simulations with experimental data and explore both interacting and non-interacting replication forks. Most conclusions are supported by the data presented.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      The manuscript by D'Asaro et al. investigates the relationship between DNA replication and chromatin organization using polymer simulations. While this is primarily a simulation-based study, the authors also present relevant comparisons with experimental data and explore mechanistic aspects of replication fork interactions.

      We thank Reviewer 2 for her/his positive evaluation of our work and her/his suggestions, which helped us clarify many aspects of our manuscript.

      The primary weakness is that many aspects are not clear from the manuscript. Below is a list of questions that the authors must clarify:

      In the Model and Methods section, it is written "Arbitrarily, we choose the backbone to be divided into two equally long arms, in random directions." It is unclear what is meant by "backbone to be divided" and "two equally long arms." Does this refer to replication?

      We agree with the reviewer that the term backbone may be ambiguous. In the context of the initialization of the polymer, it refers to the L/4 initial bonds used to recursively build an unknotted polymer chain of final size L using the Hedgehog algorithm (see refs [101,109]). As shown in Fig. S1A, these initial L/4 bonds define the initial backbone of each chromosome before they are recursively grown to their final size. We chose to divide them into two branches (called “arms” in the old version of the manuscript) of equal length (L/8) and with random orientations. To avoid any ambiguity between the term arm used in that context and the chromosome arms in a biological sense (sequences on the left and right with respect to centromeres), we changed it to “linear branches” to improve clarity. We highlighted in Fig. S1A two examples of such a “V-shaped” backbone.

      As stated in the text, these initial configurations are artificial and just aim to generate unknotted, random structures. After initiating the structures, we then added the geometrical constraints to the centromeric, telomeric and rDNA beads. This, combined with the tendency of the polymer to explore and fill the spherical volume, determine the relaxed G1-like state (see Fig. S2) obtained after an equilibration stage (corresponding to 10^7 MCS). Only after that initialization protocol, DNA replication is activated.

      In chromosome 12, since the length inside the nucleolus (rDNA) is finite, the entry and exit points should be constrained. Have the authors applied any relevant constraint in the model?

      Indeed, we did not introduce any specific constraint on the relative distance between rDNA boundary monomers in our model. They can therefore freely diffuse, independently of each other, on the nucleolus surface. This point is now clarified in the text. Note that, in this paper, we did not aim to finely describe the rDNA organization and its interactions with the rest of the genome, which is why we did not explicitly model rDNA. Moreover, to the best of our knowledge, no experimental data are available to tune such additional restraints.

      Previous models such as Tjong et al. (ref. [66]) and Di Stefano et al. (ref. [67]) have used approximations very similar to ours. In the works of Wong et al. (ref. [61]) and Arbona et al. (ref. [63]), rDNA is explicitly modelled via larger/thicker beads/segments, which accounts for some generic polymer-based constraints between rDNA boundary elements.

      However, note that all these different models, including ours, still correctly predict the strong depletion of contacts between rDNA boundaries, indicating that there exists a spatial separation between the two boundary elements that is qualitatively well captured by our model (See Fig. S1 D and Fig. 1B).

      What is the rationale for normalizing the experimental and simulation results by dividing by the respective P_intra(s = 10 kb)?

      This normalization was used in Fig. 1 to obtain a rescaling between experiments and simulations. This approach assumes that simulated and experimental Hi-C maps are proportional by a factor that, in Fig 1B, was set to P_exp(s=16kb)/P_sim(s=16kb). Similar strategies are used in a number of modeling studies (for example ref. [103,106]).

      We use the average contact frequency (P_intra) at this genomic scale (s on the order of tens of kb) because our polymer simulations capture the experimental P(s) decay well above this scale. This method allows us to plot the two signals with the same color scale and gives a qualitative, visual intuition of the quality of the modeling. Note that the normalization has no impact on the Pearson correlation given in the text. More generally, it allows a semi-quantitative comparison of predicted and experimental Hi-C data.

      In Fig. 1D, we instead normalize the average signal between pairs of centromeres (inter-chromosomal aggregate plot off-diagonal) by the average P_intra(s=10kb). This method allows estimating how frequently centromeres of different chromosomes are in contact relative to intra-chromosomal contacts at the chosen scale (10 kb). In the new paragraph “Comparison with in vivo HiC maps in G1” (Page 22), we describe in more detail the quantitative insights that can be recovered from such analysis.

      As a comparison, such normalization is not required when computing Observed/Expected maps (Fig. 1C or aggregate plots in Fig. 2 and Fig. 3), as simulation and experimental maps are each normalized by their own P(s) curve. We now clarify this aspect in the Materials and Methods under the paragraph “Comparison between on diagonal aggregate plots” (Page 22).
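
As an illustration only (this is not the authors' pipeline), the rescaling and the observed-over-expected normalization discussed in this response could be sketched as follows. The toy maps, the bin counts and the function names `contact_probability`, `rescale_to_experiment` and `observed_over_expected` are all hypothetical; the reference scale is given in bins rather than kb because the bin size is not stated here.

```python
import numpy as np

def contact_probability(contact_map, s_bins):
    """Mean contact frequency at a genomic separation of `s_bins` bins."""
    return float(np.mean(np.diagonal(contact_map, offset=s_bins)))

def rescale_to_experiment(sim_map, exp_map, s_bins):
    """Multiply the simulated map by P_exp(s) / P_sim(s) at one reference scale."""
    factor = contact_probability(exp_map, s_bins) / contact_probability(sim_map, s_bins)
    return sim_map * factor

def observed_over_expected(contact_map):
    """Divide each diagonal of a symmetric map by its own mean, i.e. by P(s)."""
    n = contact_map.shape[0]
    oe = np.zeros((n, n), dtype=float)
    for s in range(n):
        expected = contact_probability(contact_map, s)
        if expected > 0:
            i = np.arange(n - s)
            oe[i, i + s] = contact_map[i, i + s] / expected
            oe[i + s, i] = contact_map[i + s, i] / expected
    return oe

# Toy "experimental" and "simulated" maps (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
n_bins, ref_bins = 60, 5
sep = np.abs(np.subtract.outer(np.arange(n_bins), np.arange(n_bins)))
noise = rng.uniform(0.9, 1.1, size=(n_bins, n_bins))
noise = (noise + noise.T) / 2                # keep the toy map symmetric
exp_map = noise / (1.0 + sep)
sim_map = 0.2 / (1.0 + sep) ** 1.1           # different overall scale and decay

rescaled = rescale_to_experiment(sim_map, exp_map, ref_bins)
pearson_raw = np.corrcoef(sim_map.ravel(), exp_map.ravel())[0, 1]
pearson_rescaled = np.corrcoef(rescaled.ravel(), exp_map.ravel())[0, 1]
```

In this sketch the Pearson correlation is unchanged by the global rescaling, and the observed-over-expected map of the simulation is identical before and after rescaling, which is why no common normalization factor is needed for that type of comparison.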

      In the sentence "For instance, chromosomes are strictly bound by the strong potential to localize between 250 and 320 nm from the SPB," is it 320 or 325 nm? Is there a typo?

      We confirm that the upper bound is indeed 325 nm as stated in Eq.2 and not 320 nm.

      Please list the number of beads in each chromosome and the location of the centromere beads.

      A new table (Table S2) was included to list the number of beads and the centromere position for each chromosome.

      In Eq. 7, when the Euclidean distance between the sister forks d_ij > 50 nm, the energy becomes more and more negative. This implies that the preferred state of sister forks is at distances much greater than 50 nm. Then how is "co-localization of sister forks" maintained?

      We corrected the sign typo in Eq. 7. The corrected equation, without the minus sign and consistent with what was simulated, implies that sister forks tend to minimize their 3D distance. The term goes to zero when their distance is within 40 nm (2 nearest-neighbouring sites).

      The section on "non-specific fork interactions" is unclear. You state that the interaction is between "all the replication forks in the system," but f_ij is non-zero only for second nearest-neighbors. The whole subsection needs clarification.

      We corrected the text, specifying that the energy is non-zero for both first and second neighbours. In practice, two given forks do not experience any attractive energy unless their 3D distance is within 2 nearest-neighbour sites. To clarify this aspect, we now describe in more detail in the Methods how non-specific fork interactions are implemented on the lattice during the KMC algorithm. We also included a new supplementary image (Fig. S15), where we schematize how forks move in 3D and how changes in their position update the table that tracks the number of forks around each lattice site.

      Eq. 6 has no H_{sister-forks}. Is this a typo?

      We confirm that it is a typo and the formula was corrected to H_{sister-forks}.

      While discussing the published work, the authors may cite the recent paper [https://doi.org/10.1103/PhysRevE.111.054413].

      The reference is now included when discussing previous polymer models of DNA replication.

      It is not clear how the authors actually increase the length of new DNA in a time-dependent manner. For example, when a new monomer is added near the replication origin (green bead in Fig. 3C), what happens to the red and blue polymer segments? Do they get shifted? How do the authors take into account self-avoidance while adding a new monomer? These details are not clear.

      The detailed description of the chain duplication algorithm and its systematic analysis was performed in our previous study (ref. [25]).

      However, we agree with the reviewer that, to make the present manuscript more self-contained, more details must be included (see also the answer to comment 1 of Reviewer 3). In particular, we now highlight in Materials and Methods that self-avoidance is indeed temporarily broken when we add a newly replicated monomer on top of the site where the fork is. Such double occupancy in the lattice rapidly vanishes due to 3D local moves. We refer to our PRX work (ref. [25]), and in particular to the following figure (extracted from Fig. S1 in ref. [25]), which illustrates how the bonds/segments of the two sister chromatids are consistently maintained.

      How do the authors ensure that monomers get added at a rate corresponding to velocity v? The manuscript mentions "1 MCS = 0.075 msec," but in how many MC steps is a new monomer added? How is it decided?

      Similarly to origin firing, replication by fork movement along the genome occurs stochastically, with a rate that we derive by converting the physiological fork speed in yeast, 2.2 kb/min (ref. [41]), into units of monomers per MCS. In practice, we generate a random number that, if smaller than this rate, leads to fork duplication. We clarify this aspect in the Materials and Methods, also referring to our previous work for a more detailed summary.
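
The unit conversion described in this response can be made concrete with a small sketch. The fork speed (2.2 kb/min) and the time mapping (1 MCS = 0.075 msec) are taken from the text above; the monomer size of 1 kb is an assumption made here purely for illustration, and the function names are hypothetical rather than taken from the authors' code.

```python
import random

MCS_IN_MS = 0.075             # from the text: 1 MCS = 0.075 msec
FORK_SPEED_KB_PER_MIN = 2.2   # from the text (ref. [41])
MONOMER_SIZE_KB = 1.0         # ASSUMPTION: genomic size of one monomer, not stated here

def fork_rate_per_mcs(speed_kb_per_min, mcs_in_ms, monomer_kb):
    """Convert a fork speed in kb/min into a probability of advancing per MCS."""
    ms_per_min = 60_000.0
    kb_per_mcs = speed_kb_per_min * mcs_in_ms / ms_per_min
    return kb_per_mcs / monomer_kb

def fork_advances(rate, rng):
    """One stochastic trial: the fork duplicates a monomer if a uniform draw is below the rate."""
    return rng.random() < rate

rate = fork_rate_per_mcs(FORK_SPEED_KB_PER_MIN, MCS_IN_MS, MONOMER_SIZE_KB)
mean_wait_seconds = (1.0 / rate) * MCS_IN_MS / 1000.0  # average time to replicate one monomer
```

Under the 1 kb assumption the rate is 2.75e-6 per MCS, so a fork advances on average once every ~3.6e5 MCS, i.e. roughly every 27 s, which is consistent with 2.2 kb/min.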

      The authors stress the relevance of loop extrusion. However, in their polymer simulation, the newly replicated chromatin does not form any loops. Is this consistent with what is known?

      Indeed, our simulations do not have any concurrent extrusion mechanism such as cohesin-mediated loops. This choice was purposely made to isolate and characterize replication-dependent effects.

      That is why we compare our predictions on chromatin fountain patterns (Fig. 3) with data obtained for the Scc1 mutant strain, where cohesin is absent, in order to disentangle the possible interference with loop-extruding cohesin. For subsection C, where microscopy data are available only in the WT condition, we cannot rule out that the observed discrepancies between experiments and predictions are due to missing mechanisms, including loop extrusion. This was already mentioned in the Discussion (Page 16). It is however unclear whether sparse and small loops between CARs (see Fig. S7B) in S-phase could be sufficient to recapitulate the microscopy estimates of the sizes of replication foci, and no clear signature of inter-origin loops (possibly mediated by loop extrusion) is observed in Hi-C data in WT and Scc1-deficient conditions.

      Moreover, as mentioned in the Discussion, the poorly characterized mechanisms behind encounters between forks and extruding cohesin do not allow for a straightforward modelling of such processes, whose accurate description/simulation would require its own study.

      Please add a color bar to Fig. 4B.

      The color bar was included.

      In the MSD plot (Fig. 6), even though it appears to be a log-log plot, the exponents are not computed. Typically, exponents define the dynamics.

      In Fig. 6, we now plot the expected 0.5 exponent at short time-scales, as mentioned in the main text; it was previously shown only in what is now Fig. S19A.
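
For readers who want to extract the exponent explicitly, a minimal sketch of a log-log fit is given below. It uses synthetic data with a built-in exponent of 0.5 (the Rouse value mentioned above); the prefactor, the lag-time range and the function name `msd_exponent` are hypothetical, and this is not the analysis code used for Fig. 6.

```python
import numpy as np

def msd_exponent(lag_times, msd):
    """Estimate alpha in MSD ~ t**alpha as the slope of a straight-line fit in log-log space."""
    slope, _intercept = np.polyfit(np.log(lag_times), np.log(msd), 1)
    return float(slope)

# Synthetic Rouse-like MSD (hypothetical prefactor and time range).
lags = np.logspace(0, 3, 30)
msd = 0.01 * lags ** 0.5
alpha = msd_exponent(lags, msd)
```

On real trajectories the fit would normally be restricted to the time window where a single regime is expected, since the apparent exponent changes across crossovers.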

      The dynamics will depend on the precise nature of interactions, such as the presence or absence of loop extrusion. If the authors present dynamics without extrusion, is it likely to be correct?

      The reviewer is correct in highlighting that our model does not capture the potential decrease in dynamics due to cohesin-mediated loop extrusion. However, our model does capture the expected Rouse regime (see Fig. 6A, S19A and ref [83]), which justifies our time-mapping strategy. In comment 16 of Reviewer 3, we discuss in more detail the robustness of our results with respect to variations in such a mapping. In the specific context of Fig. 6A, we predict the gradual decrease in dynamics due to sister chromatid intertwining independently of any cohesin-associated activity (both loop-extruding and cohesive). As loop extrusion also decreases chromatin mobility overall (ref. [87]), if such a decrease in mobility is observed in WT in vivo, it may indeed be difficult to assign it to replication rather than loop extrusion. That is why in the Discussion (Page 16), we propose to compare our prediction to experiments in cohesin-depleted cells. In the context of Fig. 6B&C, we don’t expect loop extrusion to be a confounding effect as the predicted decrease in dynamics is specific to forks.

      Reviewer #3 (Significance (Required)):

      The work has been conducted thoroughly, and in general the paper is well written with good attention to detail. As far as I am aware, this is the first study where replication is simulated in a whole nucleus context, and the scale of the simulations is impressive. This allows the authors to address questions on replication foci and the spatiotemporal organisation of replication which would not be possible with more limited simulations, and to compare the model with previous experimental work. Together with the new HiC data, I think this makes for a strong paper which will be of interest to biophysics and molecular biology researchers; the manuscript is written such that it would suit an interdisciplinary basic research audience.

      We thank Reviewer 3 for her/his enthusiasm and her/his comments that help us to greatly improve the manuscript.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      The paper "Genome-wide modelling of DNA replication in space and time confirms the emergence of replication specific patterns in vivo in eukaryotes" by D'Asaro et al. presents new computational and experimental results on the dynamics of genome replication in yeast. The authors present whole-nucleus scale simulations using a kinetic Monte Carlo polymer physics model. New HiC data for synchronised yeast samples with different protein knock-downs are also presented.

      The main questions which the paper addresses are whether sister forks remain associated during replication, whether there is more general clustering of replication forks, and whether replication occurs in a 'spatial wave' through the nucleus. While the authors' model data are not able to conclusively show whether sister forks remain co-localised, the work provides some important insights which will be of high interest to the field.

      I have no major issues with the paper, only some minor comments and suggestions to improve the readability of the manuscript or provide additional detail which will be of interest to readers. I list these here in the order in which they appear in the paper. There are also a number of typos and grammatical issues through the text, so I recommend thorough proofreading.

      The paper seems to be aimed at a broad interdisciplinary audience of biophysicists and molecular biologists. For this reason, the introduction could be expanded slightly to include some more background on DNA replication, the key players and terminology. Also, it seems that this work builds on previous modelling work (Ref. 19), so a bit more detail of what was done there, and what is new here would be helpful. The final paragraph of the introduction mentions chromosome features such as TADs and loops, which should be explained in more detail.

      We now have expanded the introduction to address some of these aspects. In particular, also as a response to comment 1 of Reviewer 4, we included additional background on the eukaryotic replication time program. We address in more detail its known interplay and correlation with crucial 3D structural features such as compartments and TADs. Finally, we add a sentence to clarify how the current work is distinct from the prior implementation and the novelty introduced here.

      In the first results section, end of p2, the "typical brush-like architecture" is mentioned. This is not well explained, some additional detail or a diagram might help.

      As very briefly summarized in the mentioned paragraph, the yeast genome is organized in the so-called Rabl organization where chromosome arms are all connected via the centromeres at the Spindle Pole Body (SPB). This is analogous to the definition of a polymer brush where several branches (the arms in this case), are grafted to a surface or to another polymer (see new Inset panel in Fig S1B). We refer in the main text to the scheme in Fig. S1B where we also include the snapshot of a single chromosome and the physical constraints that characterize this large-scale organization and extend the caption to clarify the analogy. A typical emerging feature at the single chromosome level is described in Fig. 1 B and C.

      On p3-4, some previous work is described, and Pearson correlations of 0.86 and 0.94 are mentioned. Which cases these two values correspond to is not clear.

      These Pearson correlations are obtained for our own modeling. We correct the values in the main text and more clearly indicate the specific correspondence with the maps used. We describe now in the Materials and Methods (new paragraph “Comparison with in vivo HiC maps in G1” and Table S2) how these values were obtained.

      In section II-A-2, on the modelling details, it should be made clearer that the nucleus volume is kept constant, and that this is an approximation since typically the nucleus grows during S-phase. This is discussed in the Methods section, but it would be useful to also mention it here (and give some justification why it will not likely change the results).

      We now state more clearly in the main text the limitation of our model regarding the doubling of DNA content without any increase of nuclear size. As mentioned in the Discussion, we do not expect this approximation to strongly impact our results, which mainly focus on early S-phase.

      We now also included in the Discussion how the detection of the “replication wave” should be qualitatively independent of the density regime. In fact, even in the case of growing nuclei and constant density, the polarity induced by the Rabl organization and replication timing are the main drivers of such fork redistribution.

      Regarding the slowdown in diffusion due to sister chromatid intertwining (see response to comment 13), we instead verified that the effect is indeed density-independent (new Fig. S21).

      Fig 2. The text in Fig 2B is much smaller than other panels and difficult to read. Also Fig 3B, Fig 6.

      This is now corrected.

      In 2E, are the times given above each map the range which is averaged over? This could be clearer in the caption. In the caption it stated that these are 'observed over expected'; what the 'expected' is could be clearer.

      We reformulated the description in the caption to make clearer that the times indicated above the plots indicate the time window used for the computation. As mentioned in more detail in the response to comment 17 below (and comment 3 of Reviewer 2), we included in the Materials and Methods a more precise description of the normalization used in the case of on-diagonal aggregate plots (observed-over-expected).

      In section II-B-2, the authors state that the cells are fixed 20 mins after release from S-phase. Can they comment on the rationale behind this choice, since from Fig 2 their simulations predict that the fountain pattern will no-longer be visible by that time.

      In the experimental setup, cells are arrested in G1 with alpha-factor and then released into S-phase (see Fig. S26 with the corresponding scheme). The release from G1 synchronisation is not immediate, and staging of cells by flow cytometry every 5 minutes for 30 minutes after release (data not shown in the main text but provided below) showed 20 minutes to be an adequate early S-phase timepoint (Page 17 in the Materials and Methods). As a consequence, the times indicated when describing the in vivo experiment do not correspond to the ones indicated in our in silico system, for which the onset of replication is well defined. For these reasons, we have to determine which time window among the ones used in Fig. 2E is the most appropriate to compare with the experiment (see response to comment 9 for more details).

      Fig.R1: Cell cycle progression monitored by flow cytometry after the release. For the first 15 minutes, cells are still mainly in G1 and only start replicating ~20 minutes after the release.

      Section II-B-2(b) could be clearer. I don't understand what the conclusion the authors take from the metaphase arrest maps is. I'm not sure why they discuss again the Cdc45-depleted cells here, since this was already covered in the previous section.

      Taken together, the G1, Cdc20 (metaphase-arrested cells), and Cdc45-depleted (early S cells but not replicated) conditions suggest that fountains reflect ongoing replication. Namely, G1-arrest shows that fountains require S-phase entry; Cdc45-depletion shows that fountains require origin firing and are not due to another S-phase event; and metaphase-arrested cells show that fountains are not permanent structures established by replication, but transient replication-dependent structures.

      This demonstrates that the emerging signal is not trivially dependent on (1) the presence of the second sister chromatids; or on (2) potential overlaps between origin positions and barriers (CARs) to loop extrusion (see also comment 12 of Reviewer 2). A sentence at the end of II-a was added to clarify the different information gained with the two strains.

      We discuss the cdc20 and cdc45 mutants again in II-b to highlight that the results in II-a do not exclude a potential interplay between cohesin-mediated loop extrusion and fork progression. These considerations motivated our experiment in Scc1-depleted cells during early S-phase.

      At the start of p8 (II-B-3) there is a discussion of the mapping of times to the early-S stage experiments. This could have more explanation. I don't follow what the issue is, or the process which has been used to do the mapping. From Fig 2B, it seems that the simulation time is already mapped well to real time.

      As mentioned above in comment 7, we cannot clearly define a “t=0” when replication starts in vivo as the release from the G1-arrest is not immediate and perfectly synchronous. On the other hand, the times indicated within the text are those following the onset of polymer self-duplication in our simulations. Note that the mean replication time (MRT) shown in Fig.2B does not represent an absolute time, but rather an average relative timing along S-phase (signal rescaled between 0 and 1).

      For all these considerations, we think that the most reliable strategy to compare fountains in vivo and in silico is to look at the replicon size via the enrichment in raw contacts around early origins, as illustrated in Fig S7A. In practice, looking at the relative counts of contacts around early origins we have a proxy for the average replicon size that we can match by computing the same analysis on simulated signals (Fig S7A). As a result, we find that the best simulated time window is between 5 and 7.5 minutes, compatible with early-S phase and with an approximate duration of G1 after release of 15 minutes as observed in other studies (ref. [61]).

      Note that our conclusions are robust with respect to modulating this mapping method. In particular in Fig. S7, we thoroughly investigated how several confounding factors (such as time window used or partial synchronization) may impact the quantitative nature of our prediction without affecting the qualitative insights.

      We included a more precise reference to the Supplementary Materials, where the approach is described and clarified.

      In Fig 4A above each plot there is a cartoon showing the fork scenario. The left-hand cartoon is rendered properly, but the right-hand one has overlapping black boxes which I don't think should be there. These black boxes are present in many other figures (4B, 3B, 2E etc).

      This issue seems to appear using the default PDF viewer on Mac OS. We have corrected the problem and no more black boxes should appear in the main text and in the Supplementary Material.

      In II-C-2(b) it is mentioned that the number of forks within RFis is always assumed to be even. This discussion could be clearer. In particular, the authors state that under both fork scenarios, in the simulations they can detect odd numbers of forks within RFis - how can this happen in the case where sister forks are held together?

      We included a more accurate description in the main text of why Saner et al. (ref. [20]) make these assumptions in their estimates. We highlight possible inconsistencies, such as the presence of termination events which, in our formalism, break sister-fork interactions and lead to single forks being detected. We also clarify the latter point when describing Fig. 5B and describe in more detail the merging of replication bubbles in the Materials and Methods.

      Fig 6B and C, it would be useful if the same scale was used on both plots.

      We now use the same scale when plotting Fig 6B and C.

      Section II-D-1. There is a discussion on the presence of catenated chains; I did not understand how the replicated DNA becomes catenated, and what this actually means in this context. The way the process is described and the snapshots in Fig2C do not suggest that the chains are catenated. Some further discussion or a diagram would be useful here.

      We included a short paragraph to better explain how intertwining of sister chromatids occurs, and refer more clearly to a snapshot in supplementary Fig. S19D (Page 14). As correctly mentioned by the reviewer, replication bubbles are by construction always unknotted during their growth (see example in Fig. 2C). As we thoroughly characterized in our previous work (ref. [25]), when several replication bubbles merge, the random orientation of sister chromatids can lead to catenation points and intertwined structures. We show below a scheme from our previous work (ref. [25]). While in that work we demonstrated that the center of mass of the two sister chromatids shows subdiffusive behaviour due to the additional topological constraints of their intertwining, the new analysis in the present work suggests that possible effects may also be observed when tracking the MSD (mean square displacement) at the locus level in a more realistic scenario that includes correct replication timing, chromosome sizes and Rabl organization.

      On p14 (section III) there is a section discussing possible mechanisms for sister fork interactions, and that result that Ctf4 might not play a role in this, as previously suggested. Are there any other candidate proteins which could be tested in the future?

      To the best of our knowledge, no other replisome protein has been directly associated with sister-fork pairing in previous studies, as Ctf4 has. However, components of the replisome such as Cdt1, which have the capacity to oligomerize/self-interact, could be good candidates. We now mention this possibility in the Discussion (Page 15).

      As on p14, second paragraph: there is a sentence "replication wave [51] cannot be easily visualised at the single cell level.", which seems to contradict the discussion on p9 "such a "wave" can also be observed at the level of an individual trajectory (Video S3,4) even if much more stochastic." I think more explanation is needed here.

      We rephrased the mentioned passages to clarify the differences in detecting such “replication wave” at the population vs single cell level. In video S3 and S4, we can still observe an enrichment of forks at the SPB and later in S-phase a shift towards the equatorial plane. However, the stochasticity of polymer dynamics and 1D replication strongly hinder the ability to clearly visualize such redistribution.

      In the methods section, p18, it is mentioned that the volume fraction is 3%. I assume this is before replication, and so after replication is complete this will increase to 6%. This should be stated more explicitly, with also a comment on the 5% volume fraction used in the time-scale mapping discussed on p17.

      Indeed, we chose to map the experimental MSD measured in ref. [83] by simulating a homopolymer at 5% volume fraction with periodic boundary conditions, for consistency with previous work in the group (ref. [102-106]) and with our previous replication model (ref. [25]). Moreover, this intermediate density regime lies between the minimal (3%) and maximal (6%) densities present in our system. When redoing the time mapping with the G1 MSD plotted in Fig. 6A and new Fig. S19A, we obtain a very similar value of approx. 1MC=0.6ms. Note that the time mapping aims to obtain a rough estimation of real times, as several factors, such as active processes, non-constant density and cell-cycle progression, may all contribute to chromatin diffusion in vivo (see also comment 15 to Reviewer 2). In the context of our formalism, differences in time mapping do not affect the 1D replication dynamics, as all the parameters used to model the 1D process are rescaled by the same factor. Moreover, as we characterized in more depth in our previous work (ref. [25]), a crucial aspect that defines self-replicating polymers is the relationship between fork progression and the polymer relaxation dynamics. In physiological conditions, we remain in the regime where forks progress almost quasi-statically, allowing the bubbles to re-equilibrate. Therefore, small discrepancies in the time mapping will not modify this regime and our results should remain robust.

      On p20, processing of simulated HiC using cooltools is discussed. For readers unfamiliar with this software, a bit more detail should be given. Specifically, how does the normalisation account for having some segments which have been replicated and some which have not. Later on the same page (IV-C-2) two different strategies for comparing HiC maps are given; why are two different methods required, and what is the reasoning in each case?

      In the raw (unbalanced) data, we observe an artificial increase in contacts around origins in S-phase for both simulations and experiments. This is simply due to the presence of the second sister chromatid and the fact that contacts between distinct DNA segments are mapped to a single bin.

      In the new Fig. S25, we illustrate this effect by computing aggregate plots around early origins using single-chromosome simulations. We demonstrate that the ICE normalization corrects for the variations in copy number due to replication and thus for such artificial increases in contacts during S-phase. We show that such a normalization is equivalent to explicitly dividing each bin by the average copy number of the corresponding segments.

      We have now included a sentence in the Materials and Methods to clarify this. Moreover, a detailed description of the alternative strategies used to compare experiments and simulations was presented in response to comment 3 to Reviewer 2, and two new paragraphs were added in the Materials and Methods.
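
The stated equivalence between balancing and an explicit copy-number correction can be illustrated on a toy example. The sketch below makes two simplifying assumptions that are not taken from the manuscript: raw counts in bin (i, j) scale with the product of the copy numbers of bins i and j, and the underlying map has equal row sums (a circular toy chromosome). The balancing routine is a bare-bones iterative row-sum correction written for this example; it omits the filtering and convergence criteria of the actual cooltools/cooler pipeline, and all names and data are hypothetical.

```python
import numpy as np

def ice_balance(matrix, n_iter=300):
    """Minimal iterative correction: repeatedly divide by the outer product of relative row sums."""
    m = np.array(matrix, dtype=float)
    for _ in range(n_iter):
        s = m.sum(axis=1)
        s /= s.mean()
        m /= np.outer(s, s)
    return m

def copy_number_correct(matrix, copy_number):
    """Explicit correction: divide bin (i, j) by copy_number[i] * copy_number[j]."""
    c = np.asarray(copy_number, dtype=float)
    return np.asarray(matrix, dtype=float) / np.outer(c, c)

# Toy circular chromosome with distance-dependent contacts (hypothetical data).
n = 40
idx = np.arange(n)
sep = np.abs(idx[:, None] - idx[None, :])
sep = np.minimum(sep, n - sep)            # circular separation gives equal row sums
true_map = 1.0 / (1.0 + sep)

# Population-averaged copy number: 1 if unreplicated, up to 2 around an early origin.
copy_number = np.ones(n)
copy_number[15:25] = 1.6

raw = true_map * np.outer(copy_number, copy_number)   # ASSUMED effect of replication
balanced = ice_balance(raw)
explicit = copy_number_correct(raw, copy_number)
```

In this idealized setting the two corrected maps agree up to a global factor. With real data the agreement would only be approximate, since other biases and uneven coverage also enter the balancing weights.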

      The references section has an unusual formatting with journal names underlined.

      We updated the formatting.

      Reviewer #4 (Significance (Required)):

      D’Asaro et al focus on the problem of how genome structure is altered by the progression of replisomes through S-phase in the budding yeast S. cerevisiae. The authors employ computational polymer modeling of G1 chromosomes, then implement a hierarchical model of replication origin firing along these polymers to examine how the G1 chromosome structural state is perturbed by replisome progression. Their results indicate that replication origins create 'fountains' - Hi-C map features that other groups have demonstrated are likely to originate from symmetric extrusion by condensin / cohesin complexes originating at a fixed point. These 'fountains' appear to be cohesin-independent, as revealed by depletion Hi-C experiments. Finally, the authors provide evidence from their model of a 'replication wave' that emanates from the spindle pole body. This is an interesting manuscript that raises some exciting questions for the field to follow up on.

      Reviewer #4 (Evidence, reproducibility and clarity (Required)):

      In their manuscript, "Genome-wide modeling of DNA replication in space and time confirms the emergence of replication specific patterns in vivo in eukaryotes," authors D’Asaro et al perform computational modeling analyses to address an important open question in the chromatin field: how is DNA replication timing coupled to 3D genome architecture? Over the past ten years, the convergence of high-resolution replication timing (RT) analysis with high-resolution 3D genome mapping (e.g. 'Hi-C' technology) has resulted in the discovery that replication timing domains overlap considerably with 3D genomic domains such as topologically associating domains (TADs). How and why this happens both remain unknown, and advances in 3D genome mapping technology have provided even more data to model the problem of both 1) scheduling replication from distinct series of origins / initiation zones, and 2) modeling how 3D genome architecture is altered by the progression of replication forks, which inherently destroy chromatin structure before faithfully reforming G1 structures on daughter chromatids. As such, the problem being tackled by this computational manuscript is interesting.

      We thank Reviewer 4 for her/his positive evaluation of our work and for her/his comments, which helped us greatly improve the manuscript.

      Reviewer Comments / Significance

      In their manuscript, "Genome-wide modeling of DNA replication in space and time confirms the emergence of replication specific patterns in vivo in eukaryotes," authors D’Asaro et al perform computational modeling analyses to address an important open question in the chromatin field: how is DNA replication timing coupled to 3D genome architecture? Over the past ten years, the convergence of high-resolution replication timing (RT) analysis with high-resolution 3D genome mapping (e.g. 'Hi-C' technology) has resulted in the discovery that replication timing domains overlap considerably with 3D genomic domains such as topologically associating domains (TADs). How and why this happens both remain unknown, and advances in 3D genome mapping technology have provided even more data to model the problem of both 1) scheduling replication from distinct series of origins / initiation zones, and 2) modeling how 3D genome architecture is altered by the progression of replication forks, which inherently destroy chromatin structure before faithfully reforming G1 structures on daughter chromatids. As such, the problem being tackled by this computational manuscript is interesting.

      D’Asaro et al focus on the problem of how genome structure is altered by the progression of replisomes through S-phase in the budding yeast S. cerevisiae. The authors employ computational polymer modeling of G1 chromosomes, then implement a hierarchical model of replication origin firing along these polymers to examine how the G1 chromosome structural state is perturbed by replisome progression. Their results indicate that replication origins create 'fountains' - Hi-C map features that other groups have demonstrated are likely to originate from symmetric extrusion by condensin / cohesin complexes originating at a fixed point. These 'fountains' appear to be cohesin-independent, as revealed by depletion Hi-C experiments. Finally, the authors provide evidence from their model of a 'replication wave' that emanates from the spindle pole body. This is an interesting manuscript that raises some exciting questions for the field to follow up on.

      Major Comments

      There is a tremendous amount of work coupling RT domains to 3D genome architecture, especially deriving from the ENCODE and 4D Nucleome consortia. These studies are not adequately highlighted in the introduction and discussion of this manuscript, and this treatment of the literature would ideally be amended in any revised manuscript.

      We have added new sentences to the introduction that discuss in more detail the correlation between 3D genome architecture and the replication timing program, and the advances in this field over the last decades. We also included additional citations to reviews and publications (ref [8-16]). These references were also included at the end of the Discussion, where we address the exciting perspective of employing our model in higher eukaryotes and potentially tackling the complex interplay between 3D nuclear compartmentalization and replication dynamics (see also response 1 to Reviewer 1).

      S. cerevisiae origins of replication differ from metazoan origins of replication in that they are sequence-defined and are known to fire in a largely deterministic pattern (see classic study PMID11588253). From the methods of the authors it is not clear that the known deterministic firing pattern is being used here, but instead a stochastic sampling method? Please clarify in the manuscript. Specifically, it would be good to understand how the Initiation Probability Landscape Signal correlates with what is already known about origin firing timing.

      In our model, the positions of origins are stochastically sampled in proportion to the IPLS, which was inferred directly from experimental MRT (ref. [63]) and RFD (ref. [44]) data. This modeling approach reproduces the known replication timing data (correlation of 0.96) and fork directionality data (correlation of 0.91) with very high accuracy (see ref. [71]). Origins were defined as the peaks in the IPLS signal. In Fig S3, we extensively compare these origins with the known ARS positions from the OriDB database. For example, most of our early origins (96%) are located close to known, confirmed ARS. Moreover, even though our algorithm for origin firing is stochastic, we remark that each early origin fires in 90% of the simulations, consistent with the quasi-deterministic pattern of origin firing and with the experimental MRT and RFD data. We have now added these firing statistics to the revised manuscript (Page 4).
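
      As an illustration only, the idea of sampling origin positions in proportion to an initiation landscape, and then counting how often each position fires across simulations, can be sketched as follows. This is a toy sketch and not the model of ref. [63]: the function names, the landscape values, and the fixed number of origins per simulation are all hypothetical assumptions, and the actual framework also accounts for fork progression and timing.

```python
import random


def sample_origins(ipls, n_origins, rng):
    """Draw origin bins with probability proportional to the landscape value.

    `ipls` is a hypothetical list of non-negative weights, one per genomic bin.
    Sampling is with replacement, so a bin can be drawn more than once.
    """
    bins = list(range(len(ipls)))
    return rng.choices(bins, weights=ipls, k=n_origins)


def firing_frequency(ipls, n_origins, n_sims, seed=0):
    """Fraction of simulations in which each bin fires at least once."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    counts = [0] * len(ipls)
    for _ in range(n_sims):
        # Use a set so a bin drawn twice in one simulation is counted once.
        for pos in set(sample_origins(ipls, n_origins, rng)):
            counts[pos] += 1
    return [c / n_sims for c in counts]


if __name__ == "__main__":
    # Toy landscape: two strong peaks (bins 1 and 3) and weak background.
    toy_ipls = [0.0, 5.0, 0.1, 5.0, 0.0, 0.1]
    freqs = firing_frequency(toy_ipls, n_origins=3, n_sims=2000)
    for pos, freq in enumerate(freqs):
        print(f"bin {pos}: fires in {freq:.1%} of simulations")
```

      In this toy setting, bins with a high landscape value fire in most simulations even though each draw is random, which is the sense in which a stochastic sampling scheme can still produce a quasi-deterministic firing pattern.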

      It seems possible that experimental sister chromatid Hi-C data (PMID32968250) and nanopore replicon data (PMID35240057) could be used to further ascertain the validity of some of the findings of this paper. Specifically, could the authors demonstrate evidence in sister chromatid Hi-C data that the replisome is in fact extruding sister chromatids? Moreover, are the interactions being measured specifically in cis (as opposed to trans sister contacts)? For the nanopore replicon data, how do replicon length, replication timing, and position along the replication 'wave' correlate?

      We thank the reviewer for the suggestions.

      Unfortunately, there is currently no Sister-C data available during S-phase. In the seminal study (PMID32968250), cells were arrested in G2/M via nocodazole treatment. For a different unpublished work, we already analysed the Sister-C dataset in detail and did not observe a clear fountain-like signature, consistent with our own G2/M Hi-C maps (cdc20) where fountains were absent. Note that, in the present work, in order to compare our predictions with standard Hi-C data, we included all contacts (cis and trans chromatids), mapping pairwise contacts from distinct replicated sequences/monomers to a single bin (see also response to comment 17 to Reviewer 3 and new Fig. S25).

      We now mention in the Discussion that Sister-C data during S-phase could help monitor the role of replisomes in relative sister-chromatid organization (Page 15).

      The main results of the nanopore replicon data study include the observed high symmetry between sister forks and their linear progression, as the density of replicons appears to be uniform with respect to their length. Since these two specific constraints are already present in the framework of Arbona et al. (ref. [63]), our model is able to reproduce these features of DNA replication captured by the nanopore data.

      Moreover, as we model replication timing data (see response to comment 2) and fork positioning with very high accuracy, we can assume that our formalism captures well the replicon positioning and lengths observed in vivo.

      As this study does not include any additional exploration or variation of the parameters inferred by Arbona et al. (ref. [63]), we consider a quantitative comparison with the nanopore replicon data to be beyond the scope of this paper.

      Minor Comments:

      The paper is in most places easy to follow. However, Section C bucked this trend and in general was quite difficult to follow. We would recommend that the authors try to revise this section to make clearer the actual physical parameters that govern a 'replication wave' and the formation of replication foci - how many forks, the extent to which the sisters are coordinated, etc for early vs. late replicating regions.

      We now state more clearly with a sentence in the main text the driving forces behind the formation of such a “replication wave”. We believe that the several additions and clarifications following the various comments, improved the clarity of the manuscri

    1. Reviewer #1 (Public review):

      Munday, Rosello, and colleagues compared predictions from a group of experts in epidemiology with predictions from two mathematical models on the question of how many Ebola cases would be reported in different geographical zones over the next month. Their study ran from November 2019 to March 2020 during the Ebola virus outbreak in Democratic Republic of the Congo. Their key result concerned predicted numbers of cases in a defined set of zones. They found that neither the ensemble of models nor the group of experts produced consistently better predictions. Similarly, neither model performed consistently better than the other, and no expert's predictions were consistently better than the others'. Experts were also able to specify other zones in which they expected to see cases in the next month. For this part of the analysis, experts consistently outperformed the models. In March, the final month of the analysis, the models' accuracy was lower than in other months, and consistently poorer than the experts' predictions.

      A strength of the analysis is use of consistent methodology to elicit predictions from experts during an outbreak that can be compared to observations, and that are comparable to predictions from the models. Results were elicited for a specified group of zones, and experts were also able to suggest other zones that were expected to have diagnosed cases. This likely replicates the type of advice being sought by policymakers during an outbreak.

      A potential weakness is that the authors included only two models in their ensemble. Ensembles of greater numbers of models might tend to produce better predictions. The authors do not address whether a greater number of models could outperform the experts.

      The elicitation was performed in four months near the end of the outbreak. The authors address some of the implications of this. A potential challenge for the transferability of this result is that the experts' understanding of local idiosyncrasies in transmission may have improved over the course of the outbreak. The model did not have this improvement over time. The comparison of models to experts may therefore not be applicable to early stages of an outbreak when expert opinions may be less well-tuned.

      This research has important implications for both researchers and policy-makers. Mathematical models produce clearly-described predictions that will later be compared to observed outcomes. When model predictions differ greatly from observations, this harms trust in the models, but alternative forms of prediction are seldom so clearly articulated or accurately assessed. If models are discredited without proper assessment of alternatives then we risk losing a valuable source of information that can help guide public health responses. From an academic perspective, this research can help to guide methods for combining expert opinion with model outputs, such as considering how experts can inform models' prior distributions and how model outputs can inform experts' opinions.

      Comments on revisions:

      I am grateful to the authors for their responses to my previous comments. I think their updates have made the paper much clearer. I do not think the updates change the opinions already given in the public review so I have not modified it.

    2. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Munday, Rosello, and colleagues compared predictions from a group of experts in epidemiology with predictions from two mathematical models on the question of how many Ebola cases would be reported in different geographical zones over the next month. Their study ran from November 2019 to March 2020 during the Ebola virus outbreak in the Democratic Republic of the Congo. Their key result concerned predicted numbers of cases in a defined set of zones. They found that neither the ensemble of models nor the group of experts produced consistently better predictions. Similarly, neither model performed consistently better than the other, and no expert's predictions were consistently better than the others. Experts were also able to specify other zones in which they expected to see cases in the next month. For this part of the analysis, experts consistently outperformed the models. In March, the final month of the analysis, the models' accuracy was lower than in other months and consistently poorer than the experts' predictions. 

      A strength of the analysis is the use of consistent methodology to elicit predictions from experts during an outbreak that can be compared to observations, and that are comparable to predictions from the models. Results were elicited for a specified group of zones, and experts were also able to suggest other zones that were expected to have diagnosed cases. This likely replicates the type of advice being sought by policymakers during an outbreak. 

      A potential weakness is that the authors included only two models in their ensemble. Ensembles of greater numbers of models might tend to produce better predictions. The authors do not address whether a greater number of models could outperform the experts. 

      The elicitation was performed in four months near the end of the outbreak. The authors address some of the implications of this. A potential challenge to the transferability of this result is that the experts' understanding of local idiosyncrasies in transmission may have improved over the course of the outbreak. The model did not have this improvement over time. The comparison of models to experts may therefore not be applicable to the early stages of an outbreak when expert opinions may be less well-tuned.

      This research has important implications for both researchers and policy-makers. Mathematical models produce clearly-described predictions that will later be compared to observed outcomes. When model predictions differ greatly from observations, this harms trust in the models, but alternative forms of prediction are seldom so clearly articulated or accurately assessed. If models are discredited without proper assessment of alternatives then we risk losing a valuable source of information that can help guide public health responses. From an academic perspective, this research can help to guide methods for combining expert opinion with model outputs, such as considering how experts can inform models' prior distributions and how model outputs can inform experts' opinions. 

      Reviewer #2 (Public review):

      Summary: 

      The manuscript by Munday et al. presents real-time predictions of geographic spread during an Ebola epidemic in north-eastern DRC. Predictions were elicited from individual experts engaged in outbreak response and from two mathematical models. The authors found comparable performance between experts and models overall, although the models outperformed experts in a few dimensions. 

      Strengths: 

      Both individual experts and mathematical models are commonly used to support outbreak response but rarely used together. The manuscript presents an in-depth analysis of the accuracy and decision-relevance of the information provided by each source individually and in combination. 

      Weaknesses: 

      A few minor methodological details are currently missing.

      We thank the reviewers for taking the time to consider our paper and for their positive reflections and suggestions for our study. We recognise and endorse their characterisation of the study in the public reviews and are grateful for their interest in and support for this work.

      Reviewer #1 (Recommendations For The Authors): 

      I initially found Table 1 difficult to interpret. In the final two columns, the rows relate to each other but in the other columns, rows within months don't relate to each other. Could this be made clearer? 

      Thank you for your helpful suggestion. We agree that this is a little confusing and have now added vertical dividers to the table to indicate which parts of the table relate to each other.

      In Figure 1A, the colours are the same as in the colour-bar for Figure 1B but don't have the same meaning. Could different colours be used or could Figure 1A have its own colour-bar to aid clarity? 

      Thank you for your query. The colours are not from the same palette, but we appreciate that they look very similar. To help the reader, we have changed the colour palette of panel A and added a legend to the left.

      In Figure 3, can labels for each expert be aligned horizontally, rather than moving above and below the timeline each month? 

      Thank you for your perspective on this. We made the conscious decision to display the experts in this way as it allows the timeline to be presented in a shorter horizontal space. We appreciate that others may prefer a different design, but we are happy with this one.

      On lines 292 and 293, the authors state that experts were less confident that case numbers would cross higher thresholds. It seems that this would be inevitable given the number of cases is cumulative. Could this be clarified, please? 

      Thank you for raising this point. We agree that this wording is confusing. We have now reworked the entire section in response to another reviewer. The equivalent section now reads: 

      Experts correctly identified Mabalako as the highest-risk HZ in December. They attributed an average 82% probability of exceeding 2 cases; Mabalako reported 38 cases that month, exceeding all thresholds, although the probability assigned to exceeding the higher thresholds was similar to that of Beni (3 cases)

      Reviewer #2 (Recommendations For The Authors): 

      (1) Some methodological details seem to be missing. Most importantly, the results present multiple ensembles (experts, models, and both), but I can't seem to find anywhere in the Methods that details how these ensembles are calculated. Also, I think it would be useful to define the variables in each equation. It would have been easier to connect the equations to the description if the variables were cited explicitly in the text. 

      Thank you for pointing out these omissions. We have included the following paragraph to detail how ensemble forecasts were calculated. 

      “Ensemble forecasts

      Ensemble forecasts were calculated as an average of the probabilities attributed by the members of the ensemble. For the expert ensemble, the arithmetic mean was calculated across all experts with equal weighting. Similarly, the model ensemble used the unweighted mean of the model forecasts. For the mixed (model and expert) ensemble, the mean was weighted such that the combined weight of the experts' forecasts and the combined weight of the models' forecasts were equal.”
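
      The weighting described in this paragraph can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names and the example probabilities are hypothetical. Giving the experts and the models equal combined weight is equivalent to averaging the two group means.

```python
from statistics import fmean


def unweighted_ensemble(probs):
    """Arithmetic mean of member probabilities (expert-only or model-only ensemble)."""
    return fmean(probs)


def mixed_ensemble(expert_probs, model_probs):
    """Mixed ensemble in which each group carries a combined weight of 0.5.

    Each expert therefore has weight 0.5 / len(expert_probs) and each model
    has weight 0.5 / len(model_probs), regardless of the group sizes.
    """
    return 0.5 * fmean(expert_probs) + 0.5 * fmean(model_probs)


if __name__ == "__main__":
    # Hypothetical probabilities that one health zone exceeds one case threshold.
    experts = [0.8, 0.6, 0.7]
    models = [0.2, 0.4]
    print("expert ensemble:", unweighted_ensemble(experts))
    print("model ensemble:", unweighted_ensemble(models))
    print("mixed ensemble:", mixed_ensemble(experts, models))
```

      With three experts and two models, a plain mean over all five forecasts would let the experts dominate; the group-level weighting avoids this.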

      (2) Overall, I think the results provide a strong analysis of model vs. expert performance. However, some sections were highly detailed (e.g., the text usually discusses results for every month and all health zones), which clouded my ability to see the salient points. For example, I found it difficult to follow all the details about expert/model predictions vs. observations in the "Expert panel and health zones..." subsection; instead, the graphical illustration of predictions vs. observations in Figure 4 was much easier to interpret. Perhaps some of these details could be trimmed or moved to the supplementary material. 

      Thank you for your honest feedback on this point. We have shortened this section to highlight the key points that we feel are the most important. We have also simplified the text where we discuss the health zones nominated by experts. 

      (3) Figure 5C is a nice visualization of the fallibility of relying on a single individual expert (or model). I wonder if it would be useful to summarize these results into the probability that a randomly selected expert outperforms a single model. Is it the case that a single expert is more unreliable than a single model? The discussion emphasizes the importance of ensembles and compares a single model to an ensemble of experts, but eliciting predictions from multiple experts may not always be possible. 

      Thank you for raising this. We agree that this is an important point: eliciting expert opinions is not a trivial task and should not be taken for granted. We also agree with the principle of your suggestion that it would be useful to understand how the models compare to individual experts. However, we do not believe that an additional analysis would add sufficient information beyond what is already shown in Figure 5, which displays the full distribution of individual experts for each month and threshold. If you would like to try this analysis yourself, the relevant data (the individual score for each combination of expert, threshold, health zone and month) is included in the GitHub repo (https://github.com/epiforecasts/Ebola-Expert-Elicitation/blob/main/outputs/indevidual_results_with_scores.csv).

      Minor comments: 

      (1) Figure 2: the color scales in each panel are meant to represent different places, correct? The figure might be easier to interpret if the colors used were different.  

      Thank you for bringing this to our attention. We have now changed the palette of panel A to differ from panel B.  

      (2) Equation 7: is o(c>c_thresh) meant to be the indicator function (i.e., 1 if c>c_thresh and 0 otherwise)?

      Thanks for raising this. The function o is the same as in the previous equation – an observation count function. We appreciate that this is not immediately clear so have added a sentence to explain the notation after the equation.

      (3) Table 1: a brief description of the column headers would be useful.  

      Thank you for the suggestion. We have now extended the table caption to include more description of the columns. 

      “Table 1: Experts and health zones included in each round of the survey. The left part of the table details the experts interviewed (highlighted in green) and the health zones included in the main survey in each month. In addition, the right part of the table details the health zones nominated by experts and the number of experts that nominated each one.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      This study investigates how ant group demographics influence nest structures and group behaviors of Camponotus fellah ants, a ground-dwelling carpenter ant species (found locally in Israel) that builds subterranean nest structures. Using a quasi-2D cell filled with artificial sand, the authors perform two complementary sets of experiments to try to link group behavior and nest structure: first, the authors place a mated queen and several pupae into their cell and observe the structures that emerge both before and after the pupae eclose (i.e., "colony maturation" experiments); second, the authors create small groups (of 5, 10, or 15 ants, each including a queen) within a narrow age range (i.e., "fixed demographic" experiments) to explore the dependence of construction on age. Some of the fixed demographic instantiations included a manually induced catastrophic collapse event; the authors then compared emergency repair behavior to natural nest creation. Finally, the authors introduce a modified logistic growth model to describe the time-dependent nest area. The modification introduces parameters that allow for age-dependent behavior, and the authors use their fixed demographic experiments to set these parameters, and then apply the model to interpret the behavior of the colony maturation experiments. The main results of this paper are that for natural nest construction, nest areas and morphologies depend on the age demographics of ants in the experiments: younger ants create larger nests and angled tunnels, while older ants tend to dig less and build predominantly vertical tunnels; in contrast, emergency response seems to elicit digging in ants of all ages to repair the nest.

      We sincerely thank Reviewer #1 for the time and effort dedicated to our manuscript's detailed review and assessment. The revision suggestions were constructive, and we have provided a point-by-point response to address them.

      Reviewer #2 (Public review):

      I enjoyed this paper and the approach to examining an accepted wisdom of ants determining overall density by employing age polyethism that would reduce the computational complexity required to match nest size with population (although I have some questions about the requirement that growth is infinite in such a solution). Moreover, I appreciated the realization that models of collective behaviour may be inappropriate in many systems in which agents (or individuals) differ in the behavioural rules they employ, according to age, location, or information state. This is especially important in a system like social insects, typically held as a classic example of individual-as-subservient to whole, and therefore most likely to employ universal rules of behaviour. The current paper demonstrates a potentially continuous age-related change in target behaviour (excavation), and suggests an elegant and minimal solution to the requirement for building according to need in ants, avoiding the invocation of potentially complex cognitive mechanisms, or information states that all individuals must have access to in order to have an adaptive excavation output.

      We sincerely thank reviewer #2 for the time and effort dedicated to our manuscript's detailed review and assessment. We have provided a point-by-point response to the reviewer's comments, which we have incorporated into the revised version of the manuscript.

      The only real reservation I have is in the question of how this relationship could hold in properly mature colonies in which there is (presumably) a balance between the birth and death of older workers. Would the prediction be that the young ants still dig, or would there be a cessation of digging by young ants because the area is already sufficient? Another way of asking this is to ask whether the innate amount of digging that young ants do is in any way affected by the overall spatial size of the colony. If it is, then we are back to a problem of perfect information - how do the young ants know how big the overall colony is? Perhaps using density as a proxy? Alternatively, if the young ants do not modify their digging, wouldn't the colony become continuously larger? As a non-expert in social insects, I may be misunderstanding and it may be already addressed in the citations used.

      We thank the reviewer for this interesting question. We find that nest excavation is predominantly performed by the younger ants in the nest, and that the increase in nest area is followed by an increase in the population. However, if the young ants dig unrestricted, this could result in unnecessary nest growth, as suggested by reviewer #2. Therefore, we believe that the innate digging behavior of ants could potentially be regulated by various cues, such as:

      (a) Density-based: If the colony becomes less dense as its area expands, this could serve as a feedback signal for young ants to reduce or stop digging, as described in references (25, 29, 30).

      (b) Pheromone depositions: If the colony reaches a certain population density, pheromone signals could inhibit further digging by young ants, references (25, 29), or space usage as a proxy for the nest area. 

      Thus, rather than perfect information, decentralized control, and digging-based local cues probably regulate the level of age-dependent digging, without the ants needing to estimate the overall colony size or nest area.

      In any case, this is an excellent paper. The modelling approach is excellent and compelling, also allowing extrapolation to other group sizes and even other species. This to me is the main strength of the paper, as the answer to the question of whether it is younger or older ants that primarily excavate nests could have been answered by an individual tracking approach (albeit there are practical limitations to this, especially in the observation nest setup, as the authors point out). The analysis of the tunnel structure is also an important piece of the puzzle, and I really like the overall study.

      We thank the reviewer for the comments. We completely agree that individual tracking of ants within our experimental setup would have been the ideal approach, but we were constrained by technical and practical limitations of the setup, as pointed out by the reviewer, such as:

      (a) Continuous tracking of ants in our nests would have required a camera to be positioned at all times in front of the nest, which necessitates a light background. Since Camponotus fellah ants are subterranean, we aimed to allow them to perform nest excavation in conditions as close to their natural dark environment as possible. Additionally, implementing such a system in front of each nest would have reduced the sample sizes for our treatments.

      (b) The experimental duration of our colony maturation and fixed demographics experiments extended for up to six months (unprecedented durations in these kinds of measurements). These naturally limited our ability to conduct individual tracking while maintaining the identity of each ant based on the current design.

      These points are described in detail in the revised version of the manuscript.

      Reviewer #3 (Public review):

      Summary:

      In this study, Harikrishnan Rajendran, Roi Weinberger, Ehud Fonio, and Ofer Feinerman measured the digging behaviours of queens and workers for the first 6 months of colony development, as well as groups of young or old ants. They also provide a quantitative model describing the digging behaviours and allowing predictions. They found that young ants dig more slanted tunnels, while older ants dig more vertically (straight down). This finding is important, as it describes a new form of age polyethism (a division of labour based on age). Age polyethism is described as a "yes or no" mechanism, where individuals either perform a task or not according to their age (usually young individuals perform in-nest tasks, and older ones forage). Here, the way of performing the task is modified, not only the propensity to carry it out. This data therefore adds in an interesting way to the field of collective behaviours and division of labour.

      The conclusions of the paper are well supported by the data. Measurements of the same individuals over time would have strengthened the claims.

      We sincerely thank reviewer #3 for the time and effort dedicated to our manuscript's detailed review and assessment. We completely agree with the reviewer’s comments on the measurements of the same individuals over time; however, we were constrained by the technical and experimental limitations described above and pointed out by reviewer #2.

      Strengths:

      I find that the measure of behaviour through development is of great value, as those studies are usually done at a specific time point with mature colonies. The description of a behaviour that is modified with age is a notable finding in the world of social insects. The sample sizes are adequate and all the information clearly provided either in the methods or supplementary.

      We thank reviewer #3 for this assessment.

      Weaknesses:

      I think the paper is failing to take into consideration or at least discuss the role of inter-individual variabilities. Tasks have been known to be undertaken by only a few hyper-active individuals for example. Comments on the choice to use averages and the potential roles of variations between individuals are in my opinion lacking. Throughout the paper wording should be modified to refer to the group and not the individuals, as it was the collective digging that was measured. Another issue I had was the use of "mature colony" for colonies with very few individuals and only 6 months of age. Comments on the low number of workers used compared to natural mature colonies would be welcome.

      Regarding the main comment 1:

      We completely agree with the reviewer’s comment on considering inter-individual variability based on activity levels. We have discussed how individual morphological variability could influence digging behavior (references: 28, 31), and we will elaborate further on this aspect in future revisions.

      Regarding the main comment 2:

      The term ‘colony maturation’ in our study refers to the progressive development of colonies from a single queen, distinguishing it from experiments that begin with pre-established, demographically stable colonies. We provide a detailed explanation for this terminology in the revised version of the manuscript. We were practically limited by the continuation of the experiments for more than 6 months of age, predominantly due to the stability of nests, as they were made with a sand-soil mix. We also acknowledge that the colony sizes attained in our maturation experiments may be smaller than those of naturally matured colonies. This trend was observed generally in lab-reared colonies and could be attributed to differences in microclimatic conditions, foraging opportunities, space availability, and other factors. We have explicitly described these details in the revised version of the manuscript.

      Reviewer #1 (Recommendations for the authors):

      The experimental design is fantastic. The large quasi-2D should allow for the direct visualization of the movements of individuals and the creation of the nest, and the inclusion of non-workers (specifically, a mated queen and pupae) is new and important. However, I have some questions and concerns about the results, as outlined below. Also, I found the paper difficult to read, and the connections between the various experiments and the model were not always clear. 

      We thank the reviewer for the time and effort dedicated to reviewing our manuscript. We have modified the manuscript substantially to address the comments and readability. 

      The assumption that the digging rate is constant across ants may be a strong one. Previous work (see, for instance, Aguilar, et al, Science 2018) has demonstrated a very heterogeneous workload distribution among ants. I am not sure what implications that may have for the results here, but the authors should comment on this choice. Related to the point above, given a constant digging rate, the variation in digging is attributed to an age-dependent "desired target area". Can the authors comment on the implications of this, specifically in contrast to a variable digging rate? The distinction between digging rate differences and target area differences seems to be important for the authors. However, the way this is presented, it is difficult to fully understand or appreciate this importance and its implications. What is the consequence of this difference, and why is this important?

      We apologize to the reviewer for the confusion.

      Our model does not assume that the digging rate (da/dt, Equation 1) remains constant throughout the experiment. Instead, we only treat the basal digging rate (r) as a constant.

      The variable digging rate (da/dt, Equation 1) is derived by multiplying the basal rate constant (r) by the term (1 - a/a<sub>age</sub>), which accounts for deviations from the age-dependent target area that the ants aim to achieve. This makes the actual digging rate dynamic, as it responds to changes in excavated area (e.g., expansion or rapid collapse).

      For example, according to our model (Equation 1), two ants with the same basal digging rate (r) may exhibit markedly different actual digging rates at a given time if they differ in age. This occurs because the variable digging rate (da/dt) depends not only on ‘r’ but also on the age-dependent term (1 - a/a<sub>age</sub>). Also, we emphasize that the use of a basal digging rate constant aligns with prior studies (refs. 24, 29, 30).
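
      The distinction between the basal rate and the actual rate can be illustrated with a minimal sketch of Equation 1 as it is described in this response, da/dt = r(1 - a/a<sub>age</sub>). The function names and the numerical values of r and a<sub>age</sub> used in any call are illustrative assumptions, not values or code from the manuscript; the closed-form expression is simply the standard solution of this linear equation for a constant a<sub>age</sub>.

```python
import math


def digging_rate(a, r, a_age):
    """Actual digging rate da/dt = r * (1 - a / a_age) for one ant.

    a      -- area already excavated by the ant
    r      -- basal digging rate constant (assumed identical across ages)
    a_age  -- age-dependent target area (hypothetical value supplied by caller)
    """
    return r * (1.0 - a / a_age)


def excavated_area(t, r, a_age, a0=0.0):
    """Closed-form solution of da/dt = r * (1 - a / a_age) with a(0) = a0.

    Valid only while a_age is treated as constant over the interval [0, t].
    """
    return a_age - (a_age - a0) * math.exp(-r * t / a_age)
```

      Under these assumptions, two ants with the same r start at the same rate when a = 0 (for example, directly after a collapse), while at any intermediate excavated area the ant with the larger target digs faster.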

      In our work, we demonstrate that after a collapse event, ants of all ages dig at rates comparable to those observed in the initial (pre-collapse) phase of the experiment. This occurs because the ants are far from their age-dependent target area, effectively resetting their digging behavior. By comparing maximum digging rates pre- and post-collapse, we provide strong empirical evidence that this rate is age-independent (SI Fig. 6A, 6B), supporting the conclusion that the basal digging rate constant (r) is a fundamental property of the ants' behavior, unaffected by age.

      We agree with the reviewer that individual tracking of ants within our experimental setup would have been the ideal approach, as it would have allowed us to take the inter-individual variability of digging activity into account. However, we were prevented from doing so by the technical and practical limitations of the setup, such as:

      (a) Continuous tracking of ants in our nests would have required a camera to be positioned at all times in front of the nest, which necessitates a light background. Since Camponotus fellah ants are subterranean, we aimed to allow them to perform nest excavation in conditions as close to their natural dark environment as possible. Additionally, implementing such a system in front of each nest would have reduced the sample sizes for our treatments.

      (b) The experimental duration of our colony maturation experiments extended for up to six months (an unprecedented duration for this kind of measurement). This naturally limited our ability to conduct individual tracking while maintaining the identity of each ant based on the current design.

      In light of these points, the following lines are added to the discussion (line numbers: 283-295), signifying the above points:

      “Our age-dependent model demonstrates that the digging behavior in Camponotus fellah is governed by a basal digging rate constant (r) modulated by the age-dependent feedback (1 − a/a<sub>age</sub>). Crucially, we show that after a collapse, the maximum digging rates return to their pre-collapse levels, suggesting that this basal rate ‘r’ represents an age-independent ceiling on how fast ants can dig, regardless of age or context (SI Fig. 6 A, B). Previous studies have demonstrated both homogeneous and heterogeneous workload distribution, with varying digging rates among ants (24, 29, 30, 35). Studies showing heterogeneous workload distribution relied on continuous individual tracking of ants to quantify digging rates (35). However, this approach was not feasible in our current design due to the experimental durations of both our colony maturation and fixed demographics experiments. Additionally, sample size requirements naturally limited our ability to conduct continuous individual tracking during nest construction in our study. Thus, based on empirical measurements from our fixed-demographics experiments and supported by the age-independent post-collapse digging rates, we adopted a constant basal digging rate for simulating our age-dependent model, an assumption aligned with both prior literature and the collective dynamics observed in our system (24, 29, 30)”.

      Model: as presented, the model seems to lack independent validation. The model seems to have built-in that there is an age-dependent target area, and this is what is recovered from the model. I am failing to see what is learned from the model that the experiments do not already show. Also, the model has no ant interactions, though ants are eusocial and group size is known to have a large effect on behavior (this is acknowledged by the authors at the beginning of the discussion). Can the authors comment on this? My recommendation would be to remove the model from this paper or improve the text to address the above comments.

      We did not draw the conclusion of the age-dependent target area from our model. We used the fixed demographics experiments to quantify the age-dependent area target as a function of the age of individuals. We then used this age-dependent area target in our model to quantify the excavation dynamics of the colony maturation experiments, where ants span a variety of ages, as the nest population changes over time, resulting in natural variation in the ages of individuals within the nest.  These results could not have been obtained by performing any of the individual experiments, whether colony maturation or the fixed demographics, young or old, on their own. The need for different age demographics was crucial to quantify the age-dependent effects in nest excavation, which were lacking in previous studies. 

      First, the age-dependent model provides a very good estimate for the natural growth of the nest. More importantly, after fixing an age threshold of 56 days (mean + standard deviation of the young ant age), the model provides an estimate of which ants are doing the majority of the digging during natural nest expansion. This teaches us that during natural expansion, the older ants are far from their density target and therefore do not engage in any substantial digging, which is shown in Figure 4C.

      On the other hand, the younger ants are close to their area targets and induced to dig. Indeed, the target area fitted for the age-independent model closely approximates the empirically measured age-dependent target when extrapolated to very young ants. This provides further support for the idea that, in the colony maturation experiments, the youngest ants are responsible for most of the digging.

      Our model is a simple analytical model, inspired by earlier models that used a fixed area target (such as density models) for nest construction. However, because we knew the precise age of workers in our experiments, we were able to obtain age-dependent area targets, thereby challenging the use of a constant area target (as employed in prior studies) in light of our findings from the fixed demographics of young and old colonies.

      Empirically Quantifiable Parameters: We wanted our model to have empirically quantifiable parameters. Since we did not continuously record the experiment, we could not quantify agent-agent interactions, pheromonal depositions, or similar factors.

      Minimal Model Design: We aimed to keep the model as minimal as possible, which is why we did not include complex interactions such as those found in continuous tracking experiments.

      However, the model does set up some interesting hypotheses that could easily be tested with the experimental setup (e.g., marking the ants / tracking individual activity levels). For instance, it is hypothesized that older ants dig less often, but when they do dig, they do so at the same rate. Given the 2D setup, the authors could track individual ants and test this hypothesis. Also, if the desired target area does decrease with age, the authors could verify this hypothesis by placing older ants into arenas with different-sized pre-formed nests to observe how structure is changed to achieve the desired area/ant.

      We thank the reviewer for this comment.

      We believe that the confusion with the usage of a constant basal digging rate is resolved now. To briefly reiterate, ants dig at variable rates that can be decomposed to a (constant on short time scales but age-dependent) basal rate times the (variable) distance from the density target. The suggested experiments are beyond the scope of our current study, and further studies could utilize the suggested experimental design with better time-resolved imaging for individual ant tracking that could verify the predictions from our model. 

      Specific comments:

      Title:

      The title suggests a broad result, yet the study focuses on one ant species. Please modify the title to more accurately reflect the scope of the work.

      We thank the reviewer for the comment.

      The title is modified as “Colony demographics shape nest construction in Camponotus fellah ants.”

      Introduction:

      Important information and context are missing about this ant species. For instance, please add the following about this species in the introduction:

      What is their natural habitat and substrate? How does the artificial soil compare?

      What is their (rough) colony size? [later, discuss experiment group size choice and potential insights/limitations of results when applied to the natural system].

      The details have been added to the introduction (line numbers : 49-55) and the materials and methods section (Study species).

      “Camponotus fellah ants are native to the Near East and North Africa, particularly found in countries like Israel, Egypt, and surrounding arid and semi-arid regions, where they prefer to nest in moist, decaying wood, including tree trunks, branches, or stumps (49,50). The species lives in monogynous colonies with tens to thousands of individuals. Nests are commonly found in a sand-loamy mix, which is a combination of sand, soil, clay, or gravel, providing structural stability and moisture retention (51). They are typically found under rocks, in the crevices of dried vegetation, or dry, sandy soils, sometimes in areas with loose gravel, with a colony size ranging from tens to thousands of workers”.

      What is the natural life expectancy of a worker? A queen? [later, discuss fixed demographic age choices in this context and/or why were age ranges chosen for experiments?].

      The lifespan of ants, including both queens and workers, varies significantly based on caste, species, and environmental conditions.

      (1) Queen Longevity: From the literature, Camponotus fellah queens can live up to 20 years, with one documented case reaching 26 years (50). 

      (2) Worker Longevity: In contrast to queens, the lifespan of workers is much shorter. Lab studies on Camponotus fellah (82) and other Camponotus species (83) suggest that workers can live for several months depending on environmental conditions, colony health, and caste-specific roles (e.g., minor vs. major workers).

      (3) Laboratory vs. Natural Conditions: Worker longevity is highly variable between laboratory and natural conditions.

      Therefore, in the context of the old worker lifespan in our experiments, ~200 days (roughly 6–7 months), we strongly believe that the worker lifespan used in our experiments represents a substantial portion of a worker's expected life. While exact figures for C. fellah workers are unavailable, inferences from related species suggest that workers nearing 200 days are approaching the latter stages of their lifespan, making them meaningfully "old". 

      The details are added to the main text (line numbers: 124-127) and discussion (line numbers: 278-282).

      Why was this species chosen? Convenience, or is there something special about this species that the readers should know? Specifically, is there something that might make the results more general or of broader interest?

      Camponotus fellah was chosen for this study because it is native to Israel, making it convenient to collect and maintain in the lab. Additionally, its nuptial flights occur close to the study location, ensuring a steady supply of colonies. We were able to provide them with a nesting substrate similar to what they naturally use, as their nests are typically found in a sand-loamy mix, similar to the sand-soil mix in our artificial nests. This was possible because we had the opportunity to observe their habitat and nesting behavior in the wild, allowing us to gather preliminary information on their natural nesting conditions.

      Results:

      Line 60: "several brood items" - how many exactly? Was this consistent across experiments? Do mated queens ever produce more pupae during the experiments?

      Yes, the number of brood items (5) was added consistently across the experiments. Additionally, the mated queen did produce pupae during the course of the experiments, which was evident from the noticeable increase in the number of workers in the nest. This was significantly higher than the number of brood items present at the start of the study.

      The above points are added to the section (line numbers : 68-69).

      Figure 1: Panel A - The food ports are never mentioned in the text. Are the ants fed during the experiments? If so, what? With what frequency? Is the water column replenished/maintained? If so, how and how often? panel C - how long did this experiment last?

      We thank the reviewer for pointing this out. We have now updated the nest maintenance section in the Materials and Methods (line numbers : 349-354) part to include all the necessary details and clarifications.

      “We provided food to the ants ad libitum through three separate tubes containing water, 20 % sucrose water, and protein food. The protein mixture included egg powder, tuna, prawns, honey, agar, and vitamins. Each of the three tubes was filled with 5 ml of their respective contents and sealed with a cotton stopper to prevent overflow. The tubes were positioned at a slight angle and connected using a custom-made plexiglass adapter to facilitate the flow of liquids. These tubes were replenished once depleted, and regularly replaced once the nest maintenance was carried out bi-weekly.”

      Line 76: "...excavation was commenced by the founding queen". How were the queen and pupae introduced into the system?

      We initiated colony maturation experiments by introducing a single mated queen and several brood items (pupae) at random positions on the soil layer of the nest (line numbers: 68-69).

      Line 87: Please provide bounds for 11cm2/ant value. Is there any biological or physical justification for this number?

      We thank the reviewer for the suggestion. We have now provided the bounds as requested (line numbers : 97-101). 

      We were unable to pinpoint a specific biological justification based solely on this treatment. However, on extrapolating the age-dependent area fit we derived from the fixed demographics experiment, we found that at the age of 1 day, an ant has a target area of approximately 11.17 cm², which is the largest age-dependent area target possible within our experimental setup.

      From the colony maturation experiment, we obtained an area per ant of 11.6 (±1.15) cm². The area per ant obtained from these two completely different treatments, across different colonies, was therefore consistent. We propose that under standardized conditions, a 1-day-old ant has a theoretical maximum target area of 11.17 cm², the highest value observed in our experimental framework.

      Lines 98-99: "one straightforward possibility would be that newborn ants are the ones that dig". This statement contradicts the results presented in Figures 1 and S1 - the population increase seems to occur at least a few days before increased excavation in nearly all cases.

      We apologize for any confusion caused by our initial phrasing. To clarify, we proposed that a lag likely exists between population growth and nest area expansion. This lag could arise from two sequential processes: (1) newborn ants require time to mature and become active (first delay), and (2) digging to expand the nest takes additional time (second delay; estimated at ~10 days from the cross-correlation analysis). Thus, our results suggest that it is not the population that lags behind the area, but rather the area that lags behind the population, as demonstrated in Figure 2D and SI Figure S1.
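
      The idea of reading a delay off a cross-correlation can be sketched as follows. The exact procedure used for the analysis is not described in this response, so this is a generic lagged Pearson correlation applied to synthetic series; the function names, the logistic population curve, the factor of 11, and the built-in 10-sample shift are all assumptions made purely for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def estimate_lag(population, area, max_lag):
    """Return the lag (in samples) maximising corr(population[t], area[t + lag]).

    A positive result means the area series follows the population series.
    """
    n = len(population)

    def score(lag):
        if lag >= 0:
            return pearson(population[: n - lag], area[lag:])
        return pearson(population[-lag:], area[: n + lag])

    return max(range(-max_lag, max_lag + 1), key=score)


# Synthetic example: logistic population growth, and an area series that
# repeats the population curve 10 samples later (hypothetical data).
population = [5 + 20 / (1 + math.exp(-(t - 60) / 8)) for t in range(180)]
area = [11 * population[max(t - 10, 0)] for t in range(180)]
lag = estimate_lag(population, area, max_lag=30)
```

      A positive estimated lag corresponds to the area following the population, which is the direction described above.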

      The sentence “one straightforward possibility would be that newborn ants are the ones that dig” is modified as below (line numbers : 112-119) to prevent further confusion.

      “One possible explanation is that, although all ants are capable of digging, it is primarily the newly emerged ants who perform this task. In this case, nest expansion would lag behind colony growth due to two delays: first, the time needed for young ants to mature enough to begin digging, and second, the physical time required to excavate additional space (e.g., around 10 days). This mechanism could eliminate the need for ants to assess overall colony density, as each new group of active workers simply enlarges the nest as they become ready. An alternative possibility is that all ants, regardless of age, respond to increased density by initiating excavation. In that scenario, nest expansion would follow more immediately after the emergence of new individuals, making delays less prominent (24, 29, 30)”.

      Line 105: How do group sizes compare to natural colony size? Line 106: How do "young" and "old" classifications compare to natural life expectancy?

      We have already addressed this question in an earlier comment. The details are added to the main text (line numbers: 124-127) and discussion (line numbers: 278-282).

      Line 118-119: How are nests artificially collapsed?

      We have added a new section in the Materials and Methods section that describes the nest collapsing procedure (Nest artificial collapse - line numbers : 386-399).

      Figure 2 Panel A: The white dotted line is nearly impossible to see. Please use a more visible color.

      We thank the reviewer for the comment.

      We changed the solid circles to violet and the dotted line to a continuous white line.

      Figure 3: The use of circle markers as post-collapse recovery in young and old as well as old pre-collapse is confusing. Use different symbols for old pre-collapse vs young and old post-collapse.

      We thank the reviewer for pointing out the confusion. We have revised the figure markers as suggested and modified the main text accordingly.

      • Young; pre-collapse : star

      • Young; post-collapse : diamond

      • Old; pre-collapse : circle

      • Old; post-collapse: triangle.

      Figure 3 Panel C: Indicate that fixed demographic values here are pre-collapse. Also, as presented, it appears that there is a large group-size dependence that is not commented on. Previous results (Line 87 and Figure 2C) suggest a constant excavation area per ant of 11cm2/ant. Figure 3, panel C appears to suggest a group-size dependence. If these values are divided by group size, is excavated area per ant nearly constant across groups? How does the numerical value compare to the slope from Figure 2C?

      We thank the reviewer for their insightful comments.

      First, we would like to clarify that the area target of 11.1 (±1) cm²/ant, as described in Line 87, was obtained from the colony maturation experiments. In these experiments, we were unable to track the age of each individual ant, so the area target was calculated by normalizing the total excavated area by the number of ants.

      We normalized the excavated area by the group size for both young and old colonies as suggested, and found that the area per ant was not significantly different across the group sizes (see new SI Fig. 5A). This indicates that the excavated area per ant remains relatively constant within each demographic group. Moreover, this shows that the total excavated area is proportional to group size, in agreement with previous works (24, 29, and 30). 

      We have explicitly described the above information in the line numbers: 142-146

      Regarding the slope comparisons, the slope of Figure 2C (10.71), from the colony maturation experiments, is the largest, followed by the area per ant from the short-term young (8.79 ± 0.98) cm²/ant, and short-term old experiments (5.16 ± 0.44) cm²/ant.

      Lines 128-129: "...younger ants aim to approach a higher target area". Seems hard to know what they "aim" to do... rephrase to report what they are observed to do.

      We thank the reviewer for the comment. The sentence is rephrased as suggested (line numbers : 158-161).

      “In the previous sections, we showed that in fixed-demographics experiments, younger ants excavated a significantly larger nest area compared to older ants (Fig. 3. C).  This difference emerged despite similar temporal patterns in digging rates across age groups, with excavation activity peaking within the first 7 days before asymptotically decaying as nest expansion approached saturation (SI Fig. 8).”

      Lines 133-141: The model description is not clear. Specifically, what parameters are ant-dependent? How does A relate to a?

      We appreciate the reviewer's request for clarification. In our model:

      (1) Equation 1 describes the change in the excavated area due to the digging activity of a single ant. Here, the variable 'a' represents the area excavated by one ant. This formulation allows us to capture the individual digging behavior and its impact on the excavation process.

      (2) Equation 2 extends this concept to the total area excavated in the nest, denoted by 'A'. Specifically, 'A' is the sum of the areas excavated by all ants present in the nest. In other words, it aggregates the individual contributions of each ant, linking the microscopic digging behavior to the macroscopic excavation dynamics.

      Therefore, the relationship between 'a' and 'A' is as follows:

      ●     'a' = Area excavated by a single ant.

      ●     'A' = ∑ 'a' (Summed over all ants in the nest).
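
      The relationship between 'a' and 'A' can be sketched numerically. Equation 2 is not reproduced in this response, so the code below only illustrates the stated summation: each ant is advanced with a forward-Euler step of Equation 1 and the per-ant areas are then added. The function names, the step size, and the target values passed in are hypothetical.

```python
def step_colony(areas, targets, r, dt):
    """One forward-Euler step of da/dt = r * (1 - a / a_age) for every ant."""
    return [a + dt * r * (1.0 - a / a_age) for a, a_age in zip(areas, targets)]


def simulate_total_area(targets, r, dt, n_steps):
    """Return A, the sum of per-ant excavated areas, after n_steps of size dt.

    targets -- per-ant target areas a_age (illustrative values from the caller)
    r       -- basal digging rate constant shared by all ants
    """
    areas = [0.0] * len(targets)
    for _ in range(n_steps):
        areas = step_colony(areas, targets, r, dt)
    return sum(areas)  # A = sum of a over all ants in the nest
```

      In this sketch the total area saturates at the sum of the individual targets, so a colony made up of younger ants (larger targets) ends with a larger nest than an equally sized colony of older ants.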

      We have explicitly mentioned this in the line numbers “ 161-179”, and describe the model assumptions and parameters in detail.

      Figure 4:

      Figure 4, Panel A: The equation quoted in the caption does not match the data in the figure. The equation has a positive slope and negative intercept, while the figure has a negative slope and a positive intercept. Please provide the correct equation and bounds on fit parameters.

      We thank the reviewer for spotting this typing mistake.

      The equation was already updated in the reviewed preprint published online. The correct equation and the fit bound are provided in the figure caption.

      “Target areas decrease linearly with the ant age (y = −0.032x + 11.22, 95 % CI (Slope: (−0.035, −0.027), Intercept: (10.53, 11.91)), R² = 0.96).”
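
      For readers who want to evaluate the fitted line, a minimal sketch follows. It assumes that x is worker age in days and y is the target area in cm² per ant, which is inferred from the surrounding text rather than stated in the quoted caption; the function names are hypothetical, and evaluating the line outside the sampled age range is an extrapolation.

```python
SLOPE = -0.032      # fitted slope from the quoted caption (assumed cm^2 per ant per day)
INTERCEPT = 11.22   # fitted intercept from the quoted caption (assumed cm^2 per ant)


def target_area(age_days):
    """Age-dependent target area per ant from the linear fit y = SLOPE * x + INTERCEPT."""
    return SLOPE * age_days + INTERCEPT


def age_at_zero_target():
    """Age at which the fitted line reaches zero; an extrapolation, not a measured value."""
    return -INTERCEPT / SLOPE


young_target = target_area(40)    # mean age of the young cohort mentioned in the response
old_target = target_area(200)     # approximate age of the old cohort mentioned in the response
```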

      Figure 4, Panel A: There seem to be three "fixed target area per ant values" in the paper: around 11cm2/ant (line 87), 11.6 cm2/ant (SI Figure 2), and linearly dependent value from fit to Figure 4A. The distinctions between these values and their significance are hard to keep track of. Can the authors add a discussion somewhere that helps the reader better understand? Is there a way to connect/rationalize/explain these different values in terms of demographics?

      We thank the reviewer for the suggestion. We have added a paragraph in the discussion (line numbers: 270-277) describing the area targets.

      “In our colony maturation experiments, we found that area per ant was highest when the workers were youngest, with values around 11.1–11.6 (±1–1.15). This aligns with observations from naturally growing nests, where newly eclosed ants dominate the population and nest volumes are relatively large. Supporting this, fixed-demographics experiments showed that the area excavated per ant declines linearly with worker age, indicating that the youngest ants contribute most to excavation. Notably, the target area we fit for the age-independent model (11.6 ± 1.15) closely matches the extrapolated value for very young workers (Fig. 4. A), reinforcing the idea that young ants are the primary excavators during early colony growth. In contrast, during events like collapses or displacement, when space is urgently needed, ants of all ages participate in excavation.”

      Figure 4, Panel A: What are various symbols and colors for data with error bars? If consistent with Figure 3, then this panel and subsequent model confound two factors: (1) the age dependence and (2) the behavioral differences pre- and post-collapse (structures are different pre-and post-collapse, according to SI Figure 6; line 120: "...colonies ceased digging when they recovered 93{plus minus}3% of the area lost by the manual collapse..."; lines 201-202: "We find significant quantitative and qualitative differences between nests constructed within this natural context and nests constructed in the context of an emergency") and behavior is different (according to SI Figure 7 and line 119: "...all ants dig after collapse...")). Therefore, without further supporting evidence, it does not seem that these data should be used to fit a single line that defines a model parameter a_age for each ant in equation 2.

      The symbols show the area per ant quantified from the young and old fixed-demographics experiments, as follows:

      A.  Star - Young, pre-collapse

      B.  Diamond - Young, post-collapse 

      C.  Circle - Old, pre-collapse

      D.  Triangle - Old, post-collapse.

      The details are clearly described in the figure caption. 

      We apologize to the reviewer for the confusion. We argue that the data can be fit by a single line to quantify the parameter ‘a_age’ as follows. 

      A. All data presented in Figure 4A were obtained from the same fixed-demographics experiments (containing only young and old ants) under experimental collapse conditions, pre- and post-collapse. These results, therefore, exclusively reflect nest-building behavior in emergency scenarios and do not include any observations from natural colony maturation processes.

      B. Age-dependent excavation differences: As correctly noted by the reviewer, the observed difference in excavated area before versus after collapse reflects the natural aging of ants in our experimental colonies. While colonies recovered >90% of lost area post-collapse, the residual variation was not negligible—instead, it systematically correlated with colony age structure. By tracking colonies across this demographic transition, we obtained additional data points spanning a broader developmental spectrum. This extended range strengthened our ability to detect and quantify the linear relationship between worker age and excavation output.

      C. The quoted sentence (lines 201-202, submitted version) refers to comparisons across all three experimental cases: (1) fixed-demographics young ants, (2) fixed-demographics old ants, and (3) the natural scenario (mixed-age colonies). Importantly, these comparisons are based on pre-collapse steady-state excavation areas, ensuring a consistent baseline across treatments. We highlight quantitative and qualitative differences between these distinct experimental groups, not between pre- and post-collapse phases within the same treatment. The pre- and post-collapse data within fixed-demographics groups were analyzed separately to avoid conflating aging effects with emergency responses.

      To avoid confusion, the whole paragraph in the discussion (line numbers : 253-260) is rephrased.

      In lines 201-202; “We find significant quantitative and qualitative differences between nests constructed within this natural context and nests constructed in the context of an emergency”. 

      Here, by natural context, we mean the nests excavated in the colony maturation experiments. We believe that it could have been confusing, and the sentence is modified as answered for the previous question. 

      Figure 4, Panel B: This uses the model with a_age determined by from Figure 4A and the life table (as shown in the supplemental), whereas the supplemental Figure SI 8 uses the fixed blue line a_age value for the model, which comes from the colony maturation experiments. The age-independent model in the supplemental fits the data better, yet the authors claim the supplemental model cannot be applied to the data because of their experimentally determined age-dependent target area. Given the age-independent target area model fits better, additional evidence/justification is needed to support the choice of the model.

      We agree with the reviewer that the age-independent model fits the data well. However, we believe that the fixed area target cannot be used to explain the excavation dynamics for the following reasons.

      We make an important assumption in our model: that the ants rely on local cues and that individual ants cannot distinguish between the fixed demographics and colony maturation experiments (line numbers: 161-166). Given this assumption, the ants cannot change their behavior between experiments, meaning the same model should fit all of our results. However, the fixed demographics experiments revealed a significant difference in the areas excavated by young vs. old cohorts, despite having the same group size. If the ants regulated the excavated area based on an age-independent constant density target model, then the excavated area in the fixed demographics of young and old colonies would have been similar. This discrepancy indicates that the target area per ant is not constant, as assumed in the age-independent density model (SI. Fig. 8). We emphasize that while the age-independent model provides a better fit for the excavated area in colony maturation experiments, the age-dependence of excavation is empirically supported by fixed-demographics experiments. Therefore, we implemented this age-dependence through a variable target area within the age-dependent model framework to explain excavation dynamics in the colony maturation experiments.

      These details are explicitly mentioned in the main text (line numbers : 187 - 198)
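      The argument above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the saturated area is the sum of per-ant target areas and that the age-dependent target declines linearly with age; the function names and the parameter values (`a0`, `slope`) are hypothetical and are not taken from the manuscript.

```python
def saturated_area_age_independent(group_size, target_area_per_ant):
    """Constant-density model: every ant has the same target area."""
    return group_size * target_area_per_ant


def saturated_area_age_dependent(ages_days, a0, slope):
    """Age-dependent model (assumed linear form): target area shrinks with age.

    a0 and slope are hypothetical parameters; the target is floored at zero.
    """
    return sum(max(a0 - slope * age, 0.0) for age in ages_days)


# Two cohorts of equal size but different age (ages taken from the text).
young_ages = [40] * 10
old_ages = [171] * 10

# Hypothetical parameter values, chosen only to show the qualitative effect.
A0, SLOPE, CONSTANT_TARGET = 10.0, 0.02, 8.0

indep_young = saturated_area_age_independent(len(young_ages), CONSTANT_TARGET)
indep_old = saturated_area_age_independent(len(old_ages), CONSTANT_TARGET)
dep_young = saturated_area_age_dependent(young_ages, A0, SLOPE)
dep_old = saturated_area_age_dependent(old_ages, A0, SLOPE)

print("age-independent:", indep_young, indep_old)
print("age-dependent:  ", dep_young, dep_old)
```

      Under these assumptions the age-independent model predicts identical areas for young and old cohorts of the same size, whereas the age-dependent model predicts a larger area for the young cohort, which is the qualitative pattern reported for the fixed-demographics experiments.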

      Figure 4, Panel C: Is this plot entirely from the model, or are the data points measured from experiments? Please label this more clearly.

      We apologize to the reviewer for the confusion.

      Figure 4C is based on the age-dependent digging model. We applied the model to population data from the long-term experiments (n = 22). By setting an age threshold of 56 days (since ants used in the short-term young experiment had an average age of 40 ± 16 days), we categorized the ants into young and old groups. We then quantified the area dug by the young ants, the queen, and the old ants in terms of the percentage of the total area excavated. We hypothesized that, because young ants have a lower digging threshold, they would perform the majority of the digging. We indeed confirm this in Figure 4C.

      This information is added to the main text and described in detail (line numbers: 200 - 208).
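      The categorization step can be sketched as follows. This is a simplified illustration, not the authors' code: the record layout (queen flag, age at the time of digging, area increment), the treatment of an ant aged exactly 56 days as young, and the example numbers are all assumptions.

```python
YOUNG_MEAN_AGE = 40  # days, from the text
YOUNG_AGE_SD = 16    # days, from the text
AGE_THRESHOLD = YOUNG_MEAN_AGE + YOUNG_AGE_SD  # 56 days


def classify(is_queen, age_days, threshold=AGE_THRESHOLD):
    """Label a digger as 'queen', 'young', or 'old'.

    Treating age == threshold as young is an assumption of this sketch.
    """
    if is_queen:
        return "queen"
    return "young" if age_days <= threshold else "old"


def percent_contribution(records, threshold=AGE_THRESHOLD):
    """Percentage of the total excavated area attributed to each group.

    records: iterable of (is_queen, age_days, area) tuples (hypothetical layout).
    """
    totals = {"young": 0.0, "old": 0.0, "queen": 0.0}
    for is_queen, age_days, area in records:
        totals[classify(is_queen, age_days, threshold)] += area
    grand_total = sum(totals.values())
    return {group: 100.0 * area / grand_total for group, area in totals.items()}


# Hypothetical model output, NOT data from the study.
example_records = [
    (True, 300, 2.0),
    (False, 10, 5.0),
    (False, 30, 4.0),
    (False, 56, 3.0),
    (False, 90, 1.0),
]
shares = percent_contribution(example_records)
print(shares)
```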

      Lines 162-165: "...Furthermore, we quantified the area dug by each ant in the normal colony growth experiment as estimated from the age-dependent model and found that all ants excavated more or less the same amount...". Figure 4D shows a distribution with significant values ranging from 1-16 cm²... how is this interpreted as "more or less the same amount" and what is the significance of this?

      We apologise to the reviewer for the confusion.

      We quantified the percentage contribution to the excavated area of each histogram bin (provided in the new SI table: 4), and found that the area excavated between 5 cm² and 13 cm² accounts for 73.76% of the total excavated area. This indicates that most ants dug within this range rather than exhibiting extreme variations. Additionally, the mean excavation amount is 7.84 cm², with a standard deviation of 3.44 cm², meaning that most values fall between 4.4 cm² and 11.28 cm², which aligns well with the 5–13 cm² range. Since the majority of the excavation is concentrated within this narrow interval, and the mean is well centered within it, this suggests that ants excavated more or less the same amount, rather than forming distinct groups with highly different excavation behaviors.
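      The arithmetic behind the quoted interval, and the kind of calculation behind the percentage share, can be sketched as below. The mean and standard deviation are the values quoted above; the per-ant areas are hypothetical placeholders, since the actual bin values are in SI table 4 and are not reproduced here.

```python
def one_sd_interval(mean, sd):
    """Return the interval (mean - sd, mean + sd)."""
    return mean - sd, mean + sd


def area_share_in_range(areas, low, high):
    """Percentage of the total area contributed by values within [low, high]."""
    total = sum(areas)
    inside = sum(a for a in areas if low <= a <= high)
    return 100.0 * inside / total


# Values quoted in the response (cm^2).
low, high = one_sd_interval(7.84, 3.44)
print(f"mean ± SD interval: {low:.2f} to {high:.2f} cm^2")

# Hypothetical per-ant excavated areas (cm^2), NOT the study's data.
example_areas = [2.0, 5.0, 6.5, 8.0, 9.0, 12.0, 15.0]
share = area_share_in_range(example_areas, 5.0, 13.0)
print(f"share of total area within 5-13 cm^2: {share:.2f}%")
```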

      We have modified the main text (line numbers: 209-216) to include these points.

      The biological significance of this finding is that since all ants in the colony maturation experiments are born inside the nest, we hypothesize that they should excavate similar amounts. To test this, we quantified the area contribution of each ant over the entire duration of the experiment using the age-dependent digging model as described above and found that they indeed excavated more or less the same amount. From our analysis of fixed demographics experiments, we showed that the youngest ants excavate the largest area. Since the majority of the youngest ants participated in the colony maturation experiments, this further supports our hypothesis.

      Figure 5.

      Figure 5, Panels A-C: Please provide a scale bar. 

      The scale bar is provided in the figure as suggested. The algorithm for the cutoffs for tunnel vs wide tunnels is described in detail in the section “Nest skeletonization, segmentation, and orientation.”

      Figure 5, Panel E: Why does the chamber error bar for 5 ants go to zero?

      In Figure 5E, we plot the standard error, as described in the figure caption. In the four experiments, the chamber area contributions were 0, 0, 39.94, and 0, respectively. The mean of these four numbers is 9.985, the standard deviation is 19.97, and the standard error is 9.985. Because the mean and the standard error are equal, the lower error bar goes to zero and the upper error bar goes to 19.97. This implies that in these experiments, the chamber area is often zero.
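      The numbers above can be checked with a short calculation. The sketch assumes the sample standard deviation (n − 1 denominator), which is the convention that reproduces the quoted value of 19.97; the variable names are illustrative.

```python
import math
import statistics

# Chamber area contributions from the four 5-ant experiments, as quoted above.
chamber_areas = [0.0, 0.0, 39.94, 0.0]

mean_area = statistics.mean(chamber_areas)
sd_area = statistics.stdev(chamber_areas)  # sample SD, n - 1 denominator
sem_area = sd_area / math.sqrt(len(chamber_areas))  # standard error of the mean

# Error bars drawn as mean ± standard error.
lower_bar = mean_area - sem_area
upper_bar = mean_area + sem_area

print(f"mean = {mean_area:.3f}, SD = {sd_area:.2f}, SE = {sem_area:.3f}")
print(f"error bar spans {abs(lower_bar):.3f} to {upper_bar:.2f}")
```

      With three zeros and one non-zero value among four samples, the standard error equals the mean, so the lower error bar necessarily touches zero.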

      Figure 5, Panel I: Why are there no chambers for young colonies in I when they are in the histogram in E?

      We apologize to the reviewer for the confusion. We initially missed adding the chamber orientation data of the young colonies to Panel I, but it has now been included.

      Line 212: "...densities of ants never become too high...". What is too high? Is there some connection to biological or physical constraints?

      Under normal growth conditions, nest volume is kept proportional to the number of ants, ensuring that the density remains within a specific range. This prevents overcrowding, which could otherwise lead to excessively high densities.

      Yes, we believe there is likely a connection to both biological and physical constraints. The proportional relationship between nest volume and the number of ants is likely driven by factors such as:

      (1) Biological Constraints:

      Ant Colony Size: Ants typically adjust their behavior and social structure to maintain an optimal population size relative to available resources and space. Overcrowding could potentially lead to a breakdown in colony function.

      Colony Health: High densities can lead to faster epidemic spread, leading to negative effects on reproduction, foraging efficiency, and overall colony health. By maintaining density within a specific range, the colony can thrive without these adverse effects.

      (2) Physical Constraints:

      Spatial Limitations: The physical space within the nest limits how many ants can occupy it before space becomes constrained. The nest’s structure and size must physically accommodate the ants, and the volume must be large enough to prevent overcrowding and to allow efficient resource distribution.

      Lines 272 and 302: How often were photos taken? These two statements seem to suggest different data collection rates.

      As stated in line 272, photos were taken every 1 to 3 days. During each photo session, four photos were taken, with each photo separated by 2 seconds, as mentioned in line 302. To avoid confusion, we rephrased the sentence (line numbers: 359-361).

      “We photographed the nest development every 1-3 days. During each photography session, four pictures of the nest were taken, with a 2-second interval between each.”

      Reviewer #2 (Recommendations for the authors):

      Some more minor points/questions/clarifications:

      This might be pedantic, but I don't think the nest serves as the skeleton of the superorganism, while it does change and grow, the analogy becomes weak beyond that point. The skeleton serves to protect the internal organs of the organism, facilitates movement and muscle attachment, and creates new blood cells. I would be more comfortable with a statement that the nest can grow or shrink according to need.

      We sincerely thank the reviewer for their time and effort in providing a detailed review and assessment of our manuscript. A point-by-point response to the comments is provided below.

      The analogy between the nest structure and the skeleton of a superorganism was based on the following points:

      (a) Protection: A nest protects the colony on a collective scale. This is analogous to protecting "organs" by a skeletal framework.

      (b) Organization and Division of Space: The skeletal structure organizes the body's internal layout, just as nest structures are organized into various spatial compartments for various colony functions, with specific regions designated for brood chambers, food storage, and waste disposal.

      Thus, we believe that the analogy can still be valid in a metaphorical way.

      Does this statement need justification with a citation, or is that information contained in the subsequent clause? "However, for more complex structures where ants congregate in specific chambers, workers are less likely to assess the overall nest density." The idea that workers do (or do not) assess overall density touches on many issues, including that of perfect information and adaptive responses, that it seems it needs to be well founded in previous work to be stated in such unequivocal terms.

      We thank the reviewer for this comment. The references for this argument are provided in the next sentence. We have now moved these references to the relevant sentence (reference numbers: 24, 29, 30; line numbers: 30-31).

      Can you give some more information on this statement? "Experiments were terminated either when the queen died or when she became irreversibly trapped after a structural collapse." Why was this collapse irreversible and therefore unlike treatment 2? Did the queen die in these instances? Was this event more likely than in natural colonies? And if so, was there something inherently different about your experiments that limit interpretation under natural conditions (e.g. the narrow nature of the observation setup? The consistency of the sand?)

      Our nest excavation experiments were terminated under two primary scenarios: (1) the queen died of natural causes, reflecting the baseline mortality expected when queens are brought into laboratory conditions, or (2) the nest experienced a structural collapse that left the queen irreversibly trapped. The second scenario is further elaborated below:

      Irreversible Collapses: These collapses were classified as irreversible because the queen could not be rescued alive. This occurred when the structural stability of the nest failed, burying the queen in a manner that prevented recovery. In some cases, the collapse resulted in the queen's immediate death, while in others, she was trapped beyond reach, and any rescue attempt risked further structural damage.

      Collapse and Experimental Context: These collapses were not uniquely associated with natural colonies or fixed-demographic experiments; rather, they occurred across various experimental setups.

      The sentence is modified as below to improve clarity (line numbers : 70-72 ).

      “In all instances where a collapse resulted in the queen's death or her being irreversibly trapped in the nest, the experiment was excluded from analysis starting from the point of the collapse, as such events did not reflect normal colony dynamics.”

      I want to make sure I understand the following statement: "Moreover, the area excavated by the young cohorts was similar to that excavated by naturally maturing colonies at the point in which they reached the same population size (Tukey's HSD; group size: 5; p = 0.61, group size: 10; p = 0.46, group size: 15; p = 0.20)." Do I have it right that this means a group of (e.g. 10) young ants excavates an area similar to that of a group of 10 naturally maturing ants at the same age as the young ants?

      Yes, the interpretation provided is correct. We apologize to the reviewer for the confusion. We have rephrased the sentence for better readability (line numbers : 146-148).

      “Furthermore, the area excavated by the young cohorts was comparable to that excavated by naturally maturing colonies when they reached the same population size (Tukey's HSD; group size: 5, p = 0.61; group size: 10, p = 0.46; group size: 15, p = 0.20)”

      How old do ants get? Is the 'old' demographic (~200 days) meaningfully old in the context of the overall worker lifespan? While the results certainly demonstrate there is an age effect, I would like to understand how rapid this is in terms of overall lifespan.

      The lifespan of ants, including both queens and workers, varies significantly based on caste, species, and environmental conditions.

      (1) Queen Longevity: From the literature, Camponotus fellah queens can live up to 20 years, with one documented case reaching 26 years. This remarkable longevity underscores the queen's central role in maintaining the colony.

      (2) Worker Longevity: In contrast to queens, the lifespan of workers is much shorter.

      However, specific data on worker longevity in Camponotus fellah colonies are lacking. Studies on other Camponotus species (50, 82) suggest that workers can live for several months depending on environmental conditions, colony health, and caste-specific roles (e.g., minor vs. major workers).

      (3) Laboratory vs. Natural Conditions: Worker longevity is highly variable between laboratory and natural conditions.

      Therefore, we believe that the age of the old workers in our experiments, ~200 days (roughly 6–7 months), represents a substantial portion of a worker's expected lifespan. While exact figures for C. fellah workers are unavailable, inferences from related species suggest that workers nearing 200 days are approaching the latter stages of their lifespan, making them meaningfully "old."

      These details are added to the main text (line numbers : 124 - 127) and to the discussion (line numbers : 278-282)

      Reviewer #3 (Recommendations for the authors):

      We sincerely thank the reviewer for their time and effort in providing a detailed review and assessment of our manuscript. A point-by-point response to the comments is provided below.

      L10: "fixed demographics": I find this term unclear, what does it mean, it should specify if the groups are with or without a queen.

      We thank the reviewer for the comment. The sentence is modified in the abstract, and definitions are later added in detail in the introduction (line numbers : 8-10) and the Materials and Methods section (Fixed demographics colonies). 

      “We experimentally compared nest excavation in colonies seeded from a single mated queen and allowed to grow for six months to excavation triggered by a catastrophic event in colonies with fixed demographics, where the age of each individual worker, including the queen, is known”.

      The details of the “fixed demographics” treatments were explained in the later portion of the text (line numbers: 58-61).

      L36: I think it is documented that younger individuals are the ones involved in nest construction in many species.

      Previous studies on nest construction were predominantly performed on mature colonies of specific age demographics or rather mixed demographics, where age was not considered as a factor influencing nest construction. Some studies have speculated that young ants could be the most probable ones to dig, but this has not been experimentally verified to the best of our knowledge.

      L50: I do not think the colony should be called mature after only 6 months, given that colonies reach thousands of workers.

      The sentence is changed as suggested (line numbers : 56-57).

      “The "Colony-Maturation" experiment observed the development of colonies up to six months, starting from a single fertile queen and progressing to colonies with established worker populations.” 

      L60: Where was the queen introduced? It is specified in the Methods but a word here would be helpful.

      The detail is added as suggested (line numbers : 68-69).

      “We initiated colony maturation experiments by introducing a single mated queen and several brood items (n = 5, across all experiments) at random positions on the soil layer of the nest.”

      L106: Young vs Old workers 40 vs 171 days. Maybe cite a reference or provide a reason for the selection of those ages?

      Previous studies have shown that Camponotus fellah queens can live up to 20 years, with one documented case reaching 26 years (50). To the best of our knowledge, specific data on worker longevity in Camponotus fellah colonies in natural conditions are lacking. Lab studies on Camponotus fellah (82) and other Camponotus species (50) suggest that workers can live for several months depending on environmental conditions, colony health, and caste-specific roles (e.g., minor vs. major workers).

      We intentionally selected workers from two distinct age groups: younger ants (40 ± 16 days old) and older ants (171.56 ± 20 days old). These ages represent functionally different life stages - the younger group had completed about 25% of their expected lifespan at the start of the experiment, while the older group had lived through most of theirs (50, 82). This roughly 4-fold age difference allowed us to compare excavation behaviors across fundamentally different phases of adult life.

      Our experiments lasted for 60-90 days, during which all participating workers continued to age. To ensure all ants remained alive throughout the experiments, and given the constraints of the experimental timeline, we selected young and old workers within the specified age range. 

      These details are added to the main text (line numbers :  124 -127), and the discussion (line numbers  : 278-282)

      L122-123: But usually ants can vary highly in their behaviours. Can the authors comment on their choice to consider an average, implying that all ants of the same age had the same digging rates?

      We thank the reviewer for the comment.

      In our experiments, we could not track each worker's activity over time. As described in the methods, we took snapshots of the nest structure over days and recorded the population size of the nest. Thus, we could not capture the activity of single ants in the nest as described in the response to major comments in the reviewed preprint.

      We agree that individual tracking of ants within our experimental setup would have been the ideal approach, as it would have allowed us to take the inter-individual variability of digging activity into account. However, we were prevented from doing so by technical and practical limitations of the setup, such as:

      (a) Continuous tracking of ants in our nests would have required a camera to be positioned at all times in front of the nest, which necessitates a light background. Since Camponotus fellah ants are subterranean, we aimed to allow them to perform nest excavation in conditions as close to their natural dark environment as possible. Additionally, implementing such a system in front of each nest would have reduced the sample sizes for our treatments.

      (b) Our colony maturation and fixed demographics experiments lasted up to six months (unprecedented durations for these kinds of measurements). With the current design, this naturally limited our ability to conduct individual tracking while maintaining the identity of each ant.

      To clarify this, we have added the following to the discussion (line numbers: 286-292).

      “Previous studies have demonstrated both homogeneous and heterogeneous workload distribution, with varying digging rates among ants (24,29,30,35). Studies showing heterogeneous workload distribution relied on continuous individual tracking of ants to quantify digging rates (35). However, this approach was not feasible in our current design due to the experimental durations of both our colony maturation and fixed demographics experiments. Additionally, sample size requirements naturally limited our ability to conduct continuous individual tracking during nest construction in our study.”

      L171: A line on how the nest structure was acquired and data extracted would be welcome here.

      The algorithm for the nest structure segmentation, data extraction, and analysis is added in detail to the SI section: Nest skeletonization, segmentation, and orientation. The line is modified (line numbers : 221-224) in the main text as suggested.

      “We compared nest architectures by segmenting raw nest images into chambers and tunnels (see SI Section: Nest Skeletonization, Segmentation, and Orientation). Chambers were identified as flat, horizontal structures, while tunnels were narrower and more vertical in orientation (see SI Fig. 9, SI Section: Nest Skeletonization, Segmentation, and Orientation)”.  

      Figure 3: Where does the data of the mean in panel C come from: is it the mean of the first 30 days, before the collapse? How is it comparable with the rest?

      We apologize to the reviewer for the confusion.

      In panel C, the mean values (solid stars and circles) for fixed-demography colonies (young/old groups) represent pre-collapse excavation areas. For colony maturation experiments (where no collapses were induced), we instead plot the mean saturated excavation area for each group size. This allows direct comparison of mean excavated areas across experimental conditions at equivalent colony sizes.

      To improve readability, the following sentences are added to the main text (line numbers : 139 - 146 ) 

      “We compared the saturated excavation areas (pre-collapse) from fixed-demographics experiments (young and old groups) with those from colony maturation experiments of the same colony sizes (Fig. 3C). We find that, for a given age cohort (young or old), the saturation areas increase linearly with the colony size (GLMM, F(35,37); p < 0.0001) (Fig. 3 C, SI. Fig 7 A). The observed proportional scaling between excavated area and group size aligns with previous studies, even though those studies did not explicitly account for age demographics (24, 29, 30). After normalizing the pre-collapse excavated area by group size for both young and old colonies, we found no significant difference in area per ant across group sizes (SI Fig. 5. A). This indicates that the excavated area per ant remains relatively constant within each demographic group”.

      L209-210: I would be more parsimonious in saying that the results presented prove that the target area decreases with age, as the individual behaviour of the ants was not monitored. Suggestion: rephrase to "the target of the group decreases with age".

      The sentence is rephrased as suggested (line numbers : 265-266).

      “Our results reveal that this target area of the group decreases linearly with age, such that young ants are more sensitive to shortages in space.”

      L246: Are C.fellah colonies really found with such few workers?

      Previous studies have suggested that Camponotus fellah is a monogynous species whose colonies are typically founded by a single queen following nuptial flights (50,51,82), and that mature colonies can range from tens to thousands of workers. However, during the founding stage (as in our experiments), colonies naturally pass through developmental sizes smaller than those of mature colonies.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      This work addresses an important question in the field of Drosophila aggression and mating: prior social isolation is known to increase aggression in males by increased lunging, which is suppressed by group housing (GH). However, it is also known that single-housed (SH) males, despite their higher attempts to court females, are less successful. Here, Gao et al. developed a modified aggression assay to address this issue by recording aggression in Drosophila males for 2 hours over a virgin female that is immobilized by burying its head in the food. They found that while SH males frequently lunge in this assay, GH males switch to higher-intensity but very low-frequency tussling. Constitutive neuronal silencing and activation experiments implicate cVA-sensing Or67d neurons in promoting high-frequency lunging, similar to earlier studies, whereas Or47b neurons promote low-frequency but higher-intensity tussling. Using optogenetic activation, they found that three pairs of pC1 neurons (pC1SS2) increase tussling. While P1a neurons, previously implicated in promoting aggression and courtship, did not increase tussling in optogenetic activation (in the dark), they could promote aggressive tussling in thermogenetic activation carried out in the presence of visible light. It was further suggested, using a further modified aggression assay, that GH males use increased tussling and are able to maintain territorial control, providing them a mating advantage over SI males, and this may partially overcome the effect of aging in GH males.

      Strengths

      Using a series of clever neurogenetic and behavioral approaches, subsets of ORNs and pC1 neurons were implicated in promoting tussling behaviors. The authors devised a new paradigm to assay for territory control which appears better than earlier paradigms that used a food cup (Chen et al, 2002), as this new assay is relatively clutter-free, and can be eventually automated using computer vision approaches. The manuscript is generally well-written, and the claims made are largely supported by the data.

      Thank you for your precise summary of our study and for your positive assessment of its novelty and significance.

      Weaknesses

      I have a few concerns regarding some of the evidence presented and claims made as well as a description of the methodology, which needs to be clarified and extended further.

      (1) Typical paradigms for assaying aggression in Drosophila males last for 20-30 minutes in the presence of nutritious food/yeast paste/females or all of these (Chen et al. 2002, Nilsen et al., 2004, Dierick et al. 2007, Dankert et al., 2009, Certel & Kravitz 2012). The paradigm described in Figure 1 A, while important and more amenable for video recording and computational analysis, seems a modification of the assay from Kravitz lab (Chen et al., 2002), which involved using a female over which males fight on a food cup. The modifications include a flat surface with a central food patch and a female with its head buried in the food, (fixed female) and much longer adaptation and recording times respectively (30 minutes, 2 hours), so in that sense, this is not a 'new' paradigm but a modification of an existing paradigm and its description as new should be appropriately toned down. It would also be important to cite these earlier studies appropriately while describing the assay.

      We have now toned down the description of the paradigm and cited more related references.

      (2) Lunging is described as a 'low intensity' aggression (line 111 and associated text), however, it is considered a mid to high-intensity aggressive behavior, as compared to other lower-intensity behaviors such as wing flicks, chase, and fencing. Lunging therefore is lower in intensity 'relative' to higher intensity tussling but not in absolute terms and it should be mentioned clearly.

      We have modified the description as suggested.

      (3) It is often difficult to distinguish faithfully between boxing and tussling and therefore, these behaviors are often clubbed together as box, tussle by Nilsen et al., 2004 in their Markov chain analysis as well as a more detailed recent study of male aggression (Simon & Heberlein, 2020). Therefore, authors can either reconsider the description of behavior as 'box, tussle' or consider providing a video representation/computational classifier to distinguish between box and tussle behaviors.

      Indeed, we could not faithfully distinguish boxing and tussling. To address this concern, we have now made textual changes in the Results section: “We occasionally observed the high-intensity boxing and tussling behavior in male flies, which are difficult to distinguish and hereafter simply referred to as tussling.”

      We also added this information in the Materials and Methods section: “Tussling is often mixed with boxing, in which both flies rear up and strike the opponent with forelegs. Since boxing is often transient and difficult to distinguish from tussling, we referred to the mixed boxing and tussling behavior simply as tussling.”

      (4) Simon & Heberlein, 2020 showed that increased boxing & tussling precede the formation of a dominance hierarchy in males, and lunges are used subsequently to maintain this dominant status. This study should be cited and discussed appropriately while introducing the paradigm.

      We have now cited this important study in both the Introduction and Discussion sections.

      (5) It would be helpful to provide more methodological details about the assay, for instance, a video can be helpful showing how the males are introduced in the assay chamber, are they simply dropped to the floor when the film is removed after 30 minutes (Figures 1-2)?

      We have now provided a more detailed description of the behavioral assays and how we analyzed them. For example: “All testers were loaded by cold anesthesia. After a 30-minute adaptation, the film was gently removed to allow the two males to fall into the behavioral chamber, and the aggressive behavior was recorded for 2 hours.”

      (6) The strain of Canton-S (CS) flies used should be mentioned as different strains of CS can have varying levels of aggression, for instance, CS from Martin Heisenberg lab shows very high levels of aggressive lunges. Are the CS lines used in this study isogenized? Are various genetic lines outcrossed into this CS background? In the methods, it is not clear how the white gene levels were controlled for various aggression experiments as it is known to affect aggression (Hoyer et al. 2008).

      We used wtcs flies from the Baker lab at Janelia Research Campus, and we are not sure where they originated. We appreciate your concern about the use of wild-type strains, as they may show different fighting levels, but this study mainly used wild-type strains to compare behavioral differences between SH and GH males. All flies tested in this study are in a w+ background, based on w+ balancer flies, but were not backcrossed. We have listed detailed genotypes of all tested flies in Table S1 in the revised manuscript.

      (7) How important is it to use a fixed female for the assay to induce tussling? Do these females remain active throughout the assay period of 2.5 hours? Is it possible to use decapitated virgin females for the assay? How will that affect male behaviors?

      We used a fixed female to restrict it to the center of the food. These females remain active throughout the assay, as their legs and abdomens can still move. This design is intended to combine the attractive effects of both the female and the food. One can also use decapitated females, but in that case males can push the decapitated female anywhere in the behavioral chamber. The rationale for using fixed females has now been added to the Materials and Methods section of the revised manuscript.

      (8) Raster plots in Figure 2 suggest a complete lack of tussling in SH males in the first 60 minutes of the encounter, which is surprising given the longer duration of the assay as compared to earlier studies (Nilsen et al. 2004, Simon & Heberlein, 2020 and others), which are able to pick up tussling in a shorter duration of recording time. Also, the duration for tussling is much longer in this study as compared to shorter tussles shown by earlier studies. Is this due to differences in the paradigm used, strain of flies, or some other factor? While the bar plots in Figure 2D show some tussling in SH males, maybe an analysis of raster plots of various videos can be provided in the main text and included as a supplementary figure to address this.

      Indeed, tussling is very low in SH males in our paradigm, which may be due to different genetic backgrounds and behavioral assays. Since tussling behavior is a rare fighting form, it is not surprising to see variation between studies from different labs. Nevertheless, this study compared tussling behaviors in SH and GH males, and our finding that GH males show much more tussling behaviors is convincing. The longer duration of tussling in our paradigm may also be due to the modified behavioral paradigm, which also supports that tussling is a high-level fighting form.

      (9) Neuronal activation experiments suggesting the involvement of pC1SS2 neurons are quite interesting. Further, the role of P1a neurons was demonstrated to be involved in increasing tussling in thermogenetic activation in the presence of light (Figure 4, Supplement 1), which is quite important as the role of vision in optogenetic activation experiments, which are required to be carried out in the dark, is often not mentioned. However, in the discussion (lines 309-310) it is mentioned that pC1SS2 neurons are 'necessary and sufficient' for inducing tussling. Given that P1a neurons were shown to be involved in promoting tussling, this statement should be toned down.

      Thank you for this important comment. We have now toned down the statement on pC1SS2 function.

      (10) Are Or47b neurons connected to pC1SS2 or P1a neurons?

      We conducted pathway analysis in the FlyWire electron microscopy database to investigate the connection between Or47b neurons and pC1 neurons. The results indicate that at least three levels of interneurons are required to establish a connection from Or47b neurons to pC1 neurons. Although the FlyWire database currently only contains neuronal data from female brains, it provides a reference for circuit connectivity in males.

      (11) The paradigm for territory control is quite interesting and subsequent mating advantage experiments are an important addition to the eventual outcome of the aggressive strategy deployed by the males as per their prior housing conditions. It would be important to comment on the 'fitness outcome' of these encounters. For instance, is there any fitness advantage of using tussling by GH males as compared to lunging by SH males? The authors may consider analyzing the number of eggs laid and eclosed progenies from these encounters to address this.

      Thank you for this suggestion. We agree with you and other reviewers that increased tussling behaviors correlate with better mating competition, but it is difficult for us to make a direct link between them. Thus, in the revised manuscript, we prefer to tone down this statement rather than expand on this part.

      Reviewer #2 (Public review):

      Summary

      Gao et al. investigated the change of aggression strategies by social experience and its biological significance using Drosophila. Two modes of inter-male aggression in Drosophila are known: lunging, a high-frequency but weak mode, and tussling, a low-frequency but more vigorous mode. Previous studies have mainly focused on lunging. In this paper, the authors developed a new behavioral experiment system for observing tussling behavior and found that tussling is enhanced by group rearing while lunging is suppressed. They then searched for neurons involved in the generation of tussling. Although olfactory receptors named Or67d and Or65a have previously been reported to function in the control of lunging, the authors found that these neurons do not function in the execution of tussling, and another olfactory receptor, Or47b, is required for tussling, as shown by the inhibition of neuronal activity and the gene knockdown experiments. Further optogenetic experiments identified a small number of central neurons pC1[SS2] that induce the tussling specifically. In order to further explore the ecological significance of the aggression mode change in group rearing, a new behavioral experiment was performed to examine territorial control and mating competition. Finally, the authors found that differences in the social experience (group vs. solitary rearing) are important in these biologically significant competitions. These results add a new perspective to the study of aggressive behavior in Drosophila. Furthermore, this study proposes an interesting general model in which the social experience-modified behavioral changes play a role in reproductive success.

      Strengths

      A behavioral experiment system that allows stable observation of tussling, which could not be easily analyzed due to its low frequency, would be very useful. The experimental setup itself is relatively simple, just the addition of a female to the platform, so it should be applicable to future research. The finding about the relationship between the social experience and the aggression mode change is quite novel. Although the intensity of aggression changes with the social experience was already reported in several papers (Liu et al., 2011, etc), the fact that the behavioral mode itself changes significantly has rarely been addressed and is extremely interesting. The identification of sensory and central neurons required for the tussling makes appropriate use of the genetic tools and the results are clear. A major strength of the neurobiology in this study is the finding that another group of neurons (Or47b-expressing olfactory neurons and pC1[SS2] neurons), distinct from the group of neurons previously thought to be involved in low-intensity aggression (i.e. lunging), function in the tussling behavior. Further investigation of the detailed circuit analysis is expected to elucidate the neural substrate of the conflict between the two aggression modes.

      Thank you for the acknowledgment of the novelty and significance of the study, and your suggestions for improving the manuscript.

      Weaknesses

      The experimental systems examining the territory control and the reproductive competition in Figure 5 are novel and have advantages in exploring their biological significance. However, at this stage, the authors' claim is weak since they only show the effects of age and social experience on territorial and mating behaviors, but do not experimentally demonstrate the influence of aggression mode change itself. In the Abstract, the authors state that these findings reveal how social experience shapes fighting strategies to optimize reproductive success. This is the most important perspective of the present study, and it would be necessary to show directly that the change of aggression mode by social experience contributes to reproductive success.

      We agree that our data did not directly show that it is the change of aggression mode that results in territory and reproductive advantages in GH males. To address the concern, we have toned down the statement throughout the manuscript. For example, we made textual changes in the abstract as follows:

      Moreover, shifting from lunging to tussling in socially enriched males is accompanied with better territory control and mating success, mitigating the disadvantages associated with aging. Our findings identify distinct sensory and central neurons for two fighting forms and suggest how social experience shapes fighting strategies to optimize reproductive success.

      In addition, a detailed description of the tussling is lacking. For example, the authors state that the tussling is less frequent but more vigorous than lunging, but while experimental data are presented on the frequency, the intensity seems to be subjective. The intensity is certainly clear from the supplementary video, but it would be necessary to evaluate the intensity itself using some index. Another problem is that there is no clear explanation of how to determine the tussling. A detailed method is required for the reproducibility of the experiment.

      Thank you for this important suggestion. We have now analyzed the duration of tussling and lunging, and found that a lunging event is often very short (less than 0.2 s), while a tussling event may last from seconds to minutes. These new data are added as Figure 2G. In addition, we have provided more detailed methods regarding tussling behavior.

      Reviewer #3 (Public review):

      In this manuscript, Gao et al. presented a series of intriguing data that collectively suggest that tussling, a form of high-intensity fighting among male fruit flies (Drosophila melanogaster) has a unique function and is controlled by a dedicated neural circuit. Based on the results of behavioral assays, they argue that increased tussling among socially experienced males promotes access to resources. They also concluded that tussling is controlled by a class of olfactory sensory neurons and sexually dimorphic central neurons that are distinct from pathways known to control lunges, a common male-type attack behavior.

      A major strength of this work is that it is the first attempt to characterize the behavioral function and neural circuit associated with Drosophila tussling. Many animal species use both low-intensity and high-intensity tactics to resolve conflicts. High-intensity tactics are mostly reserved for escalated fights, which are relatively rare. Because of this, tussling in the flies, like high-intensity fights in other animal species, has not been systematically investigated. Previous studies on fly aggressive behavior have often used socially isolated, relatively young flies within a short observation duration. Their discovery that 1) older (14-days-old) flies tend to tussle more often than younger (2-days-old) flies, 2) group-reared flies tend to tussle more often than socially isolated flies, and 3) flies tend to tussle at a later stage (mostly ~15 minutes after the onset of fighting), are the result of their creativity to look outside of conventional experimental settings. These new findings are keys for quantitatively characterizing this interesting yet under-studied behavior.

      Precisely because their initial approach was creative, it is regrettable that the authors missed the opportunity to effectively integrate preceding studies in their rationale or conclusions, which sometimes led to premature claims. Also, while each experiment contains an intriguing finding, these are poorly related to each other. This obscures the central conclusion of this work. The perceived weaknesses are discussed in detail below.

      Thank you for the precise summary of the key findings and novelty of the study, and your insightful suggestions.

      Most importantly, the authors' definition of "tussling" is unclear because they did not explain how they quantified lunges and tussling, even though the central focus of the manuscript is behavior. Supplemental movies S1 and S2 appear to include "tussling" bouts in which 2 flies lunge at each other in rapid succession, and supplemental movie S3 appears to include bouts of "holding", in which one fly holds the opponent's wings and shakes vigorously. These cases raise a concern that their behavior classification is arbitrary. Specifically, lunges and tussling should be objectively distinguished because one of their conclusions is that these two actions are controlled by separate neural circuits. It is impossible to evaluate the credibility of their behavioral data without clearly describing a criterion of each behavior.

      Thank you for this very important suggestion. We have now provided a more detailed description of the two fighting forms in the Materials and Methods section. See below:

      Lunging is characterized by a male raising its forelegs and quickly striking the opponent, and each lunge typically lasts less than 0.2 seconds through detailed analysis. Tussling is characterized by both males using their forelegs and bodies to tumble over each other, and this behavior may last from seconds to minutes. Tussling is often mixed with boxing, in which both flies rear up and strike the opponent with forelegs. Since boxing is often transient and difficult to distinguish from tussling, we referred to the mixed boxing and tussling behavior simply as tussling. As we manually analyze tussling for 2 hours for each pair of males, it is possible that we may miss some tussling events, especially those quick ones.

      It is also confusing that the authors completely skipped the characterization of the tussling-controlling neurons they claimed to have identified. These neurons (a subset of so-called pC1 neurons labeled by previously described split-GAL4 line pC1SS2) are central to this manuscript, but the only information the authors have provided is its gross morphology in a low-resolution image (Figure 4D, E) and a statement that "only 3 pairs of pC1SS2 neurons whose function is both necessary and sufficient for inducing tussling in males" (lines 310-311). The evidence that supports this claim isn't provided. The expression pattern of pC1SS2 neurons in males has been only briefly described in reference 46. It is possible that these neurons overlap with previously characterized dsx+ and/or fru+ neurons that are important for male aggressions (measured by lunges), such as in Koganezawa et al., Curr. Biol. 2016 and Chiu et al., Cell 2020. This adds to the concern that lunge and tussling are not as clearly separated as the authors claim.

      Thank you very much for this important question. Indeed, there are many experiments that could be done to better understand the function of pC1SS2 neurons, and we provide only an initial characterization of them due to the limited scope of this study. My lab has been focused on studying P1/pC1 function in both male and female flies and will continue to do so.

      To partially address your concern, we made the following revisions:

      (1) We provided higher-resolution images of P1a and pC1SS2 (Figure 4C-4E). While their cell bodies are very close, they project to distinct brain regions, in addition to some shared ones.

      (2) By staining these neurons with GFP and co-staining with anti-FruM or anti-DsxM antibodies, we showed that P1a neurons are partially FruM-positive and partially DsxM-positive, while pC1SS2 neurons are DsxM-positive and FruM-negative (Figure 5A-5D).

      (3) As pC1SS2 neurons are DsxM-positive and FruM-negative, we also examined how DsxM regulates the development of these neurons. We found that knocking down DsxM expression in pC1SS2 neurons using RNAi significantly affected pC1 development with regard to both cell numbers (Figure 5G) and their projections (Figure 5H).

      (4) We further found that DsxM in pC1SS2 neurons is crucial for executing their tussling-promoting function, as optogenetic activation of these neurons with DsxM knockdown failed to induce tussling behavior in the initial activation period, and induced a much lower level of tussling in the second activation period compared to control males (Figure 5I-5K).

      (5) While it is very difficult to identify the upstream and downstream neurons of P1a and pC1SS2 neurons, we made an initial step by utilizing trans-tango and retro-Tango to visualize potential downstream and upstream neurons of P1a and pC1SS2 (Figure 4-figure supplement 2), which certainly needs future investigation.  

      While their characterizations of tussling behaviors in wild-type males (Figures 1 and 2) are intriguing, the remaining data have little link with each other, making it difficult to understand what their main conclusion is. Figure 3 suggests that one class of olfactory sensory neurons (OSN) that express Or47b is necessary for tussling behavior. While the authors acknowledged that Or47b-expressing OSNs promote male courtship toward females presumably by detecting cuticular compounds, they provided little discussion on how a class of OSN can promote two different types of innate behavior. No evidence of a functional or circuitry relationship between the Or47b pathway and the pC1SS2 neurons was provided. It is unclear how these two components are relevant to each other.

      It has been previously found that Or47b-expressing ORNs respond to fly pheromones common to both sexes, and that group-housing enhances their sensitivity. Regarding how Or47b ORNs promote two different types of innate behaviors, a simple explanation is that they act on multiple second-order and further downstream neurons to regulate both courtship and aggression, not to mention that the neural circuitries for courtship and aggression are partially shared. We did not include this in the discussion as we would like to focus on aggression modes, and on how different ORNs (Or47b and Or67d) mediate distinct aggression modes.

      Regarding the relationship between Or47b ORNs and pC1SS2 neurons, or more generally between ORNs and P1/pC1, it is interesting and important to explore, but probably in a separate study. We conducted pathway connection analyses from Or47b to pC1 using the FlyWire database, and found that Or47b neurons can act on pC1 neurons via three layers of interneurons. Although the FlyWire database currently only contains neuronal data from female brains, it can provide a certain degree of reference. We hope the editor and reviewers will agree with us that identifying the intermediate neurons involved in this connection is beyond the scope of this study.

      Lastly, the rationale of the experiment in Figure 5 and the interpretation of the results is confusing. The authors attributed a higher mating success rate of older, socially experienced males over younger, socially isolated males to their tendency to tussle, but tussling cannot happen when one of the two flies is not engaged. If, for instance, a socially isolated 14-day-old male does not engage in tussling as indicated in Figure 2, how can they tussle with a group-housed 14-day-old male? Because aggressive interactions in Figure 5 were not quantified, it is impossible to conclude that tussling plays a role in copulation advantage among pairs as authors argue (lines 282-288).

      Indeed, we do not have direct evidence to show that it is tussling that makes socially experienced males dominate over socially isolated males. To address your concern, we have made the following revisions:

      (1) We toned down the statements about the relationship between fighting strategies and reproductive success throughout the manuscript. For example, in the abstract: "Moreover, shifting from lunging to tussling in socially enriched males is accompanied with better territory control and mating success."

      (2) Regarding whether an SH male can engage in tussling with a GH male, we found that while two SH males rarely perform tussling, paired SH and GH males displayed levels of tussling similar to those of two GH males, although tussling duration in paired SH and GH males is significantly lower than that in two GH males (Figure 6-figure supplement 2).

      (3) To support the potential role of tussling in territory control and mating competition, we performed additional experiments to silence Or47b or pC1SS2 neurons, which almost abolished tussling, and paired these males with control males. We found that males with Or47b or pC1SS2 neurons silenced cannot outcompete control males, further suggesting the involvement of tussling in territory control and mating competition.

      Despite these weaknesses, it is important to acknowledge the authors' courage to initiate an investigation into a less characterized, high-intensity fighting behavior. Tussling requires the simultaneous engagement of two flies. Even if there is confusion over the distinction between lunges and tussling, the authors' conclusion that socially experienced flies and socially isolated flies employ distinct fighting strategies is convincing. Questions that require more rigorous studies are 1) whether such differences are encoded by separate circuits, and 2) whether the different fighting strategies are causally responsible for gaining ethologically relevant resources among socially experienced flies. Enhanced transparency of behavioral data will help readers understand the impact of this study. Lastly, the manuscript often mentions previous works and results without citing relevant references. For readers to grasp the context of this work, it is important to provide information about methods, reagents, and other key resources.

      Thank you very much for this comment and we almost totally agree.

      (1) Our results suggest the involvement of distinct sensory neurons and central neurons for lunging and tussling, but do not exclude the possibility that they may also utilize shared neurons. For example, activation of P1a neurons promotes both lunging and tussling in the presence of light.

      (2) We have now toned down the statements about the relationship between fighting strategies and reproductive success throughout the manuscript.

      (3) We provided more detailed methods and fly genotypes to improve the transparency of the manuscript.

      Reviewer #1 (Recommendations for the authors):

      (1) Figure 1 Supplement 1 shows that increased aging has a linear and inverse relationship with the number of lunges, this is in contrast to a previous study from Dierick lab (Chowdhury, 2021), where using Divider assays they showed that aggressive lunges increased up to day 10 and subsequently decreased in 30-day old flies. Given that this study did not use 14-day-old flies, it might be useful to comment on this.

      Thank you for this comment. Indeed, Chowdhury et al. suggested a decline in lunging after 10 days, which does not contradict our finding that lunging in 14d-old males is lower than that in 7d-old males. Ideally, one would perform a time-series experiment to reveal the detailed relationship between age and aggression (lunging or tussling) levels, but given our initial finding that 14d-old males showed stable tussling behavior, we prefer to use this time point for the rest of this study.

      (2) For Figure 3, do various manipulations also affect the duration of tussling and boxing besides frequency and latency?

      Thank you for this comment. We only analyzed latency and frequency, but not duration, as data analysis was performed manually rather than automatically on every fly pair for about 2 hours, which is very labor-intensive. We hope you will agree with us that these two parameters (frequency and latency) are representative for assaying tussling behavior.

      (3) For Figure 3 A-F, the housing status of the males is not clearly mentioned either in the main text or the figure. What is the status of tussling and lunging when this housing condition is reversed and Or47b neurons are silenced, or the gene is knocked down? Do these manipulations overcome the effect of housing conditions similar to what is seen in NaChBac-mediated activation experiments?

      Figure 3A-F used group-housed males and we have now added such information in the figure legends as well as Table S1.

      We appreciate your suggestion on using different housing conditions. As silencing Or47b neurons or knocking down Or47b reduced tussling, it is reasonable to use GH males (as we did in Figure 3A-F) that performed stable tussling behavior, but not SH males that rarely tussle.

      (4) The connections between Or47b neurons and pC1SS2 or P1a neurons can be addressed by available connectomic datasets or TransTango/GRASP approaches.

      Thank you for this important suggestion. We used the FlyWire electron microscopy database to analyze the pathway connections between these two types of neurons. The results indicated that there are at least three levels of interneurons connecting Or47b and pC1 neurons. Although the FlyWire database currently only contains neuronal data from female brains, it can provide a certain degree of reference for males.

      The lack of direct synaptic connection also suggests that it is challenging to resolve the connection between these two neuronal types using methods like trans-Tango/GRASP. To partially address this question, we utilized trans-Tango and retro-Tango techniques to visualize potential downstream and upstream neurons of P1a and pC1SS2 (Figure 4-figure supplement 2). Future investigations are certainly needed for clarifying functional connections between Or47b/Or67d and P1a/pC1SS2 neurons.

      (5) Figure 5, 'Winning index' and 'Copulation advance index' while described in Material and Methods, should be referred to in the main text.

      We have now described these two indices briefly in the main manuscript, and in more detail in the Discussion section.

      (6) Figure 6 shows comparisons for territorial control and mating outcomes where four different housing and aging conditions are organized in a hierarchical sequence. It is not clear from the data in Figure 5, how this conclusion was arrived at. A supplementary table with various outcomes with statistical analysis would help with this.

      We have now added a supplementary table (Table S2) with various outcomes with statistical analysis.

      Minor Comments

      (1) Line 26 says that the courtship levels in SH and GH males are not different. However, unilateral wing extension is higher in SH males as compared to GH males (Pan & Baker, 2014; Inagaki et al., 2014), and it was also shown that courtship attempts are higher in D. paulistorum (Kim & Ehrman, 1998). It would be better to clarify this statement.

      Indeed, it is found in some cases that SH males court more vigorously than GH males. We have added more references on this matter in the introduction.

      (2) Figure 4, correct 'Tussing' to 'Tussling' or 'Box, Tussling' as appropriate.

      Corrected.

      (3) Duistermars, 2018 should be cited while discussing the role of vision in aggression (Figure 4). [A Brain Module for Scalable Control of Complex, Multi-motor Threat Displays]

      We have now cited this reference and added more discussion in the revised manuscript.

      (4) Reviews on Drosophila aggression and social isolation can be cited in the introduction/discussion to incorporate recent literature, e.g., Palavicino-Maggio, 2022 [The Neuromodulatory Basis of Aggression: Lessons From the Humble Fruit Fly]; Yadav et al., 2024 [Lessons from lonely flies: Molecular and neuronal mechanisms underlying social isolation], etc.

      We have now cited these references in both the introduction and discussion sections.

      (5) The concentration of apple juice agar should be mentioned in the methods.

      We added this and other necessary information for materials in the Materials and Methods section of the study.

      (6) Source of the LifeSongX software and, if available, a Github link would be helpful to include in the materials and methods section.

      We have now provided the source of the LifesongY software (website: https://sourceforge.net/projects/lifesongy/), which is a Windows version of LifesongX (Bernstein, Adam S. et al., 1992).

      Reviewer #2 (Recommendations for the authors):

      (1) Major comment 1

      As pointed out in the public review, the weakness of this study is that the relationship between the aggression strategy and reproductive success is an inference that is not based on experimental facts; I understand that the frequency of tussling is not so high, but at least tussling-like behavior can be observed in the territory control experiment shown in Video 3. Wouldn't it be possible to re-analyse data and examine the correlation between aggressive behavior and territory control? Even if the analysis of tussling itself in this setup is difficult, for example, additional experiments using Or47b knock-out fly or pC1[SS2]-inactivated fly could provide stronger support.

      Indeed, we can only make a correlation between the type of aggressive behavior and territory control. We have now toned down this statement throughout the manuscript. For example, in the abstract, we changed our conclusions as follows:

      Moreover, shifting from lunging to tussling in socially enriched males is accompanied with better territory control and mating success. Our findings identify distinct sensory and central neurons for two fighting forms and suggest how social experience shapes fighting strategies to optimize reproductive success.

      To further address the concern, we have now performed additional experiments to silence Or47b or pC1SS2 neurons, which almost abolished tussling, and paired these males with control males. We found that males with Or47b or pC1SS2 neurons silenced cannot outcompete control males (Figure 6-figure supplement 3), further suggesting the involvement of tussling in territory control and mating competition.

      In relation to the above, some of the text in the Abstract should be changed. Line 28: These findings "reveal" how social experience shapes fighting strategies to optimise reproductive success.

      "suggest" is more accurate at this stage.

      Changed as suggested.

      (2) Major comment 2

      The tussling is the central subject of this paper. However, neither the main text nor Materials and Methods section provides a clear explanation of how this aggression mode was detected. Did the authors determine this behavior manually? Or was it automatically detected by some kind of image analysis? In either case, the criteria and method for detecting the tussling should be clearly described.

      The behavioral data analysis in this study was performed manually. We have now provided a more detailed description of the two fighting forms in the Materials and Methods section. See below:

      Lunging is characterized by a male raising its forelegs and quickly striking the opponent, and each lunge typically lasts less than 0.2 seconds through detailed analysis. Tussling is characterized by both males using their forelegs and bodies to tumble over each other, and this behavior may last from seconds to minutes. Tussling is often mixed with boxing, in which both flies rear up and strike the opponent with forelegs. Since boxing is often transient and difficult to distinguish from tussling, we referred to the mixed boxing and tussling behavior simply as tussling. As we manually analyze tussling for 2 hours for each pair of males, it is possible that we may miss some tussling events, especially those quick ones.

      For the experimental groups where tussling cannot be observed, the latency is regarded as 120 min, but this is a value depending on the observation time. While it is reasonable to use the latency to evaluate the behavior such as the lunging that is observed at relatively early times, care should be taken when using it to evaluate the tussling. Since similar trends to those obtained for the latency are observed for Number of tussles and % of males performing tussling, it may be better to focus on these two indices.

      We initially intended to provide all three statistical metrics. However, we found that using the "% of males performing tussling" would require a significantly larger sample size for subsequent statistical analysis (using chi-square tests), greatly increasing the workload. At the same time, we believe that the trend observed with "% of males performing tussling" is consistent with the other two indices, and the percentage information can also be derived from the individual sample scatter data of the other two metrics. Therefore, we opted to use "latency" and "numbers" as the statistical metrics, despite the caveat you mentioned.
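
      As a minimal sketch of the point above, the percentage of pairs that tussled can be read off per-pair latency data once non-tussling pairs are assigned the 120-minute cap. The latency values and helper names below are hypothetical and invented purely for illustration; they are not data or code from the study. The last line also illustrates the reviewer's caveat that a capped mean latency depends on the length of the observation window.

```python
# Hypothetical latencies in minutes; None marks a pair that never tussled.
# These numbers are invented for illustration and are NOT data from the study.
OBSERVATION_MIN = 120.0  # observation window used as the latency cap


def capped_latencies(latencies, window=OBSERVATION_MIN):
    """Assign the window length to pairs that did not tussle within the window."""
    return [window if (t is None or t > window) else t for t in latencies]


def percent_tussling(capped, window=OBSERVATION_MIN):
    """Percentage of pairs whose capped latency is strictly below the window."""
    tussled = sum(1 for t in capped if t < window)
    return 100.0 * tussled / len(capped)


def mean(values):
    return sum(values) / len(values)


example_pairs = [15.0, 32.5, 48.0, None, 70.0]  # hypothetical group of 5 pairs
capped = capped_latencies(example_pairs)

print(capped)                    # latencies with the 120 min cap applied
print(percent_tussling(capped))  # share of pairs that tussled
print(mean(capped))              # capped mean latency for a 120 min window
print(mean(capped_latencies(example_pairs, window=60.0)))  # same pairs, 60 min window
```

      In this sketch, a pair that first tussled at exactly the cap would be counted as non-tussling, so real scoring would need an explicit flag for that edge case.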

      The authors repeatedly mention that tussling is less frequent but more vigorous. The low frequency can be understood from the data in Fig. 1 and Fig. 2, but there are no measured data on the intensity. As the authors mention in line 125, each tussling event appears to be sustained for a relatively long period, as can be seen from the ethogram in Fig. 2. For example, it would be possible to evaluate the intensity by measuring the duration of the tussling event.

      Thank you for your valuable suggestion. We have now analyzed the duration of tussling and lunging, and found that a lunging event is often very short (less than 0.2 s), while a tussling event may last from seconds to minutes, further supporting their relative intensities. These new data have been added as Figure 2G.

      (3) Minor comments

      a) Line 117 How many flies were placed in one vial for group-rearing (GH)? Were males and females grouped together? Please specify in the Materials and Methods section.

      We have added this information in the Materials and Methods section. In brief, 30-40 virgin males were collected after eclosion and group-housed in each food vial.

      b) Line 174 The trans-Tango is basically a postsynaptic cell labeling technique. It is unlikely that the labeling intensity changes depending on neuronal activity. Do the authors want to say in this text the high activity of Or47b-expressing neurons under GH conditions? Or are they trying to show that the expression level of the Or47b gene, which is supposedly monitored by the expression of GAL4, is increased by GH conditions? The authors should clarify which is the case.

      Although the primary function of the trans-Tango technique is to label downstream neurons, the original literature indicates that the signal strength in downstream neurons depends on the use of upstream neurons, as evidenced by age-dependent trans-Tango signals. Therefore, the trans-Tango technique can indirectly reflect the usage of upstream neurons. Our finding that GH males showed broader Or47b trans-Tango signals than SH males indirectly suggests that group-housing experience acts on Or47b neurons. We made textual changes to clarify this.

      c) Line 178 Which fly line labels the mushroom body; R19B03-GAL4?

      Yes, we have now provided the detailed genotypes for all tested flies in Table S1.

      d) Line 184 It was reported in Koganezawa et al., 2016 that some dsx-expressing pC1 neurons are involved in aggressive behavior. The authors should also refer to this paper as they include tussling in the observed aggressive behavior.

      Thank you for this comment. We have now cited this reference in the revised manuscript.

      e) Line 339 I think you misspelled fruM RNAi.

      Thank you for pointing this out. fruMi refers to microRNAi targeting fruM, and we have now clearly stated this information in the main text.

      f) Line 681 Is tussling time (%) the total duration of tussling occurrences during the observation time? Or is it the percentage of individuals observed tussling during the observation time? This needs to be clarified.

      It is the former. We have now clearly stated this definition in the Materials and Methods section.

      Reviewer #3 (Recommendations for the authors):

      For authors to support their conclusion that enhanced tussling among socially experienced flies allows them to better retain resources, it is necessary to quantify aggressive behaviors (mainly tussling and lunging) in Figure 5.

      We agree that we can only make a correlation between enhanced tussling behavior and mating competition. We have now toned down this statement throughout the manuscript. For example, in the abstract, we changed our conclusions as follows: "Moreover, shifting from lunging to tussling in socially enriched males is accompanied with better territory control and mating success. Our findings identify distinct sensory and central neurons for two fighting forms and suggest how social experience shapes fighting strategies to optimize reproductive success."

      To further address the concern, we have now performed additional experiments in which we silenced Or47b or pC1SS2 neurons, which almost abolished tussling, and paired these males with control males. We found that males with Or47b or pC1SS2 neurons silenced cannot compete with control males (Figure 6-figure supplement 3), further suggesting the involvement of tussling in territory control and mating competition.

      In contrast to the authors' data in Figure 4, movies in ref 36 clearly show instances of 2 flies exchanging lunges after the optogenetic activation of P1a neurons, like the examples shown in supplementary movies S1-S3. It is a clear discrepancy that requires discussion (and raises a concern about the lack of transparency about behavioral quantification).

      In our study, optogenetic activation of P1a neurons failed to induce obvious tussling behavior, and temperature-dependent activation of P1a neurons can only induce tussling in the presence of light. These data are different from Hoopfer et al. (2015), but are generally consistent with a new study (Sten et al., Cell, 2025), in which pC1SS2 neurons but not P1a neurons promote aggression. This discrepancy has now been discussed in the revised manuscript.

      The authors often fail to cite relevant references while discussing previous results, which compromises the scholarship of the manuscript. Examples include (but are not limited to)

      (1) Line 85-86 Simon and Heberlein, J. Exp. Biol. 223 jeb232439 (2020) suggested that tussling is an important factor for flies to establish a dominance hierarchy.

      Reference added.

      (2) Line 142-143 Cuticular compounds such as palmitoleic acid are characterized to be the ligands of Or47b by ref #18.

      Reference added.

      (3) Line 185-187 pC1SS1 and pC1SS2 are first characterized by ref #46. Expression data of this paper also implies that pC1SS1 and pC1SS2 label different neurons in the male brain.

      We have now added this reference at the appropriate place in the revised manuscript. In addition, we have clarified that these two drivers exhibit sexually dimorphic expression patterns in the brain.

      (4) Line 196-199 Cite ref #36, which describes the behavior induced by the optogenetic activation of P1a neurons.

      Reference added.

      (5) Line 233-235 The authors' observation that control males do not form a clear dominance directly contradicts previous observations by others (Nilsen et al., PNAS 101:12342 (2002); Yurkovic et al., PNAS 103:17519 (2006); also see Trannoy et al., PNAS 113:4818 (2016) and Simon and Heberlein above). The authors must at least discuss why their results are different.

      There is a misunderstanding here. We clearly state that there is a ‘winner takes all’ phenomenon. However, for wild-type males of the same age and housing condition, we calculated the winning index as (num. of wins by unmarked males – num. of wins by marked males)/10 encounters * 100%, which is roughly zero due to the randomness of marking.
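
      As an illustration only, the winning index described above can be sketched in a few lines of Python. The function name `winning_index`, its parameter names, and the input check are assumptions made for this sketch and are not taken from the authors' analysis code; only the formula itself comes from the response.

```python
def winning_index(unmarked_wins: int, marked_wins: int, encounters: int = 10) -> float:
    """Winning index (%) = (unmarked wins - marked wins) / encounters * 100.

    Hypothetical helper: names and validation are illustrative assumptions.
    """
    if encounters <= 0:
        raise ValueError("encounters must be positive")
    if unmarked_wins < 0 or marked_wins < 0:
        raise ValueError("win counts must be non-negative")
    if unmarked_wins + marked_wins > encounters:
        raise ValueError("total wins cannot exceed the number of encounters")
    # Multiply before dividing so integer inputs give exact results.
    return 100.0 * (unmarked_wins - marked_wins) / encounters


# A pair in which marked and unmarked males win equally often scores zero,
# which is the expected value when marking is random.
balanced = winning_index(5, 5)
# A pair in which the unmarked male wins every encounter scores +100%.
one_sided = winning_index(10, 0)
```

      Under this definition, a clear winner within each pair is compatible with an index near zero across pairs, because the sign depends only on which male happened to be marked.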

      (6) Line 251-254 The authors' observation that aged males are less competitive than younger males contradicts the conclusion in ref #18. Discussion is required.

      We have now added a discussion on this matter. In brief, Lin et al. showed that 7d-old males are more competitive than 2d-old males, which probably reflects different levels of sexual maturity rather than age itself, whereas our study used males up to 21 days old.

      (7) Line 274-275 It is unclear which "previous studies" "have found that social isolation generally enhances aggression but decreases mating competition in animal models". Cite relevant references.

      Reference added.

      (8) Line 309-310 The evidence supporting the statement that "there are only three pairs of pC1SS2 neurons". If there is a reference, cite it. If it is based on the authors' observation, data is required.

      We have now provided additional data on the number of pC1SS2 neurons in Figure 5G of the revised manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      The manuscript by Feng et al. reported that the Endothelin B receptor (ETBR) expressed by the satellite glial cells (SGCs) in the dorsal root ganglia (DRG) acted to inhibit sensory axon regeneration in both adult and aged mice. Thus, pharmacological inhibition of ETBR with specific inhibitors resulted in enhanced sensory axon regeneration in vitro and in vivo. In addition, sensory axon regeneration is significantly reduced in aged mice, and inhibition of ETBR could restore this defect. Moreover, the study provided some evidence that the reduced level of gap junction protein connexin 43 might act downstream of ETBR to suppress axon regeneration in aged mice. Overall, the study revealed an interesting SGC-derived signal in the DRG microenvironment to regulate sensory axon regeneration. It provided additional evidence that non-neuronal cell types in the microenvironment function to regulate axon regeneration via cell-cell interaction.

      However, the molecular mechanisms by which ETBR regulates axon regeneration are unclear, and the manuscript's structure is not well organized, especially in the last section. Some discussion and explanation about the data interpretation are needed to improve the manuscript. 

      We thank the reviewer for the positive comments. We agree that the mechanisms by which ETBR signaling functions as a brake on axon growth and regeneration remain to be elucidated. We believe that unraveling the detailed molecular pathways downstream of ETBR signaling in SGCs that promote axon regeneration is beyond the scope of this manuscript. Answering these questions would first require cell specific KO of ETBR and Cx43 to confirm that this pathway is operating in SGCs to control axon regeneration. We would also need to identify how SGCs communicate with neurons to regulate axon regeneration, which is a large area of ongoing research that remains poorly understood. Our data showing that pharmacological inhibition of ETBR with specific FDA-approved inhibitors enhances sensory axon regeneration provide not only new evidence for non-neuronal mechanisms in nerve repair, but also a new potential clinical avenue for therapeutic intervention.

      As suggested by the reviewer, we have extensively revised the organization of the manuscript, especially the last section of results. We have performed additional snRNAseq experiments to establish the impact of aging in DRG. We have also performed additional experiments to determine if blocking ETBR improves target tissue reinnervation. Following the reviewer’s suggestion, we have also expanded the Discussion section to discuss alternative mechanisms and offer additional interpretation of our data. Below we describe how we address each point in detail.

      (1) The result showed that the level of ETBR did not change after the peripheral nerve injury. Does this mean that its endogenous function is to limit spontaneous sensory axon regeneration? In other words, the results suggest that SGCs expressing ETBR or vascular endothelial cells expressing its ligand ET-1 act to suppress sensory axon regeneration. Some explanation or discussion about this is necessary. Moreover, does the protein level of ETBR or its ligand change during aging?  

      We thank the reviewer for this point. Our results indeed indicate that one endogenous function of ETBR is to limit the extent of sensory axon regeneration. This may be a part of a mechanism to limit spontaneous sensory axon growth or plasticity and maladaptive neural rewiring after nerve injury. While the increased growth capacity of damaged peripheral axons can lead to reconnection with their targets and functional recovery, the increased growth capacity can also lead to axonal sprouting of the central axon terminals of injured neurons in the spinal cord, and to pain (see for example Costigan et al 2009, PMID: 19400724). In the context of aging that we describe here, this protective mechanism may hinder beneficial recovery. Other mechanisms that slow axon regeneration have been reported, and include, for example, axonally synthesized proteins, which typically support nerve regeneration through retrograde signaling and local growth mechanisms. RNA binding proteins (RBP) are needed for this process. One such RBP, KHSRP, is locally translated following nerve injury. Rather than promoting axon regeneration, KHSRP promotes decay of other axonal mRNAs and slows axon regeneration. Another example includes the Rho signaling pathway, which was shown to function as an inhibitory mechanism that slows the growth of spiral ganglion neurites in culture. We have now included these examples in the Discussion section.

      To address the reviewer’s second question, we have checked protein levels of ETBR and ET-1 in adult and aged DRG tissue. We observed a robust increase in ET-1 in aged DRG, while the levels of ETBR did not appear to change significantly. These results are now presented in Figure 4- Figure Supplement 1, and further support the notion that in aging, activation of the ETBR signaling hinders axon regeneration.

      (2) In ex vivo experiments, NGF was added to the culture medium. Previous studies have shown that adult sensory neurons could initiate fast axon growth in response to NGF within 24 hours. In addition, dissociated sensory neurons could also initiate spontaneous regenerative axon growth without NGF after 48 hours. Some discussion or rationale is needed to explain the difference between NGF-induced and spontaneous axon growth of cultured adult sensory neurons and the roles of ETBR and SGCs.

      We appreciate the reviewer’s suggestion. In adult DRG explant or dissociated cultures, NGF is not typically required for survival or axon outgrowth. However, in dissociated culture, the addition of NGF to the medium stimulates growth from more neurons compared to controls (Smith and Skene 1997). In the DRG explant, NGF does not promote significant effects on axon growth, but stimulates glial cell migration (Klimovich et al 2020). We opted to include NGF in our explant assay to increase the potential of stimulating axon regeneration with pharmacological manipulations of ETBR. We have now clarified these considerations in the Methods section.

      (3) In cultured dissociated sensory neurons, inhibiting ETBR also enhanced axon growth, which meant the presence of SGCs surrounding the sensory neurons. Some direct evidence is needed to show the cellular relationship between them in culture.  

      We thank the reviewer for raising this point and have added new data, now presented in Figure 2B, to show that in mixed DRG cultures, SGCs labeled with Fabp7 are present in the culture in proximity to neurons labeled with TUJ1, but they do not fully wrap the neuronal soma. These results are consistent with prior findings reporting that as time in culture progresses, SGCs lose their adhesive contacts with neuronal soma and adhere to the coverslip (PMID: 22032231, PMID: 27606776).  While in some cases SGCs can maintain their association with neuronal soma in the first day in culture after plating, in our hands, most SGCs have left the soma at the 24h time point we examined. 

      (4) In Figure 3, the in vivo regeneration experiments first showed enhanced axon regeneration either 1 day or 3 days after the nerve injury. The study then showed that inhibiting ETBR could enhance sensory axon growth in vitro from uninjured naïve neurons or conditioning-lesioned neurons. To my knowledge, in vivo sensory axon regeneration is relatively slow during the first 2 days after the nerve injury and then enters the fast regeneration mode on the 3rd day, representing the conditioning lesion effect in vivo. Some discussion is needed to compare the in vitro and the in vivo model of axon regeneration.

      We agree that axon growth is relatively slow the first 2 days and enters a fast growth mode on day 3. This has been elegantly demonstrated in Shin et al Neuron 2012 (PMID: 22726832), where an in vivo conditioning injury 3 days prior increases axon growth one day after injury. In vitro, similar effects have been described: a prior in vivo injury accelerates growth capacity within the first day in culture, but a similar growth mode occurs in naive adult neurons after 2-3 days in vitro (Smith and Skene 1997). We also know that the neurite growth in culture is stimulated by higher cell density, likely because non-neuronal cells can secrete trophic factors (Smith and Skene 1997). Our in vitro results thus suggest that blocking ETBR in SGCs in these mixed cultures may alter the media towards a more growth-promoting state. In vivo, our data show that Bosentan treatment for 3 days partially mimics the conditioning injury and potentiates the effect of the conditioning injury. One possible interpretation is that inhibition of ETBR alters the release of trophic factors from SGCs. Future studies will be required to unravel how ETBR signaling influences the SGC secretome and its influence on axon growth. We have now included these discussion points in the Results and Discussion sections.

      (5) In Figure 5, the study showed that the level of connexin 43 increased after ETBR inhibition in either adult or aged mice, proposing an important role of connexin 43 in mediating the enhancing effect of ETBR inhibition on axon regeneration. However, in the study, there was no direct evidence supporting that ETBR directly regulates connexin 43 expression in SGCs. Moreover, there was no functional evidence that connexin 43 acted downstream of ETBR to regulate axon regeneration.

      We thank the reviewer for this point and agree that we do not provide direct evidence that connexin 43 acts downstream of ETBR to regulate axon regeneration. To obtain such functional evidence would require selective KO of ETBR and Cx43 in SGCs, which we believe is beyond the scope of the current study. We have revised the Results and Discussion sections to emphasize that while we observe that ETBR inhibition increases Cx43 levels and Cx43 levels correlate with axon regeneration, whether Cx43 directly mediates the effect on axon regeneration remains to be established. We also discuss potential alternative mechanisms downstream of ETBR in SGCs that could contribute to the observed effects on axon regeneration. Specifically, we discuss the possibility that ETBR signaling may limit axon regeneration by regulating SGC glutamate reuptake functions, for the following reasons: 1) similarly to astrocytes, glutamate uptake by SGCs is important to regulate neuronal function; 2) exposure of cultured cortical astrocytes to endothelin results in a decrease in glutamate uptake that correlates with a major loss of basal glutamate transporter expression (GLT-1 and GLAST); 3) both glutamate transporters are expressed in SGCs in sensory ganglia; and 4) GLAST and glutamate reuptake function is important for lesion-induced plasticity in the developing somatosensory cortex.

      Reviewer #2 (Public Review): 

      Summary: 

      In this interesting and original study, Feng and colleagues set out to address the effect of manipulating endothelin signaling on nerve regeneration, focusing on the crosstalk between endothelial cells (ECs) in dorsal root ganglia (DRG), which secrete ET-1, and satellite glial cells (SGCs) expressing ETBR receptor. The main finding is that ETBR signaling is a default brake on axon growth, and inhibiting this pathway promotes axon regeneration after nerve injury and counters the decline in regenerative capacity that occurs during aging. ET-1 and ETBR are mapped in ECs and SGCs, respectively, using scRNA-seq of DRGs from adult or aged mice. Although their expression does not change upon injury, it is modulated during aging, with a reported increase in plasma levels of ET-1 (a potent vasoconstrictive signal). Using in vitro explant assays coupled with pharmacological inhibition in mouse models of nerve injury, the authors demonstrate that ET-1/ETBR curbs axonal growth, and the ETAR/ETBR antagonist Bosentan boosts regrowth during the early phase of repair. In addition, Bosentan restores the ability of aged DRG neurons to regrow after nerve lesions. Despite Bosentan inhibiting both endothelin receptors A and B, comparison with an ETAR-specific antagonist indicates that the effects can be attributed to the ET-1/ETBR pathway. In the DRGs, ETBR is mostly expressed by SGCs (and a subset of Schwann cells), a cell type that previous studies, including work from this group, have implicated in nerve regeneration. SGCs ensheath and couple with DRG neurons through gap junctions formed by Cx43. Based on their own findings and evidence from the literature, the pro-regenerative effects of ETBR inhibition are in part attributed to an increase in Cx43 levels, which are expected to enhance neuron-SGC coupling. Finally, gene expression analysis in adult vs aged DRGs predicts a decrease in fatty acid and cholesterol metabolism, for which previous work by the authors has shown a requirement in SGCs to promote axon regeneration.

      Strengths: 

      The study is well-executed and the main conclusion that "ETBR signaling inhibits axon regeneration after nerve injury and plays a role in age-related decline in regenerative capacity" (line 77) is supported by the data. Given that Bosentan is an FDA-approved drug, the findings may have therapeutic value in clinical settings where peripheral nerve regeneration is suboptimal or largely impaired, as it often happens in aged individuals. In addition, the study highlights the importance of vascular signals in nerve regeneration, a topic that has gained traction in recent years. Importantly, these results further emphasize the contribution of long-neglected SGCs to nerve tissue homeostasis and repair. Although the study does not reach a complete mechanistic understanding, the results are robust and are expected to attract the interest of a broader readership.

      We thank the reviewer for the positive comments, especially in regard to the rigor and originality of our study.

      Weaknesses: 

      Despite these positive comments provided above, the following points should be considered: 

      (1) This study examines the contribution of the ET-1 pathway in the ganglia, and in vitro assays are consistent with the idea that important signaling events take place there. Nevertheless, it remains to be determined whether the accelerated axon regrowth observed in vivo depends also on cellular crosstalk mediated by ET-1 at the lesion site. Are ECs along the nerve secreting ET-1? What cells are present in the nerve stroma that could respond and participate in the repair process? Would these interactions be sensitive to Bosentan? It may be difficult to dissect this contribution, but it should at least be discussed.

      We thank the reviewer for this important point and agree that the in vivo effects observed cannot rule out the contribution of ECs or SCs at the lesion site in the nerve. Dissecting the contribution of ETBR-expressing cells in the nerve would require cell-specific manipulations that go beyond the scope of this manuscript. We have revised the Discussion section to highlight the potential contribution of ECs, fibroblasts and SCs in the nerve.

      (2) It is suggested that the permeability of DRG vessels may facilitate the release of "vascular-derived signals" (lines 82-84). Is it possible that the ET-1/ETBR pathway modulates vascular permeability, and that this, in turn, contributes to the observed effects on regeneration?

      We thank the reviewer for raising this interesting point. ET-1 can have an impact on vascular permeability. It was indeed shown that in high glucose conditions, increased trans-endothelial permeability is associated with increased Edn1, Ednra and Ednrb expression and augmented ET-1 immunoreactivity (PMID: 10950122). It is thus possible that part of the effects observed results from altered vascular permeability. We have included this point in the Discussion section. Future experiments will be required to test how injury and age affect vascular permeability in the DRG.

      (3) Is the affinity of ET-3 for ETBR similar to that of ET-1? Can it be excluded that ET-3 expressed by fibroblasts is relevant for controlling SGC responses upon injury/aging?

      We thank the reviewer for raising this point. ET-1 binds to ETAR and ETBR with the same affinity, but ET-3 shows a higher affinity to ETBR than to ETAR (Davenport et al. Pharmacol. Rev 2016 PMID: 26956245). We attempted to examine ET-3 level in adult and aged DRG by western blot, but in our hands the antibody did not work well enough, and we could not obtain clear results. We thus cannot exclude the possibility that ET-3 released by fibroblasts contributes to the effects we observe on axon regeneration. Indeed, in cultured cortical astrocytes, application of either ET-1 or ET-3 leads to inhibition of Cx43 expression. We have revised the text in the Discussion section to highlight the possibility that both ET-1 and ET-3 could participate in the ETBR-dependent effect on axon regeneration.

      (4) ETBR inhibition in dissociated (mixed) cultures uncovers the restraining activity of endothelin signaling on axon growth (Figure 2C). Since neurons do not express ET-1 receptors, based on scRNA-seq analysis, these results are interpreted as an indication that basal ETBR signaling in SGC curbs the axon growth potential of sensory neurons. For this to occur in dissociated cultures, however, one should assume that SGC-neuron association is present, similar to in vivo, or to whole DRG cultures (Figure 2C). Has this been tested?

      We thank the reviewer for this point. In dissociated DRG culture, neurons, SGCs and other non-neuronal cells are present, but SGCs do not retain the surrounding morphology as they do in vivo. Within 24 hours in culture, SGCs lose their adhesive contacts with neuronal soma and adhere to the coverslip (PMID: 22032231, PMID: 27606776). We have included new data in Figure 2B to show that in our culture conditions, SGCs are present, but do not wrap neuronal soma as they do in vivo. We also know from prior studies that the density of the culture affects axon growth, an effect that was attributed to trophic factors released from non-neuronal cells (Smith and Skene 1997). Therefore, although SGCs do not surround neurons, the signaling pathway downstream of ETBR may be present in culture and contribute to the release of trophic factors that influence axon growth. We have revised the Results section to better explain our in vitro results and their interpretation.

      In both in vitro experimental settings (dissociated and whole DRG cultures) how is ETBR stimulated over up to 7 days of culture? In other words, where does endothelin come from in these cultures (which are unlikely to support EC/blood vessel growth)? Is it possible that the relevant ligand here derives from fibroblasts (see point #6)? Or does it suggest that ETBR can be constitutively active (i.e., endothelin-independent signaling)? Is there any chance that endothelin is present in the culture media or Matrigel? 

      We thank the reviewer for raising this point. Our single-cell data indicate that ET-1 is expressed by endothelial cells and ET-3 by fibroblasts. In dissociated DRG culture at the 24h time point, all DRG cells are present, including endothelial cells and fibroblasts, and could represent the source of ET-1 or ET-3. In the explant setting, it is also possible that both ET-1 and ET-3 are released by endothelial cells and fibroblasts during the 7 days in culture. According to information from the suppliers, endothelin is present neither in the culture media nor in the Matrigel. While mutations can facilitate the constitutive activity of the ETBR receptor, we are not aware of data showing that endogenous ETBR can be constitutively active. Because the molecular mechanisms governing ETBR-mediated signaling remain incompletely understood (see for example PMID: 39043181, PMID: 39414992), future studies will be required to elucidate the detailed mechanisms activating ETBR in SGCs and its downstream signaling mechanisms. We have now expanded the Results and Discussion sections to clarify these points.

      (5) The discovery that ET-1/ETBR signaling in SGC curtails the growth capacity of axons at baseline raises questions about the physiological role of this pathway. What happens when ETBR signaling is prevented over a longer period of time? This could be addressed with pharmacological inhibitors, or better, with cell-specific knock-out mice. The experiments would certainly be of general interest, although not within the scope of this story. Nevertheless, it could be worth discussing the possibilities. 

      We agree that this is an interesting point. As mentioned above in response to point #1 of reviewer 1, the physiological role of this pathway could be to limit plasticity and prevent maladaptive neural rewiring that can happen after injury (Costigan et al 2009, PMID: 19400724), but can also hinder beneficial recovery after injury. Other mechanisms that limit axon regeneration capacity have been described and involve local mRNA translation and Rho signaling. We have revised the Discussion section to include these points. We agree that understanding the consequence of blocking ETBR over longer time periods is beyond the scope of the current study, but we now discuss the possibility that blocking ETBR with a cell specific KO approach could unravel its physiological function on target innervation and behavior. 

      (6) Assessing Cx43 levels by measuring the immunofluorescence signal (Figure 5E-F) is acceptable, particularly when the aim is to restrict the analysis to SGCs. The modulation of Cx43 expression by ET-1/ETBR plays an important part in the proposed model. Therefore, a complementary analysis of Cx43 expression by quantitative RT-PCR on sorted SGCs would be a valuable addition to the immunofluorescence data. Is this attainable? 

      We agree and have attempted to perform these types of experiments but encountered technical difficulties. We attempted to sort SGCs from transgenic mice in which SGCs are fluorescently labeled. However, the cells did not survive the sorting process and died in culture. We think that increasing the viability of cells after sorting would require capillary-free fluorescent sorting approaches. However, we do not currently have access to such technology. We attempted this experiment with cultured SGCs, following a previously published protocol (Tonello et al. 2023 PMID: 38156033). In these experiments, SGCs are cultured for 8 days to obtain purity. We did not observe any difference in Cx43 protein or mRNA level upon treatment with ET-1 with or without BQ788. However, in these SGC cultures, Cx43 displayed a diffuse localization, rather than puncta as observed in vivo. Therefore, despite our multiple attempts, quantifying Cx43 on sorted or purified SGCs was not attainable.

      (7) The conclusions "We thus hypothesize that ETBR inhibition in SGCs contributes to axonal regeneration by increasing Cx43 levels, gap junction coupling or hemichannels and facilitating SGC-neuron communication" (lines 303-305) are consistent with the findings but seem in contrast with the effect of aging on gap junction coupling reported by others and cited in line 210: "the number of gap junctions and the dye coupling between these cells increases (Huang et al., 2006)". I am confused by what distinguishes a potential, and supposedly beneficial, increase in coupling after ETBR inhibition, from what is observed in aging.

      We agree that the reported effects of aging on Cx43 levels and gap junction number appear contradictory. Procacci et al 2008 reported that Cx43 expression in SGCs decreases in aged mice. Huang et al 2006 reported that both the number of gap junctions and the dye coupling between these cells increase with aging. Procacci et al suggested as a possible explanation for this apparent discrepancy that additional connexin types other than Cx43 may contribute to the gap junctions between SGCs in aged mice. Our snRNAseq data did not allow us to verify this hypothesis, because there were fewer SGCs in aged mice compared to adult, and connexin genes were detected in only 20% or less of SGCs. Furthermore, our quantification did not look specifically at gap junctions, but just at Cx43 puncta. Cx43 can also form hemichannels in addition to gap junctions, and can also perform non-channel functions, such as protein interaction, cell adhesion, and intracellular signaling. Thus, more research examining the role of Cx43 in SGCs is necessary to address this discrepancy in the literature. We have expanded the Discussion section to include these points.

      (8) I find it difficult to reconcile the results in Figure 5F with the proposed model since (1) injury increases Cx43 levels in both adult and aged mice, (2) the injured aged/vehicle group has a similar level to the uninjured adult group, (3) upon injury, aged+Bosentan is much lower than adult+Bosentan (significance not tested). It seems hard to explain the effect of Bosentan only through the modulation of Cx43 levels. Whether the increase in Cx43 levels following ETBR inhibition actually results in higher SGC-neuron coupling has not been assessed experimentally.

      We thank the reviewer for this point and agree that the effect of Bosentan is likely not exclusively through the modulation of Cx43 levels in SGCs, and that Cx43 levels may simply correlate with axon regenerative capacity. We have revised the manuscript to clarify this point. We have also added the missing significance test in Figure 5F.

      Cell-specific KO of Cx43 and ETBR would allow this hypothesis to be tested directly but is beyond the scope of the current study. We have not tested SGC-neuron coupling, as these experiments are currently beyond our area of expertise. Cx43 also has other functions beyond gap junction coupling, such as protein interaction, cell adhesion, and intracellular signaling. Investigating the precise function of Cx43 would require in-depth biochemical and cell-specific experiments that are beyond the scope of this study. Furthermore, as we mention in response to reviewer #2 point 5, ETBR signaling may also have other downstream effects in SGCs, such as on glutamate transporter expression, or affect other cells in the nerve during the regeneration process. We have revised the Discussion section to include these alternative mechanisms.

      Reviewer #3 (Public Review):

      Summary: 

      This manuscript suggests that inhibiting ETBR via the FDA-approved compound Bosentan can disrupt ET-1-ETBR signalling that they found detrimental to nerve regeneration, thus promoting repair after nerve injury in adult and aged mice. 

      Strengths: 

      (1) The clinical need to identify molecular and cellular mechanisms that can be targeted to improve repair after nerve injury. 

      (2) The proposed mechanism is interesting. 

      (3) The methodology is sound. 

      We thank the reviewer for highlighting the strengths of our study.

      Weaknesses: 

      (1) The data appear preliminary and the story appears incomplete. 

      We appreciate the reviewer’s point. We would like to emphasize that our results provide compelling evidence that ETBR signaling is a default brake on axon growth, and that inhibiting this pathway promotes axon regeneration after nerve injury and counters the decline in regenerative capacity that occurs during aging. We also provide evidence that ETBR signaling regulates the levels of Cx43 in SGCs. Furthermore, our demonstration that an FDA-approved compound increases axon regeneration may be of interest to the broader readership, as there are currently no therapies to improve or accelerate nerve repair after injury. We agree that the detailed mechanisms operating downstream of ETBR will need to be elucidated. Answering these questions would first require cell-specific KO of ETBR and Cx43 to confirm that this pathway is operating in SGCs to control axon regeneration. We would also need to identify how SGCs communicate with neurons to regulate axon regeneration, which is a large area of ongoing research that remains poorly understood. This extensive and highly complex set of experiments is beyond the scope of the current study. As we discussed in our responses to reviewers #1 and #2, we attempted to perform numerous additional experiments to better define the role of ETBR signaling in SGCs in aging and have included additional results in Fig. 2B, Fig 3G-H, Fig 5A-E, Figure 4- Figure Supplement 1, and Figure 5- Figure Supplement 1. We have expanded the Discussion to acknowledge the limitations of our study and to discuss possible mechanisms.

      (2) Lack of causality and clear cellular and molecular mechanism. There are also some loose ends such as the role of connexin 43 in SGCs: how is it related to ET-1-ETBR signalling?

      We thank the reviewer for this point and agree that the molecular mechanisms downstream of ETBR remain to be elucidated. However, we believe that our manuscript reports the interesting potential of an FDA-approved compound to promote nerve repair. We focused on Cx43 downstream of ETBR signaling because decreased Cx43 expression in SGCs in ageing was previously established, but the mechanisms were not elucidated. Furthermore, it was reported that ET1 signaling in cultured astrocytes, which share functional similarities with SGCs, leads to the closure of gap junctions and reduction in Cx43 expression. Our study thus provides a mechanism by which ETBR signaling in SGCs regulates Cx43 expression. Whether Cx43 directly impacts axon regeneration remains to be tested. Cell-specific KO of Cx43 and ETBR would be required to answer this question. We have revised the Introduction and Discussion sections extensively to provide a link between ETBR and Cx43 and to acknowledge the lack of causality for Cx43 in SGCs, as well as to provide additional potential mechanisms by which ETBR inhibition may promote nerve repair.

      Reviewer #2 (Recommendations For The Authors): 

      In addition to the points listed in the Public Review section, please consider the following comments: 

      (1) ETAR, which is high in mural cells, does not seem to be implicated in the reported pro-regenerative effects. Even so, can vasoconstriction be ruled out as an underlying cause of the age-dependent decline in axon regrowth potential and, more generally, in the effects of ET-1 inhibition on regeneration? This could be discussed.

      We agree that we can’t exclude a role of vasoconstriction or an effect on vascular permeability in the age-dependent decline in axon regrowth potential. However, our in vitro and ex vivo experiments, in which vascular-related mechanisms are unlikely, suggest that vasoconstriction may not be a major contributor to the effects we observed.

      (2) The manuscript (e.g. line 287-288) would benefit from a discussion of the role that blood vessels play in the peripheral nervous system, and possibly CNS, repair. Vessels were shown to accompany regenerating fibers and instruct the reorganization of the nerve tissue to favor repair potentially through the release of pro-regenerative signals acting on stromal cells, glia, and other cellular components. Highlighting these processes will help put the current findings into perspective. 

      We agree and have revised the Discussion section to better explain the role of blood vessels in orienting Schwann cell migration and guiding axon regeneration.

      (3) The vast majority of the cells that are sequenced and shown in the UMAP in Figure 1C are from adult (3-month-old) mice [16,923 out of 18,098]. It would be useful to include the UMAP split (or color-coded) by timepoint to appreciate changes in cell clustering that may occur with aging.  

      We apologize for this misunderstanding. Figure 1C included cells from all ages. However, the number of cells we obtained from the aged group was insufficient to perform in-depth analysis of each cell type. We have thus revised this section and Figure 1, now only presenting the data from adult mice.

      It is not discussed why fewer cells were sequenced at later stages. Additionally, I do not know how to interpret the double asterisks next to the labeling "18,098 samples" in Figure 1C. 

      Since our original sequencing of adult and aged mice using 10x yielded so few cells from the aged DRG, we tested and optimized a new technology for single cell preparation of DRG using Illumina Single Cell 3’ RNA Prep. This preparation creates templated emulsions using a vortex mixer to capture and barcode single-cell mRNA instead of a microfluidics system. This method yielded much better results for nuclei recovery from aged DRG, with more nuclei and better quality of nuclei. Thus, we now present in Figure 5 and Figure 5- Figure Supplement 1 the results from snRNA-sequencing of aged and adult DRG using the Illumina single cell kit. The results of the snRNA-sequencing show a decreased abundance of SGCs in aged mice, consistent with the results from our morphology analysis with EM. We were also able to perform SGCs-specific pathway analysis because of the increased number of nuclei captured in the aged SGCs, which we included in the manuscript.

      (4) The in vivo studies are designed to examine the effects of ETBR inhibition during the first phase of axon regrowth after nerve injury (1-3 days post-injury, dpi). Is there a reason why later stages have not been studied? It would be interesting to understand whether ETBR inhibition improves long-term recovery or is only effective at boosting the initial growth of axons through the lesion. It is possible that early inhibition will be enough for long-term recovery. If so, these experiments would define a sensitivity window with therapeutic value.

      We agree that assessing functional recovery requires proper behavioral tests or morphological evaluations of reinnervation. To determine if Bosentan treatment has long-term effects on recovery, we administered Bosentan or vehicle for 3 weeks (daily for 1 week, and then once a week for the subsequent 2 weeks) after sciatic nerve crush. At 24 days after SNC, we assessed intraepidermal nerve fiber density (IENFD) in the injured paw and saw a trend towards increased fibers/mm in the treated animals (new Figure 3G,H). Future studies will examine how long-term Bosentan treatment affects functional recovery and innervation at later time points. Additionally, behavior assays will be needed to determine if these morphological changes, as measured by IENFD, relate to behavioral improvements.

      (5) I am unsure if the gene expression analysis shown in Figure 6 fits well into this story. It is interesting per se and in line with previous work from this group showing the relevance of fatty acid metabolism in SGCs for axon regeneration. Nevertheless, without a mechanistic link to endothelin signaling and Cx43/gap junction modulation, the observations derived from DEG analysis are not well integrated with the rest and may be more distracting than helpful. One limitation is that there is no cell-type information for the DEGs due to the small number of cells recovered from aged mice. For instance, if ETBR inhibition rescued gene downregulation associated with fatty acid/cholesterol metabolism, then the DGE results would become more relevant for understanding the cellular basis of the pro-regenerative effect, which at this point remains quite speculative (lines 264-265; lines 318-319).

      We agree and have added new snRNA sequencing data to replace these findings (see above response to point #4, new Figure 5 and Figure 5- Figure Supplement 1). The new data show a decreased abundance of SGCs in aged mice, consistent with our TEM results. Pathway analysis revealed that aging triggers extensive transcriptional reprogramming in SGCs, reflecting heightened demands for structural integrity, cell junction remodeling, and glia–neuron interactions within the aged DRG microenvironment.

      (6) It would be interesting to determine whether Bosentan increases SGC coverage of neuronal cell bodies in aged mice (Figures 6A-C). 

      We agree that this would be very interesting, but it will require extensive EM analysis at different time points and is beyond the scope of the current manuscript.

      (7) Finally, adding a summary model would help the readers. 

      We agree and have made a summary model, now presented in Figure 6F.

      Reviewer #3 (Recommendations For The Authors): 

      Longer time points post-injury and assessment of functional recovery after Bosentan would be of great value here. 

      We agree that assessing functional recovery requires proper behavioral tests or morphological evaluations of reinnervation. To determine if Bosentan treatment has long-term effects on recovery, we administered Bosentan or vehicle for 3 weeks (daily for 1 week, and then once a week for the subsequent 2 weeks) after sciatic nerve crush. At 24 days after SNC, we assessed intraepidermal nerve fiber density in the injured paw and saw a trend towards increased fibers/mm in the treated animals (Fig 3). While the results do not reach significance, we decided to include this new data as it provides evidence that Bosentan treatment may also improve long-term recovery. Future studies will be required to examine how long-term Bosentan treatment affects functional recovery and innervation at later time points. Additionally, behavior assays will be needed to determine if these morphological changes relate to behavioral improvements.

      It would be important to know how ET-1-ETBR signalling axis promotes the regeneration of axons: this remains unaddressed. What are the cells that are specifically involved? Endothelial cells-SGC-neurons-SC? There are no experiments addressing the role of any of these?

      We agree that the molecular and cellular mechanisms by which ETBR signaling in SGCs regulates axon regeneration remain to be elucidated. Answering these questions would first require cell-specific KO of ETBR and Cx43 to confirm that this pathway is operating in SGCs to control axon regeneration. We would also need to identify how SGCs communicate with neurons to regulate axon regeneration, which is a large area of ongoing research that remains poorly understood. While these are important experiments, because of numerous technical and time constraints, we believe they are beyond the scope of the current manuscript.

      How does connexin 43 in SGCs relate to ET-1-ETBR signalling?

      The relation between connexin 43 and ETBR signaling stems from observations made in astrocytes. ET1 signaling in cultured astrocytes, which share functional similarities with SGCs, was shown to lead to the closure of gap junctions and the reduction in Cx43 expression. Because expression of Cx43, a major connexin in SGCs as in astrocytes, was previously shown to be reduced at the protein level in SGCs from aged mice, we decided to explore if this ETBR-Cx43 mechanism also operates in SGCs. We have revised the Introduction and Discussion sections extensively to acknowledge the lack of causality for Cx43 expression in SGCs and to provide additional potential mechanisms by which ETBR inhibition may promote nerve repair.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      EnvA-pseudotyped glycoprotein-deleted rabies virus has emerged as an essential tool for tracing monosynaptic inputs to genetically defined neuron populations in the mammalian brain. Recently, in addition to the SAD B19 rabies virus strain first described by Callaway and colleagues in 2007, the CVS N2c rabies virus strain has become popular due to its low toxicity and high trans-synaptic transfer efficiency. However, despite its widespread use in the mammalian brain, particularly in mice, the application of this cell-type-specific monosynaptic rabies tracing system in zebrafish has been limited by low labeling efficiency and high toxicity. In this manuscript, the authors aimed to develop an efficient retrograde monosynaptic rabies-mediated circuit mapping tool for larval zebrafish. Given the translucent nature of larval zebrafish, whole-brain neuronal activities can be monitored, perturbed, and recorded over time. Introducing a robust circuit mapping tool for larval zebrafish would enable researchers to simultaneously investigate the structure and function of neural circuits, which would be of significant interest to the neural circuit research community. Furthermore, the ability to track rabies-labeled cells over time in the transparent brain could enhance our understanding of the trans-synaptic retrograde tracing mechanism of the rabies virus. 

      To establish an efficient rabies virus tracing system in the larval zebrafish brain, the authors conducted meticulous side-by-side experiments to determine the optimal combination of trans-expressed rabies G proteins, TVA receptors, and recombinant rabies virus strains. Consistent with observations in the mouse brain, the CVS N2c strain trans-complemented with N2cG was found to be superior to the SAD B19 combination, offering lower toxicity and higher efficiency in labeling presynaptic neurons. Additionally, the authors tested various temperatures for the larvae post-virus injection and identified 36℃ as the optimal temperature for improved virus labeling. They then validated the system in the cerebellar circuits, noting evolutionary conservation in the cerebellar structure between zebrafish and mammals. The monosynaptic inputs to Purkinje cells from granule cells were neatly confirmed through ablation experiments.

      However, there are a couple of issues that this study should address. Additionally, conducting some extra experiments could provide valuable information to the broader research field utilizing recombinant rabies viruses as retrograde tracers.

      (1) It was observed that many radial glia were labeled, which casts doubt on the specificity of trans-synaptic spread between neurons. The issues of transneuronal labeling of glial cells should be addressed and discussed in more detail. In this manuscript, the authors used a transgenic zebrafish line carrying a neuron-specific Cre-dependent reporter and EnvA-CVS N2c(dG)-Cre virus to avoid the visualization of virally infected glial cells. However, this does not solve the real issue of glial cell labeling and the possibility of a nonsynaptic spread mechanism.

      In agreement with the reviewer’s suggestion, we have incorporated a standalone section in the revised Discussion (page 9) to address the issue of transneuronal glial labeling, including its spatial distribution, temporal dynamics, potential mechanisms, and possible strategies for resolving it.

      Regarding the specificity of trans-synaptic spread between neurons, we have demonstrated that our transsynaptic tracing system reliably and specifically labels input neurons. Structurally, we only observed labeling of inferior olivary cells (IOCs) outside the cerebellum, which are the only known extracerebellar inputs to Purkinje cells (PCs), while all other traced neurons remained confined within the cerebellum throughout the observation period (see Figure 2G–I). Functionally, we verified that the traced neurons formed synaptic connections with the starter PCs (see Figure 2J–M). Together, these findings support the conclusion that our system enables robust and specific retrograde monosynaptic tracing of neurons in larval zebrafish.

      Regarding the transneuronal labeling of radial glial cells, we observed that their distribution closely correlates with the location of neuronal somata and dendrites (see Author response image 2). In zebrafish, radial glial cells are considered functional analogs of astrocytes and are often referred to as radial astroglia. The adjacent labeled astroglia may participate in tripartite synapses with the starter neurons and express viral receptors that enable RV particle entry at postsynaptic sites. This suggests that rabies-based tracing in zebrafish may serve as a valuable tool for identifying synaptically associated and functionally connected glia. Leveraging this approach to investigate glia–neuron interactions represents a promising direction for future research.

      In our system, the glial labeling diminishes at later larval stages, likely due to abortive infection (see Author response image 3 and relevant response). However, the eventual clearance of infection does not preclude the initial infection of glial cells, which may compete with neuronal labeling and reduce overall tracing efficiency. Notably, transneuronal infection of glial cells by RV has also been observed in mammals (Marshel et al., 2010). To minimize such off-target labeling, future work should focus on elucidating the mechanisms underlying glial susceptibility, such as receptor-mediated viral entry, and developing strategies to suppress receptor expression specifically in glia, thereby improving the specificity and efficiency of neuronal circuit tracing.

      In addition, wrong citations in Line 307 were made when referring to previous studies discovering the same issue of RVdG-based transneuronal labeling radial glial cells. "The RVdG-based transneuronal labeling of radial glial cells was commonly observed in larval zebrafish29,30".

      The cited work was conducted using vesicular stomatitis virus (VSV). A more thorough analysis and/or discussion on this topic should be included.

      We thank the reviewer for pointing out the citation inaccuracy. The referenced study employed vesicular stomatitis virus (VSV), which, like RV, is a member of the Rhabdoviridae family. We have revised the text accordingly, from "RVdG-based transneuronal labeling of radial glial cells…" to "Transneuronal labeling of radial glial cells mediated by VSV, a member of the Rhabdoviridae family like RV, has been commonly observed in larval zebrafish" (page 9, line 347).

      Several key questions should be addressed:

      Does the number of labeled glial cells increase over time? 

      Yes, as shown in Figure 2—figure supplement 1C and G, the number of labeled radial glial cells significantly increased from 2 to 6 days post-injection (dpi). This phenomenon has been addressed in the revised Discussion section (page 9, line 357).

      Do they increase at the same rate over time as labeled neurons?

      Although glial cell labeling continued to increase over time, we observed a slowdown in labeling rate between 6 and 10 dpi, as shown in Figure 2—figure supplement 1C and G. Therefore, we divided the timeline into two intervals (2–6 and 6–10 dpi) to compare the rate of increase in labeling between neurons and glia. The rate (R) was defined as the daily change in convergence index. To quantify the difference between neuronal and glial labeling rates, we calculated a labeling rate index, R<sub>g</sub> − R<sub>n</sub>, where R<sub>g</sub> and R<sub>n</sub> denote the rates for glia and neurons, respectively (Author response image 1). Our analysis revealed that, between 2 and 6 dpi, glial cells exhibited a higher labeling rate than neurons. However, this trend reversed between 6 and 10 dpi, with neurons surpassing glial cells in labeling rate. These findings have been included in the revised Discussion section (page 9).
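      The index defined above can be illustrated with a minimal sketch. It assumes the rate R is the change in convergence index divided by the number of days in the interval; the function names and the convergence-index values are hypothetical and are not data from the study.

```python
def daily_rate(ci_start, ci_end, dpi_start, dpi_end):
    """Rate R: daily change in convergence index over one interval."""
    if dpi_end <= dpi_start:
        raise ValueError("dpi_end must be later than dpi_start")
    return (ci_end - ci_start) / (dpi_end - dpi_start)


def labeling_rate_index(glia_ci, neuron_ci, dpi_start, dpi_end):
    """R_g - R_n for one interval; a positive value means glia are labeled faster."""
    r_g = daily_rate(glia_ci[0], glia_ci[1], dpi_start, dpi_end)
    r_n = daily_rate(neuron_ci[0], neuron_ci[1], dpi_start, dpi_end)
    return r_g - r_n


# Hypothetical (start, end) convergence indices, chosen only for illustration.
early = labeling_rate_index(glia_ci=(1.0, 5.0), neuron_ci=(1.0, 3.0), dpi_start=2, dpi_end=6)
late = labeling_rate_index(glia_ci=(5.0, 6.0), neuron_ci=(3.0, 6.0), dpi_start=6, dpi_end=10)
print(early, late)
```

      With these illustrative inputs the index is positive for the 2–6 dpi interval and negative for the 6–10 dpi interval, which is the sign pattern described in the response.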

      Author response image 1.

      Labeling rate index of glia and neurons across two time intervals. Data points represent the mean labeling rate index for each tracing strategy within each time interval. *P < 0.05 (nonparametric two-tailed Mann-Whitney test).  

      Are the labeled glial cells only present around the injection site?

      We believe the reviewer is inquiring whether labeled glial cells are spatially restricted to the vicinity of starter neurons. The initial infection is determined by the expression of TVA rather than the injection site. For example, injecting a high volume of virus into the anterior hindbrain resulted in the infection of TVA-expressing cells in distant regions, including the tectum and posterior hindbrain (Author response image 2).

      Regarding glial labeling, PC starter experiments showed that labeled glial cells (i.e. Bergmann glia) were predominantly localized within the cerebellum, likely due to the confinement of PC dendrites to this region. When using vglut2a to define starter neurons, glial labeling was frequently observed near the soma and dendrites of starter cells (14 out of 17 cases; Author response image 2). These observations suggest that transneuronally labeled glial cells may be synaptically associated with the starter neurons. We have included this point in the revised Discussion section (page 9).

      Author response image 2.

      Location of transneuronally labeled glial cells. (a and b) Confocal images showing the right tectum (a) and posterior hindbrain (b) of different WT larvae expressing EGFP and TVA using UGNT in randomly sparse neurons (vglut2a<sup>+</sup>) and infected with CVSdG-tdTomato[EnvA] (magenta) injected into the anterior hindbrain. Dashed yellow circles, starter neurons (EGFP<sup>+</sup>/tdTomato<sup>+</sup>); gray arrows, transneuronally labeled radial glia (tdTomato<sup>+</sup>/EGFP<sup>−</sup>); dashed white lines, tectum or hindbrain boundaries. C, caudal; R, rostral. Scale bars, 20 μm.

      Can the phenomenon of transneuronal labeling of radial glial cells be mitigated if the tracing is done in slightly older larvae?

      Yes, we agree. As elaborated in the following response, we hypothesize that the loss of fluorescence in radial glial cells at later developmental stages is due to abortive infection (see Author response image 3 and associated response). This supports the notion that abortive infection becomes increasingly pronounced as larvae mature, potentially explaining the negligible glial labeling observed in adult zebrafish (Dohaku et al., 2019; Satou et al., 2022). However, as noted in our response to the first comment, the disappearance of fluorescence does not indicate the absence of viral entry. Viral receptors may be expressed on glial cells, allowing initial infection despite a failure in subsequent replication. Consequently, glial infection, though abortive, may still compete with neuronal infection and reduce tracing efficiency.

      What is the survival rate of the infected glial cells over time?

      We observed the disappearance of glial fluorescence after transneuronal labeling, while we did not observe punctate fluorescent debris typically indicative of apoptotic cell death. Therefore, we favor the hypothesis that the loss of glial fluorescence results from abortive infection rather than cell death. Abortive infection refers to a scenario in which viral replication is actively suppressed by host antiviral responses, preventing the production of infectious viral particles. For example, recent studies have shown that lab-attenuated rabies virus (RV) induces the accumulation of aberrant double-stranded RNA in astrocytes, which activates mitochondrial antiviral-signaling protein (MAVS) and subsequent interferon expression (Tian et al., 2018). This antiviral response inhibits RV replication, ultimately resulting in abortive infection.

      In addition, we quantified the proportion of glial cells labeled at 2 dpi and 4 dpi that retained fluorescence over time. By 6 dpi (approximately 11 dpf), glial labeling had largely diminished in both groups (Author response image 3). These results suggest that the decline in glial fluorescence is more closely linked to larval age than to the duration of glial infection, supporting the notion of abortive infection. This also addresses the reviewer’s earlier concern and indicates that glial labeling is mitigated in older larvae.

      Author response image 3.

      Fraction of glial cells with fluorescence retention. (a and b) Proportion of glial cells labeled at 2 dpi (a) and 4 dpi (b) that retained fluorescence over time. Data are from the CVS|N2cG|36°C group. In boxplots: center, median; bounds of box, first and third quartiles; whiskers, minimum and maximum values. n.s., not-significant; *P < 0.05, **P < 0.01 (nonparametric two-tailed Mann-Whitney test).

      If an infected glial cell dies due to infection or gets ablated, does the rabies virus spread from the dead glial cells?

      In our system, glial cells do not express the rabies glycoprotein (G). Therefore, even if glial cells are transneuronally infected, they cannot support viral budding or assembly of infectious particles due to the absence of G (Mebatsion et al., 1996), preventing further viral propagation to neighboring cells.

      If TVA and rabies G are delivered to glial cells, followed by rabies virus injection, will it lead to the infection of other glial cells or neurons?

      We have conducted experiments in which TVA and rabies G were specifically expressed in astroglia using the gfap promoter, followed by RVdG-mCherry[EnvA] injection. This resulted in initial infection of TVA-positive astroglia and occasional subsequent labeling of nearby TVA-negative astroglia (Author response image 4), suggesting astroglia-to-astroglia transmission. Notably, no neuronal labeling was observed. This glial-to-glial spread is consistent with previous rabies tracing studies reporting similar phenomena involving the interaction of astrocytes with astrocytes and microglia (Clark et al., 2021). However, the underlying mechanism remains unclear, and we have discussed this in response to the first comment.

      Author response image 4.

      Viral tracing initiated from astroglia. (a) Confocal images of the tectum of a larva expressing EGFP and TVA using UGBT in randomly sparse astroglia (gfap<sup>+</sup>) and infected by SADdG-mCherry[EnvA] (magenta) injected into the anterior hindbrain. (b) Confocal images of the posterior hindbrain of a larva expressing EGFP and TVA using UGNT in randomly sparse astroglia (gfap<sup>+</sup>) and infected by CVSdG-tdTomato[EnvA] (magenta) injected into the anterior hindbrain. Dashed yellow circles, starter astroglia (EGFP<sup>+</sup>/mCherry<sup>+</sup> or EGFP<sup>+</sup>/tdTomato<sup>+</sup>); gray arrows, transneuronally labeled astroglia (tdTomato<sup>+</sup>/EGFP<sup>−</sup>); dashed white lines, tectum or hindbrain boundaries. C, caudal; R, rostral. Scale bars, 20 μm.

      Answers to any of these questions could greatly benefit the broader research community.

      (2) The optimal virus tracing effect has to be achieved by raising the injected larvae at 36°C. Since the routine temperature of zebrafish culture is around 28°C, a more thorough characterization of the effect on the health of zebrafish should be conducted.

      Yes, 36°C is required to achieve optimal labeling efficiency. Although this is above the standard zebrafish culture temperature (28°C), previous work (Satou et al., 2022) and our observations indicate that this transient elevation does not adversely affect larval health within the experimental time window. 

      In the previous study, Satou et al. reported no temperature-dependent effects on swimming behavior, social interaction, or odor discrimination in adult fish maintained at 28°C and 36°C. In larvae, both non-injected and virus-injected fish showed a decrease in survival at later time points (7 dpi), with slightly increased mortality observed at elevated temperatures.

      In our study, we raised the same batch of non-virus-injected larvae at 28°C and 36°C, and found no mortality over a 10-day period. For CVS-N2c-injected larvae, electrode insertion caused injury, but survival rates remained around 80% at both temperatures (see Figure 3A). Moreover, we successfully maintained CVS-N2c-injected larvae at 36°C for over a month, indicating that elevated temperature does not adversely affect fish health. Notably, higher temperatures were associated with an accelerated developmental rate. 

      This point was briefly addressed in the previous version and has now been further elaborated in the revised Discussion section (page 8).

      (3) Given the ability of time-lapse imaging of the infected larval zebrafish brain, the system can be taken advantage of to tackle important issues of rabies virus tracing tools.

      a) Toxicity. 

      The toxicity of rabies viruses is an important issue that limits their application and affects the interpretation of traced circuits. For example, if a significant proportion of starter cells die before analysis, the traced presynaptic networks cannot be reliably assigned to a "defined" population of starter cells. In this manuscript, the authors did an excellent job of characterizing the effects of different rabies strains, G proteins derived from various strains, and levels of G protein expression on starter cell survival. However, an additional parameter that should be tested is the dose of rabies virus injection. The current method section states that all rabies virus preparations were diluted to 2x10^8 infection units per ml, and 2-5 nl of virus suspension was injected near the target cells. It would be interesting to know the impact of the dose/volume of virus injection on retrograde tracing efficiency and toxicity. Would higher titers of the virus lead to more efficient labeling but stronger toxicities? What would be the optimal dose/volume to balance efficiency and toxicity? Addressing these questions would provide valuable insights and help optimize the use of rabies viruses for circuit tracing.

      This is an important concern. Viral cytotoxicity is primarily driven by the level of viral transcription and replication, which inhibits host protein synthesis (Komarova et al., 2007). The RVdG-EnvA typically infects cells at a rate of one viral particle per cell (Zhang et al., 2024), suggesting that increasing viral concentration does not proportionally increase per-cell infection. Accordingly, viral titer and injection volume are unlikely to influence cytotoxicity at the single-cell level. In our experiments, injection volumes up to 20 nl (i.e., 4 to 10 times the standard injection volume) did not affect starter cell survival. However, higher titers or volumes may increase the number of initially infected starter cells, potentially leading to greater overall mortality in larval zebrafish.
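      For a rough sense of scale, the dose figures quoted above can be converted into infectious units per injection. The sketch below is illustrative only: it assumes the titer of 2 × 10^8 infectious units per ml and the 2-5 nl standard volume quoted by the reviewer from the Methods, and the function name is hypothetical rather than taken from the manuscript.

```python
# Illustrative arithmetic only; the titer and volumes are the values quoted in the
# review (2e8 infectious units per ml, 2-5 nl standard volume, 20 nl maximum tested).
NL_PER_ML = 1e6  # 1 ml = 1,000,000 nl


def infectious_units(titer_iu_per_ml: float, volume_nl: float) -> float:
    """Nominal number of infectious units delivered in one injection."""
    return titer_iu_per_ml * volume_nl / NL_PER_ML


TITER = 2e8  # infectious units per ml, as quoted from the Methods

for volume_nl in (2, 5, 20):
    print(f"{volume_nl} nl -> about {infectious_units(TITER, volume_nl):.0f} infectious units")

# Fold increase of the 20 nl injection over the standard 2-5 nl range
fold_vs_largest_standard = 20 / 5
fold_vs_smallest_standard = 20 / 2
```

      The two fold values correspond to the "4 to 10 times the standard injection volume" stated in the response.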

      Similarly, given that rabies virus typically infects cells at one particle per cell, increasing viral titer alone is unlikely to enhance tracing efficiency once the virus type is fixed. In contrast, the level of G protein expression significantly influences tracing efficiency (see Figure 2D). However, excessive G protein expression reduces the survival of starter cells (see Figure 3D). Therefore, careful control of G protein levels is essential to balance tracing efficiency and cytotoxicity.

      Notably, regardless of whether infected cells undergo apoptosis or necrosis due to cytotoxicity, the resulting disruption of the plasma membrane severely impairs viral budding. As a result, the formation of intact, G protein-enveloped viral particles is prevented, limiting further infection of neighboring neurons.

      The latest second-generation ΔGL RV vectors (Jin et al., 2024), which lack both the G and L (viral polymerase) genes, have been shown to markedly reduce cytotoxicity. These improved tracing strategies may be explored in future zebrafish studies to further optimize labeling efficiency and cell viability.

      The issue of viral titer and volume has been addressed in the revised Discussion section (page 10).

      b) Primary starters and secondary starters: 

      Given that the trans-expression of TVA and G is widespread, there is the possibility of coexistence of starter cells from the initial infection (primary starters) and starter cells generated by rabies virus spreading from the primary starters to presynaptic neurons expressing G. This means that the labeled input cells could be a mixed population connected with either the primary or secondary starter cells.

      It would be immensely interesting if time-lapse imaging could be utilized to observe the appearance of such primary and secondary starter cells. Assuming there is a time difference between the initial appearance of these two populations, it may be possible to differentiate the input cells wired to these populations based on a similar temporal difference in their initial appearance. This approach could provide valuable insights into the dynamics of rabies virus spread and the connectivity of neural circuits.

      The reviewer's suggestion is valuable. Regarding the use of Purkinje cells (PCs) as starter cells, we consider the occurrence of secondary PCs to be extremely rare. Although previous evidence suggests that PCs can form synaptic connections with one another (Chang et al., 2020), our sparse labeling strategy, which typically involves fewer than 10 labeled cells, significantly reduces the likelihood of viral transmission between PC starter cells. In addition, if secondary starter PCs were frequently generated, we would expect increased tracing efficiency at 10 dpi compared to 6 dpi. However, our results show no significant difference (see Figure 2—figure supplement 1C and G).

      Given the restricted expression of TVA and G in PCs, even if a limited number of secondary starters were generated, the labeled inputs would predominantly be granule cells (GCs), thereby preserving the cell-type identity of upstream inputs. This does raise a potential concern regarding overestimation of the convergence index (CI). Notably, however, within the GC-PC circuit, individual GCs often project to multiple PCs. Consequently, a GC labeled via a secondary PC may also be a bona fide presynaptic partner of the primary starter population. This overlap could mitigate the overestimation of CI. Taken together, we believe that the CI values reported in this study provide a reasonable approximation of monosynaptic connectivity.

      In scenarios where TVA and G are broadly expressed—for example, under the control of vglut2a promoter—secondary starter cells may arise frequently. In such cases, long-term time-lapse imaging in the zebrafish whole brain presents a promising strategy to distinguish primary and secondary starter cells, along with their respective input populations, based on the timing of their appearance. This approach potentially enables multi-step circuit tracing within individual animals. An alternative strategy is to use an EnvA-pseudotyped, G-competent rabies virus, which allows targeted initial infection while supporting multisynaptic propagation. When combined with temporally resolved imaging, this strategy could facilitate direct labeling of higher-order circuits and allow clear differentiation between multi-order inputs and the original starter population over time.

      In conclusion, we find this suggestion compelling and will explore these strategies in future studies to optimize and broaden the application of rabies virus-based circuit tracing.

      Reviewer #2 (Public Review):

      The study by Chen, Deng et al. aims to develop an efficient viral transneuronal tracing method that allows efficient retrograde tracing in the larval zebrafish. The authors utilize pseudotyped-rabies virus that can be targeted to specific cell types using the EnvA-TvA systems. Pseudotyped rabies virus has been used extensively in rodent models and, in recent years, has begun to be developed for use in adult zebrafish. However, compared to rodents, the efficiency of the spread in adult zebrafish is very low (~one upstream neuron labeled per starter cell). Additionally, there is limited evidence of retrograde tracing with pseudotyped rabies in the larval stage, which is the stage when most functional neural imaging studies are done in the field. In this study, the authors systematically optimized several parameters of rabies tracing, including different rabies virus strains, glycoprotein types, temperatures, expression construct designs, and elimination of glial labeling. The optimal configurations developed by the authors are up to 5-10 fold higher than more typically used configurations.

      The results are solid and support the conclusions. However, the methods should be described in more detail to allow other zebrafish researchers to apply this method in their own work.

      Additionally, some findings are presented anecdotally, i.e., without quantification or sufficient detail to allow close examinations. Lastly, there is concern that the reagents created by the authors will not be easily accessible to the zebrafish community.

      (1) The titer used in each experiment was not stated. In the methods section, it is stated that aliquots are stored at 2x10e8. Is it diluted for injection? Are all of the experiments in the manuscripts with the same titer?

      We injected all three viral vectors as undiluted stock aliquots. The titers for SADdGmCherry[EnvA], CVSdG-tdTomato[EnvA], and CVSdG-mCherry-2A-Cre[EnvA] were 2 × 10<sup>8</sup>, 2 × 10<sup>8</sup>, and 3 × 10<sup>8</sup> infectious units/mL, respectively. This has been clarified in the updated Methods section (page 12).

      (2) The age for injection is quite broad (3-5 dpf in Fig 1 and 4-6 dpf in Fig 2). Given that viral spread efficiency is usually more robust in younger animals, describing the exact injection age for each experiment is critical.

      We appreciate the reviewer’s suggestions. For the initial experiments tracing randomly from neurons in Figure 1, the injection age was primarily 3–4 dpf, with a one-day difference. Due to the slower development of PCs, the injection age for experiments related to Figures 2, 3, and 4 was mainly 5 dpf. To clarify the developmental stages at the time of injection for each experiment, we have added new tables (see Figure 1,2—table supplement 2) listing the number of fish used at each injection age for all experimental groups shown in Figures 1 and 2.

      (3) More details should be provided for the paired electrical stimulation-calcium imaging study. How many GC cells were tested? How many had corresponding PC cell responses? What is the response latency? For example, images of stimulated and recorded GCs and PCs should be shown.

      Yes, these are important details for the paired electrical stimulation-calcium imaging study. We stimulated 33 GCs from 32 animals and detected calcium responses in putative postsynaptic PCs in 15 cases. Among these, we successfully ablated the single GC in 11 pairs and observed a weakened calcium response in PCs following ablation (see Figure 2M). The response latency was determined as the first calcium imaging frame where ΔF/F exceeded the baseline (pre-stimulus average) by 3 times the standard deviation. Imaging was performed at 5 Hz, and as shown in Figure 2L, the calculated average response latency was 152 ± 35 ms (mean ± SEM), indicating an immediate response with calcium intensity from the first post-stimulus imaging frame consistently exceeding the threshold.
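      The latency criterion described above (the first frame where ΔF/F exceeds the pre-stimulus baseline by 3 standard deviations) can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name, the example trace, and the choice to take the standard deviation from the pre-stimulus frames are assumptions.

```python
import numpy as np


def first_response_frame(dff, stim_frame, n_sd=3.0):
    """Offset (in frames) of the first post-stimulus frame whose dF/F exceeds
    baseline mean + n_sd * baseline SD, or None if no frame crosses it.

    Assumption: frames [0, stim_frame) are pre-stimulus baseline, and the
    standard deviation is computed over those same baseline frames.
    """
    dff = np.asarray(dff, dtype=float)
    baseline = dff[:stim_frame]
    threshold = baseline.mean() + n_sd * baseline.std()
    above = np.flatnonzero(dff[stim_frame:] > threshold)
    return int(above[0]) if above.size else None


# Hypothetical trace: five baseline frames, then a clear response.
trace = [0.0, 0.1, -0.1, 0.0, 0.05, 1.0, 0.8]
offset = first_response_frame(trace, stim_frame=5)

# At the 5 Hz imaging rate stated in the response, one frame spans 200 ms,
# which sets the temporal resolution of any latency read off in whole frames.
frame_period_ms = 1000.0 / 5.0
```

      An offset of 0 means the very first post-stimulus frame already crosses the threshold, which is the situation the response describes.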

      We have added additional details to the Results (page 5), Discussion (page 9), and Methods (page 15) sections. A representative image showing both the stimulated GC and the recorded PC has been added to Figure 2 in the revised manuscript (see Figure 2K).

      (4) It is unclear how connectivity between specific PC and GC is determined for single neuron connectivity. In other images (Figure 4C), there are usually multiple starter cells and many GCs. It was not shown that the image resolution can establish clear axon dendritic contacts between cell pairs.

      In our experiments, sparse labeling typically results in 1–10 starter cells per fish. Regarding the case shown in Figure 4C (right column), only two PC starters were labeled, which simplifies the assignment of presynaptic inputs to individual PCs. Connectivity is determined based on clear axon-dendritic or axon-cell body apposition between GCs and PCs. We have accordingly added more details to the Methods (page 16) section regarding how we determined connectivity between specific PCs and GCs.

      Reviewer #2 (Recommendations For The Authors):

      To enable broader use of this technique, I would encourage the authors to submit their zebrafish lines, plasmids, and plasmid sequences to public repositories such as ZIRC and  Addgene. Additionally, there is no mention of how viral vectors will be shared.

      We have deposited the related zebrafish lines at CZRC (China Zebrafish Resource Center) and uploaded plasmid maps and sequences to Addgene. The viral vectors are available through BrainCase (Shenzhen, China). We have included the information in the revised manuscript.

      Reviewer #3 (Public Review):

      Summary:

      The authors establish reagents and define experimental parameters useful for defining neurons retrograde to a neuron of interest.

      Strengths:

      A clever approach, careful optimization, novel reagents, and convincing data together lead to convincing conclusions.

      Weaknesses: 

      In the current version of the manuscript, the tracing results could be better centered with respect to past work, certain methods could be presented more clearly, and other approaches are worth considering.

      Appraisal/Discussion:

      Trans-neuronal tracing in the larval zebrafish preparation has lagged behind rodent models, limiting "circuit-cracking" experiments. Previous work has demonstrated that pseudotyped rabies virus-mediated tracing could work, but published data suggested that there was considerable room for optimization. The authors take a major step forward here, identifying a number of key parameters to achieve success and establishing new transgenic reagents that incorporate modern intersectional approaches. As a proof of concept, the manuscript concludes with a rough characterization of inputs to cerebellar Purkinje cells. The work will be of considerable interest to neuroscientists who use the zebrafish model.

      Reviewer #3 (Recommendations For The Authors):

      The main limitations of the work are as follows:

      (1) The optimizations might differ for different neurons. Purkinje cells are noteworthy because they develop considerably during the time window detailed here, almost doubling in number between 7-14dpf. Presumably, connectivity follows. This sort of neurogenesis is much less common elsewhere. It would be useful to show similar results in, say, tectal neurons, which would have spatially-restricted retinal ganglion cells labelled.

      We acknowledge that Purkinje cells (PCs) undergo significant development between 7–14 dpf, which may influence synaptic connectivity and result in differences in tracing efficiency. However, all experimental conditions were standardized across groups, and the selection of starter PCs was unbiased, typically focusing on PCs in the lateral region of the CCe (corpus cerebelli) subregion, ensuring that the relative comparisons remain valid. 

      We agree that testing other neuronal populations would be valuable, as tracing efficiency is influenced by multiple factors, such as the number of endogenous inputs, synaptic maturation, and developmentally regulated synaptic strength. Tectal neurons, which receive spatially restricted retinal ganglion cell inputs, would be a suitable choice for further investigation. However, due to the various tectal cell types and the opacity of the eyeball, such studies present additional technical challenges and are beyond the scope of this paper.

      (2) The virus is delivered by means of microinjection near the cell. This is invasive and challenging for labs that don't routinely perform electrophysiology. It would be useful to know if coarser methods of viral delivery (e.g. intraventricular injection) would be successful.

      Our protocol does not require the level of precision needed for electrophysiology. The procedure can be performed using a standard high-magnification upright (135× magnification, Nikon SMZ18) or inverted fluorescence microscope (200× magnification, Olympus IX51). The virus suspension was loaded into a glass micropipette with a ~10 µm tip diameter and directly microinjected into the target region using a micromanipulator. The procedure was comparable to embryonic microinjection in terms of precision and operational control. Notably, direct contact with the target cells is not necessary, as the injected virus solution can diffuse and effectively infect nearby cells.  

      We had attempted intraventricular injection as an alternative, but it failed to produce robust labeling, reinforcing the necessity for direct tissue injection. 

      We have now included additional methodological details in the Methods section (page 13). 

      (3) Because of the combination of transgenic lines, plasmid injection, and viral type, it is often confusing to follow exactly what is being done for a particular experiment. It would be useful to specify the transgenic background used for each experiment using standard nomenclature e.g. "Plasmids were injected into Tg(elavl3:GAL4) fish." This is particularly important for the experiments in Figure 4: it isn't clear what the background used for the sparse labels was.

      We thank the reviewer for bringing this issue to our attention. To improve clarity, we have revised the figure legends to explicitly state the transgenic background, injected plasmids, and viral type used in each experiment, particularly for Figure 4.

      (4) Plasmids should be deposited with Addgene along with maps specifying the particular "codon-optimized Tetoff" per 388. 

      We confirm that all plasmids, including those containing codon-optimized Tetoff constructs, have been uploaded to Addgene along with detailed maps.

      (5) It would be useful to know if there were more apoptotic cells after transfection -- an acridine orange or comparable assay is recommended, rather than loss of fluorescence. 

      We appreciate the reviewer’s suggestion to assess apoptosis using acridine orange staining or comparable assays. We agree that such methods can provide more direct detection of apoptotic events. However, we believe that the difference in cytotoxicity is already evident in our current data: SAD-infected cells exhibit greater loss than CVS-infected cells (see Figure 3D). This is consistent with previous observations in mice, where greater toxicity of SAD compared to CVS was demonstrated using propidium iodide (PI) staining in cultured cells (Reardon et al., 2016).

      (6) Line 219-228 Hibi's lab has described the subtypes of granule cells in detail already; the work should discuss the tracings with respect to previous characterizations instead of limiting that work to a citation.

      We thank the reviewer for raising this point. We have expanded the Results section (page 6) to discuss the subtypes of GCs and PCs in relation to previously reported characterizations.

      (7) "Activities" is often used when "activity" is correct. The use of English in the manuscript is, by and large, excellent, but it's worth running the text through software like Grammarly to catch the occasional error.

      We have carefully edited the manuscript using professional language editing tools to correct any grammatical issues.

      (8) The experiments in 2J-2L would be more convincing if they were performed on inferior olive inputs as well -- especially given the small size of the granule cells. 

      We acknowledge the reviewer's observation that granule cells (GCs) are relatively small, which may underlie the finding that, out of 33 stimulated GCs, only 15 were capable of eliciting calcium responses in putative postsynaptic PCs. However, in all 11 pairs where a single GC was successfully ablated, we observed a weakened calcium response in PCs after the ablation (see Figure 2M), suggesting our tracing approach specifically identifies synaptically coupled neurons. We have clarified this point in the revised manuscript (page 5).

      We agree that verifying the IO inputs to PCs would strengthen the validity of our findings. However, in our experiments, the probability of tracing upstream IO cells was relatively low. This may be due to the developmental immaturity of the synapse and the fact that each PC typically receives input from a single IO cell. Additionally, the deep and distant anatomical location of the IO presents technical challenges for paired electrical stimulation-calcium imaging studies. To address these limitations, we are currently exploring the integration of viral tracing and optogenetics to further investigate IO-PC connectivity in future studies.

      (9) It would be useful if the manuscript discussed the efficacy of trans-synaptic labelling. What fraction of granule cell / olivary inputs to a particular Purkinje cell do the authors think their method captures?

      This is an important point for assessing the efficacy of our trans-synaptic labeling. Ideally, electron microscopy (EM) data would provide the most precise evaluation. In the absence of EM data, we estimated the number of GCs, IOs and PCs using light microscopy-based cell counting. 

      At approximately 7 dpf, we manually counted 327 ± 14 PCs and 2318 ± 70 GCs in the Tg(2×en.cpce-E1B:tdTomato-CAAX) and Tg(cbln12:GAL4FF);Tg(5×UAS:EGFP) zebrafish cerebellum, across all subregions (Va, CCe, EG, and LCa). Given the developmental increase in the number of GCs, the fact that some GCs have exclusively ipsilateral projections, and the fact that a single PC would not receive input from all parallel fibers, we estimate that by 10–14 dpf, a single PC receives approximately 1000–2000 GC inputs. Under optimal tracing conditions, we observed an average of 20 labeled GC inputs per PC, yielding a capture fraction of ~1–2%. Although this represents only a subset of total inputs, it is consistent with mammalian studies (Wall et al., 2010; Callaway et al., 2015), suggesting inherent limitations of this viral labeling approach.

      For IO inputs, we counted 325 ± 26 inferior olivary neurons in Tg(elavl3:H2B-GCaMP6s) fish. A single PC likely receives input from one IO neuron, though an IO neuron may innervate multiple PCs. Accordingly, the observed capture rate for IO inputs was lower (7 out of 248 starters). 
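      The capture estimates above follow from simple ratios, reproduced here as a short sketch. The numbers are those given in the response (20 labeled GCs per PC against an estimated 1000–2000 GC inputs, and 7 IO-labeled cases among 248 starters); the function name is hypothetical.

```python
def capture_fraction(labeled: float, total: float) -> float:
    """Fraction of the estimated inputs (or starters) that were labeled."""
    return labeled / total


# Granule cell inputs: about 20 labeled per PC, out of an estimated 1000-2000.
gc_fraction_low = capture_fraction(20, 2000)   # lower bound of the estimate
gc_fraction_high = capture_fraction(20, 1000)  # upper bound of the estimate

# Inferior olive inputs: 7 labeled cases among 248 starter cells.
io_fraction = capture_fraction(7, 248)

print(f"GC capture: {gc_fraction_low:.1%} to {gc_fraction_high:.1%}")
print(f"IO capture: {io_fraction:.1%} of starters")
```

      Note that the two ratios use different denominators: the GC figure is per estimated input to one PC, whereas the IO figure is per starter cell.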

      Further optimization is required to enhance the tracing efficiency. We have now incorporated a Discussion on this point in the revised manuscript (page 8).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The manuscript proposes that 5mC modifications to DNA, despite being ancient and widespread throughout life, represent a vulnerability, making cells more susceptible to both chemical alkylation and, of more general importance, reactive oxygen species. Sarkies et al take the innovative approach of introducing enzymatic genome-wide cytosine methylation system (DNA methyltransferases, DNMTs) into E. coli, which normally lacks such a system. They provide compelling evidence that the introduction of DNMTs increases the sensitivity of E. coli to chemical alkylation damage. Surprisingly they also show DNMTs increase the sensitivity to reactive oxygen species and propose that the DNMT generated 5mC presents a target for the reactive oxygen species that is especially damaging to cells. Evidence is presented that DNMT activity directly or indirectly produces reactive oxygen species in vivo, which is an important discovery if correct, though the mechanism for this remains obscure.

      Strengths:

      This work is based on an interesting initial premise, it is well-motivated in the introduction and the manuscript is clearly written. The results themselves are compelling.

      We thank the reviewer for their positive response to our study.  We also really appreciate the thoughtful comments raised.  We have addressed the comments raised as detailed below. 

      Weaknesses:

      I am not currently convinced by the principal interpretations and think that other explanations based on known phenomena could account for key results. Specific points below.

      (1) As noted in the manuscript, AlkB repairs alkylation damage by direct reversal (DNA strands are not cut). In the absence of AlkB, repair of alkylation damage/modification is likely through BER or other processes involving strand excision and resulting in single stranded DNA. It has previously been shown that 3mC modification from MMS exposure is highly specific to single stranded DNA (PMID:20663718) occurring at ~20,000 times the rate as double stranded DNA. Consequently, the introduction of DNMTs is expected to introduce many methylation adducts genome-wide that will generate single stranded DNA tracts when repaired in an AlkB deficient background (but not in an AlkB WT background), which are then hyper-susceptible to attack by MMS. Such ssDNA tracts are also vulnerable to generating double strand breaks, especially when they contain DNA polymerase stalling adducts such as 3mC. The generation of ssDNA during repair is similarly expected to follow the H2O2 or TET based conversion of 5mC to 5hmC or 5fC, neither of which can be directly repaired and both of which depend on single strand excision for their removal. The potential importance of ssDNA generation in the experiments has not been considered.

      We thank the reviewer for this interesting and insightful suggestion.  Our interpretation of our findings is that a subset of MMS-induced DNA damage, specifically 3mC, overlaps with the damage introduced by DNMTs and this accounts for increased sensitivity to MMS when DNMTs are expressed.  However, the idea that the introduction of 3mC by DNMT actually makes the DNA more liable to damage by MMS, potentially through increasing the level of ssDNA, is also a potential explanation, which could operate in addition to the mechanism that we propose.

      (2) The authors emphasise the non-additivity of the MMS + DNMT + alkB experiment but the interpretation of the result is essentially an additive one: that both MMS and DNMT are introducing similar/same damage and AlkB acts to remove it. The non-additivity noted would seem to be more consistent with the ssDNA model proposed in #1. More generally non-additivity would also be seen if the survival to DNA methylation rate is non-linear over the range of the experiment, for example if there is a threshold effect where some repair process is overwhelmed. The linearity of MMS (and H2O2) exposure to survival could be directly tested with a dilution series of MMS (H2O2).

      We thank the reviewer for this point.  As in the response to point #1, the reviewer’s hypothesis of increased potency of MMS, potentially through increased ssDNA, downstream of 3mC induction by DNMT, is a good one.  We have added a dose-response curve for DNMT-expressing cells to MMS to the revised version of the manuscript.  This shows that there is a non-linear response to MMS in the WT background.  Sensitivity is exacerbated by expression of DNMT and alkB mutation individually but there is also a strong non-additive effect that is particularly marked at low MMS concentrations where sensitivity is much higher in the double mutant than predicted from the two single mutants.  This is consistent with induction of DNA damage by DNMT that is repaired by alkB because alkB can be ‘overwhelmed’ even in WT backgrounds as the reviewer suggests.  However, it is also perfectly possible that the effect is due to increased levels of DNA damage induction in DNMT-expressing cells.  Both these results are compatible with our central hypothesis, namely that DNMT expression induces 3mC.  We have included these results along with discussion of them in the revised text in the results section:

      To investigate the non-additivity between DNMT expression and alkB mutation further, we examined the effect of MMS over a range of concentrations for the different strains (Supplemental Figure 1A).  We quantified the non-additivity by comparing the survival of alkB mutants expressing DNMT to the predicted combined effect of alkB mutation alone and DNMT expression alone (Supplemental Figure 1B).  Survival was significantly lower than expected, most notably at low concentrations of MMS; the weaker effect at high concentrations could reflect saturation, because alkB mutants expressing DNMT already showed extremely high levels of sensitivity there.  The non-linear shape of the graph observed for WT cells expressing DNMTs further suggests that the ability of AlkB to repair the DNA is overwhelmed at high MMS concentrations even in the WT background.  These results are consistent with the idea that AlkB repairs a form of DNA damage from MMS that is more prevalent when DNMT is expressed.  This could be because DNMT induces 3mC, repaired by AlkB, and further 3mC is induced by MMS leading to much higher 3mC levels in the absence of AlkB activity.  Alternatively, 3mC induction by DNMT may lead to increased levels of ssDNA, particularly in alkB mutants, which could increase the risk of further DNA damage by MMS exposure and heighten sensitivity.  Either of these mechanisms is consistent with induction of 3mC by DNMT, and indicates that the induction of DNA damage by DNMT expression has a fitness cost for cells when exposed to genotoxic stress in their environment.
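      The response does not state how the "predicted combined effect" was computed. One common null model, assumed here purely for illustration, treats the two perturbations as acting independently, so that the expected survival of the double condition is the product of the two single-condition survival fractions. The function names and the survival values below are hypothetical and are not data from the manuscript.

```python
def expected_survival_independent(surv_alkb: float, surv_dnmt: float) -> float:
    """Expected relative survival of alkB mutants expressing DNMT under an
    assumed multiplicative (independent-effects) null model."""
    return surv_alkb * surv_dnmt


def interaction_ratio(observed: float, expected: float) -> float:
    """Observed / expected survival; values below 1 indicate that the double
    condition is more sensitive than the null model predicts."""
    return observed / expected


# Hypothetical survival fractions relative to untreated WT at one MMS dose.
surv_alkb_only = 0.5   # alkB mutation alone
surv_dnmt_only = 0.8   # DNMT expression alone
surv_double = 0.1      # alkB mutant expressing DNMT

expected = expected_survival_independent(surv_alkb_only, surv_dnmt_only)
ratio = interaction_ratio(surv_double, expected)
print(f"expected {expected:.2f}, observed {surv_double:.2f}, ratio {ratio:.2f}")
```

      Repeating this calculation at each MMS concentration would show where the departure from the null model is largest. An additive model on a log-survival scale is equivalent to this multiplicative form.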

      (3) The substantial transcriptional changes induced by DNMT expression (Supplemental Figure 4) are a cause for concern and highlight that the ectopic introduction of methylation into a complex system is potentially more confounded than it may at first seem. Though the expression analysis shows bulk transcription properties, my concern is that the disruptive influence of methylation in a system not evolved with it adds not just consistent transcriptional changes but transcriptional heterogeneity between cells which could influence net survival in a stressed environment. In practice I don't think this can be controlled for, possibly quantified by single-cell RNA-seq but that is beyond the reasonable scope of this paper.

      We fully agree with the reviewer and, indeed, we are very interested in what is driving the transcriptional changes that we observed.  Work is currently underway in the lab to investigate this further but, as the reviewer suggests, is beyond the scope of this paper.  Importantly, we have used the transcriptional data to determine that the effect of DNMTs on ROS is unlikely to be due to failure of ROS-induced detoxification mechanisms by investigating the expression of oxyR regulated genes.  Nevertheless we have explicitly mentioned the concern raised by the reviewer in the revised manuscript as follows:

      “The substantial transcriptional responses could potentially affect how individual cells respond to genotoxic stress and thus could be contributing to some of the excess sensitivity to MMS and H2O2 in cells expressing DNMTs. However, the induction of oxyR regulated genes such as catalase was unaffected by 5mC (Supplementary Figure 4B).  Thus, the increased sensitivity to H2O2 is unlikely to be caused by failure of detoxification gene induction by DNMT expression.”

      (4) Figure 4 represents a striking result. From its current presentation it could be inferred that DNMTs are actively promoting ROS generation from H2O2 and also to a lesser extent in the absence of exogenous H2O2. That would be very surprising and a major finding with far-reaching implications. It would need to be further validated, for example by in vitro reconstitution of the reaction and monitoring ROS production. Rather, I think the authors are proposing that some currently undefined, indirect consequence of DNMT activity promotes ROS generation, especially when exogenous H2O2 is available. It would help if this were clarified.

      We thank the reviewer for picking this up.  In the discussion, we raise two possible explanations for why DNMT (even without H2O2) increases the ROS levels.  One idea is direct activity of DNMT, and one is through the product of DNMT activity (5mC) acting as a platform to generate more ROS from endogenous or exogenous sources.  Whilst we attempted to measure ROS from mSSSI activity in vitro, this experiment gave inconsistent results and therefore we cannot distinguish between these two possibilities.  However, we argued that direct activity is less likely, exactly as the reviewer points out.  We have clarified our discussion in the revised version, rewriting the entire section titled "Oxidative stress as a new source of DNA damage induction by DNMT expression" to more clearly set out these possibilities.

      Reviewer #2 (Public review):

      5-methylcytosine (5mC) is a key epigenetic mark in DNA and plays a crucial role in regulating gene expression in many eukaryotes including humans. The DNA methyltransferases (DNMTs) that establish and maintain 5mC, are conserved in many species across eukaryotes, including animals, plants, and fungi, mainly in a CpG context. Interestingly, 5mC levels and distributions are quite variable across phylogenies with some species even appearing to have no such DNA methylation.

      This interesting and well-written paper discusses the continuation of some of the authors' work published several years ago. In that previous paper, the laboratory demonstrated that DNA methylation pathways coevolved with DNA repair mechanisms, specifically with the alkylation repair system. Specifically, they discovered that DNMTs can introduce alkylation damage into DNA, specifically in the form of 3-methylcytosine (3mC). (This appears to be an error in the DNMT enzymatic mechanism where the generation 3mC as opposed to its preferred product 5-methylcytosine (5mC), is caused by the flipped target cytosine binding to the active site pocket of the DNMT in an inverted orientation.) The presence of 3mC is potentially toxic and can cause replication stress, which this paper suggests may explain the loss of DNA methylation in different species. They further showed that the ALKB2 enzyme plays a crucial role in repairing this alkylation damage, further emphasizing the link between DNA methylation and DNA repair.

      The co-evolution of DNMTs with DNA repair mechanisms suggests there can be distinct advantages and disadvantages of DNA methylation to different species which might depend on their environmental niche. In environments that expose species to high levels of DNA damage, high levels of 5mC in their genome may be disadvantageous. This present paper sets out to examine the sensitivity of an organism to genotoxic stresses such as alkylating and oxidizing agents as a consequence of DNMT activity. Since such a study in eukaryotes would be complicated by DNA methylation controlling gene regulation, these authors cleverly utilize Escherichia coli (E. coli) and incorporate into it the DNMTs from other bacteria that methylate the cytosines of DNA in a CpG context like that observed in eukaryotes; the active sites of these enzymes are very similar to those of eukaryotic DNMTs and basically utilize the same catalytic mechanism (also, this strain of E. coli does not specifically degrade this methylated DNA).

      The experiments in this paper more than adequately show that E. coli expressing these DNMTs (compared with the same strain without the DNMTs) does indeed show increased sensitivity to alkylating agents and that this sensitivity was even greater than expected when a DNA repair mechanism was inactivated. Moreover, they show that E. coli expressing this DNMT is more sensitive to oxidizing agents such as H2O2 and has exacerbated sensitivity when a DNA repair glycosylase is inactivated. Both propensities suggest that DNMT activity itself may generate additional genotoxic stress. Intrigued that DNMT expression itself might induce sensitivity to oxidative stress, the experimenters used a fluorescent sensor to show that H2O2-induced reactive oxygen species (ROS) are markedly enhanced with DNMT expression. Importantly, they show that DNMT expression alone gave rise to increased ROS amounts and that H2O2 addition and DNMT expression together have a greater effect than the linear combination of the two separately. They also carefully checked that the increased sensitivity to H2O2 was not potentially caused by some effect of DNMT expression and activity on the expression of detoxification genes. Finally, by using mass spectrometry, they show that DNMT expression led to production of the 5mC oxidation derivatives 5-hydroxymethylcytosine (5hmC) and 5-formylcytosine (5fC) in DNA. 5fC is a substrate for base excision repair while 5hmC is not; more 5fC was observed. Introduction of non-bacterial enzymes that produce 5hmC and 5fC into the DNMT-expressing bacteria again showed a greater sensitivity than expected. Remarkably, in their assay with addition of H2O2, bacteria showed no growth with this dual expression of DNMT and these enzymes.

      Overall, the authors conduct well thought-out and simple experiments to show that a disadvantageous consequence of DNMT expression leading to 5mC in DNA is increased sensitivity to oxidative stress as well as alkylating agents.

      Again, the paper is well-written and organized. The hypotheses are well-examined by simple experiments. The results are interesting and can impact many scientific areas, ranging from our understanding of the evolutionary pressures the environment places on an organism to our understanding of how the environment of a malignant cell in the human body may lead to cancer.

      We thank the reviewer for their response to our study, and value the time taken to produce a public review that will aid readers in understanding the key results of our study. 

      Reviewer #3 (Public review):

      Summary:

      Krwawicz et al. present evidence that expression of DNMTs in E. coli (1) introduces alkylation damage that is repaired by AlkB; (2) confers hypersensitivity to alkylating agents such as MMS (an effect exacerbated by loss of AlkB); (3) confers hypersensitivity to oxidative stress (H2O2 exposure); (4) results in a modest increase in ROS in the absence of exogenous H2O2 exposure; and (5) results in the production of oxidation products of 5mC, namely 5hmC and 5fC, leading to cellular toxicity. The findings reported here have interesting implications for the concept that such genotoxic and potentially mutagenic consequences of DNMT expression (resulting in 5mC) could be selectively disadvantageous for certain organisms. The other aspect of this work which is important for understanding the biological endpoints of genotoxic stress is the notion that DNA damage per se somehow induces elevated levels of ROS.

      Strengths:

      The manuscript is well-written, and the experiments have been carefully executed providing data that support the authors' proposed model presented in Fig. 7 (Discussion, sources of DNA damage due to DNMT expression).

      Weaknesses:

      (1) The authors have established an informative system relying on expression of DNMTs to gauge the effects of such expression and subsequent induction of 3mC and 5mC on cell survival and sensitivity to an alkylating agent (MMS) and exogenous oxidative stress (H2O2 exposure). The authors state (p4) that Fig. 2 shows that "Cells expressing either M.SssI or M.MpeI showed increased sensitivity to MMS treatment compared to WT C2523, supporting the conclusion that the expression of DNMTs increased the levels of alkylation damage." This is a confusing statement and requires revision, as ALL cells shown in Fig. 2 are expressing DNMTs and have been treated with MMS. It is the absence of AlkB and the expression of DNMTs that causes the MMS sensitivity.

      We thank the reviewer for this and agree that this needs to be clarified with regard to the figure presented. The key comparison is between the active and inactive mSSSI, which shows increased sensitivity when active methyltransferases are expressed.  We have clarified this in the revised version of the manuscript as follows:

      “Cells expressing either M.SssI or M.MpeI showed increased sensitivity to MMS treatment compared to cells expressing inactive M.SssI”

      (2) It would be important to know whether the increased sensitivity (toxicity) to DNMT expression and MMS is also accompanied by substantial increases in mutagenicity. The authors should explain in the text why mutation frequencies were not also measured in these experiments.

      This is an important point because it is not immediately obvious that increased sensitivity would be associated with increased mutagenicity (if, for example, 3mC was never a cause of inaccurate DNA repair even in the absence of AlkB).  We have now added a Rif resistance assay which demonstrates increased mutagenesis in the presence of DNMT, and that this is exacerbated by loss of AlkB. This is now added as supplemental figure 2 and described in the manuscript as follows:

      “One potential consequence of DNMT activity in inducing DNA damage might be increased mutagenesis.  To test this we performed a rifampicin resistance mutagenesis assay, in the absence of MMS, to test whether DNMT induced damage was sufficient to lead to mutation rate increase.  Mutation rate was increased by DNMT expression (p=1.6e-12; two way anova; Supplemental Figure 2) and alkB mutation (two way anova) separately (p<1e-16).  Moreover, there was a significant interaction such that combined alkB mutation and DNMT expression led to a further increased mutation rate compared to the expectation from alkB mutation and DNMT expression separately (p = 7.9e-10; Supplemental Figure 2).  Importantly, DNMT induction alone would be expected to lead to increased mutations due to cytosine deamination (Sarkies, 2022a); however, there is a synergistic effect on mutations when this is combined with loss of AlkB function in alkB mutants. This is consistent with 3mC induction by DNMTs which is repaired by AlkB in WT cells but leads to mutations in alkB mutant cells.”

      (3) Materials and Methods. ROS production monitoring. The "Total Reactive Oxygen Species (ROS) Assay Kit" has not been adequately described. Who is the Vendor? What is the nature of the ROS probes employed in this assay? Which specific ROS correspond to "total ROS"?

      The ROS measurement was with a kit from ThermoFisher: https://www.thermofisher.com/order/catalog/product/88-5930-74.  The probe is DCFH-DA.  This is a general ROS sensor that is oxidised by a large number of cellular reactive oxygen species hence we cannot attribute the signal to a single species.  Use of a technique with the potential to more precisely identify the species involved is something we plan to do in future, but is beyond what we can do as part of this study.  We have added a comment as to the specificity of the ROS sensor in the revised version as follows:

      “The ROS detection reagent in this system is DCFH-DA, a generalised ROS sensor that is not specific to any particular ROS molecule.”     

      (4) The demonstration (Fig. 4) that DNMT expression results in elevated ROS and its further synergistic increase when cells are also exposed to H2O2 is the basis for the authors' discussion of DNA damage-induced increases in cellular ROS. S. cerevisiae does not possess DNMTs/5mC, yet exposure to MMS also results in substantial increases in intracellular ROS (Rowe et al. (2008) Free Rad. Biol. Med. 45:1167-1177. PMC2643028). The authors should be aware of previous studies that have linked DNA damage to intracellular increases in ROS in other organisms and should comment on this in the text.

      We thank the reviewer for this point.  We note that the increased ROS that we observed occurs in the presence of DNMTs alone and in the presence of H2O2, not in the presence of MMS; however, the point that DNA damage in general can promote increased ROS in some circumstances is well taken.  We have included a comment on this in the revised version as follows:

      “We believe this is a plausible mechanism to explain both increased ROS and increased sensitivity to oxidative stress when DNMT is expressed.  However, other explanations are possible, and it is notable that DNA damaging agents such as MMS can lead to ROS generation (Rowe et al., 2008).  A more detailed chemical and kinetic study of the ROS formation in DNMT-expressing cells would be needed to resolve these questions.”

    1. Author response:

      Reviewer #1 (Public review):

      In the current article, Octavia Soegyono and colleagues study "The influence of nucleus accumbens shell D1 and D2 neurons on outcome-specific Pavlovian instrumental transfer", building on extensive findings from the same lab. While there is a consensus about the specific involvement of the Shell part of the Nucleus Accumbens (NAc) in specific stimulus-based actions in choice settings (and not in General Pavlovian instrumental transfer, gPIT, as opposed to the Core part of the NAc), mechanisms at the cellular and circuitry levels remain to be explored. In the present work, using sophisticated methods (Cre-transgenic rat lines of both sexes, optogenetics, and the well-established behavioral paradigm of outcome-specific PIT, sPIT), Octavia Soegyono and colleagues decipher the differential contributions of dopamine D1 and D2 receptor-expressing spiny projection neurons (SPNs).

      After validating the viral strategy and the specificity of the targeting (immunochemistry and electrophysiology), the authors demonstrate that while both NAc Shell D1- and D2-SPNs participate in mediating sPIT, NAc Shell D1-SPN projections to the Ventral Pallidum (VP, previously demonstrated as crucial for sPIT), but not D2-SPN projections, mediate sPIT. They also show that these effects were specific to stimulus-based actions, as value-based choices were left intact in all manipulations.

      This is a well-designed study, and the results are well supported by the experimental evidence. The paper is extremely pleasant to read and adds to the current literature.

      We thank the Reviewer for their positive assessment.

      Reviewer #2 (Public review):

      Summary:

      This manuscript by Soegyono et al. describes a series of experiments designed to probe the involvement of dopamine D1 and D2 neurons within the nucleus accumbens shell in outcome-specific Pavlovian-instrumental transfer (osPIT), a well-controlled assay of cue-guided action selection based on congruent outcome associations. They used an optogenetic approach to phasically silence NAc shell D1 (D1-Cre mice) or D2 (A2a-Cre mice) neurons during a subset of osPIT trials. Both manipulations disrupted cue-guided action selection but had no effects on negative control measures/tasks (concomitant approach behavior, separate value-guided choice task), nor were any osPIT impairments found in reporter-only control groups. Separate experiments revealed that selective inhibition of NAc shell D1 but not D2 inputs to ventral pallidum was required for osPIT expression, thereby advancing understanding of the basal ganglia circuitry underpinning this important aspect of decision making.

      Strengths:

      The combinatorial viral and optogenetic approaches used here were convincingly validated through anatomical tract-tracing and ex vivo electrophysiology. The behavioral assays are sophisticated and well-controlled to parse cue and value-guided action selection. The inclusion of reporter-only control groups is rigorous and rules out nonspecific effects of the light manipulation. The findings are novel and address a critical question in the literature. Prior work using less decisive methods had implicated NAc shell D1 neurons in osPIT but suggested that D2 neurons may not be involved. The optogenetic manipulations used in the current study provide a more direct test of their involvement and convincingly demonstrate that both populations play an important role. Prior work had also implicated NAc shell connections to ventral pallidum in osPIT, but the current study reveals the selective involvement of D1 but not D2 neurons in this circuit. The authors do a good job of discussing their findings, including their nuanced interpretation that NAc shell D2 neurons may contribute to osPIT through their local regulation of NAc shell microcircuitry.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      The current study exclusively used an optogenetic approach to probe the function of D1 and D2 NAc shell neurons. Providing a complementary assessment with chemogenetics or other appropriate methods would strengthen conclusions, particularly the novel demonstration of D2 NAc shell involvement. Likewise, the null result of optically inhibiting D2 inputs to the ventral pallidum leaves open the possibility that a more complete or sustained disruption of this pathway may have impaired osPIT.

      We acknowledge the reviewer's valuable suggestion that demonstrating NAc-S D1- and D2-SPN engagement in outcome-specific PIT through another technique would strengthen our optogenetic findings. Several approaches could provide this validation. Chemogenetic manipulation, as the reviewer suggested, represents one compelling option. Alternatively, immunohistochemical assessment of phosphorylated histone H3 at serine 10 (P-H3) offers another promising avenue, given its established utility in reporting striatal SPN plasticity in the dorsal striatum (Matamales et al., 2020). We hope to complete such an assessment in future work since it would address the limitations of previous work that relied solely on ERK1/2 phosphorylation measures in NAc-S SPNs (Laurent et al., 2014).

      Regarding the null result from optical silencing of D2 terminals in the ventral pallidum, we agree with the reviewer's assessment. While we acknowledge this limitation in the current manuscript (see discussion), we aim to address this gap in future studies to provide a more complete mechanistic understanding of the circuit.

      Reviewer #3 (Public review):

      Summary:

      The authors present data demonstrating that optogenetic inhibition of either D1- or D2-MSNs in the NAc Shell attenuates expression of sensory-specific PIT while largely sparing value-based decision on an instrumental task. They also provide evidence that SS-PIT depends on D1-MSN projections from the NAc-Shell to the VP, whereas projections from D2-MSNs to the VP do not contribute to SS-PIT.

      Strengths:

      This is clearly written. The evidence largely supports the authors' interpretations, and these effects are somewhat novel, so they help advance our understanding of PIT and NAc-Shell function.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      I think the interpretation of some of the effects (specifically the claim that D1-MSNs do not contribute to value-based decision making) is not fully supported by the data presented.

      We appreciate the reviewer's comment regarding the marginal attenuation of value-based choice observed following NAc-S D1-SPN silencing. While this manipulation did produce a slight reduction in choice performance, the behavior remained largely intact. We are hesitant to interpret this marginal effect as evidence for a direct role of NAc-S D1-SPNs in value-based decision-making, particularly given the substantial literature demonstrating that NAc-S manipulations typically preserve such choice behavior (Corbit & Balleine, 2011; Corbit et al., 2001; Laurent et al., 2012). Notably, previous work has shown that NAc-S D1 receptor blockade impairs outcome-specific PIT while leaving value-based choice unaffected (Laurent et al., 2014). We favor an alternative explanation for our observed marginal reduction. As documented in Supplemental Figure 1, viral transduction extended slightly into the nucleus accumbens core (NAc-C), a region established as critical for value-based decision-making (Corbit & Balleine, 2011; Corbit et al., 2001; Laurent et al., 2012). The marginal impairment may therefore reflect inadvertent silencing of a small NAc-C D1-SPN population rather than a functional contribution from NAc-S D1-SPNs. Future studies specifically targeting larger NAc-C D1-SPN populations would help clarify this possibility and provide definitive resolution of this question.

    1. This op-ed addresses the issue with the exponential increase in publications and how this is leading to a lower quality of peer review which, in turn, is resulting in more bad science being published. It is a well-written article that tackles a seemingly eternal topic. This piece focussed more on the positives and potential actions which is nice to see as this is a topic that can become stuck in the problems. There are places throughout that would benefit from more clarity and at times there appears to be a bias towards publishers, almost placing blame on researchers. Very simple word changes or headings could immediately resolve any doubt here as I don't believe this is the intention of the article at all.

      Additionally, this article is very focussed on peer review (a positive) but I think that it would benefit from small additions throughout that zoom out from this and place the discussion in the context of the wider issues - for example you cannot change peer review incentives without changing the entire incentives around "service" activities including teaching, admin etc. This occurs to a degree with the discussion on other outputs, including preprints and data. Moreover, when discussing service type activities, there is data that reveals certain demographics deliberately avoid this work. Adding this element into the article would provide a much stronger argument for change (and do some good in the current political climate).

      Overall, I thought this was a great piece when it was first posted online and does exactly what a good op-ed should - provoke thought and discussion. Below are some specific comments, in reading order. I do not believe that there are any substantial or essential changes required, particularly given that this is an op-ed article.

      -----

      Quote: "Academia is undergoing a rapid transformation characterized by exponential growth of scholarly outputs."

      Comment: There's an excellent paper providing evidence to this: https://direct.mit.edu/qss/article/5/4/823/124269/The-strain-on-scientific-publishing which would be a very positive addition

      Quote: "it’s challenging to keep up with the volume at which research publications are produced"

      Comment: Might be nice to add that this complaint dates back almost to the beginning of sharing research via print media, just to reinforce that this is a very old point.

      Quote: "submissions of poor-quality manuscripts"

      Comment: The use of "poor quality" here is unnecessary. Just because a submission is not accepted, it has no reflection on "quality". As such this does seem to needlessly diminish work rejected by one journal

      Quote: "Maybe there are too many poor quality journals too - responding to an underlying demand to publish low quality papers."

      Comment: This misses the flip side - poor quality journals encourage and actively drive low quality & outright fraudulent submissions due to the publisher dominance in the assessment of research and academics.

      Quote: "even after accounting for quality,"

      Comment: Quality is mentioned here but has yet to be clearly defined. What is "quality"? - how many articles a journal publishes? The "prestige" of a journal? How many people are citing the articles?

      Quote: "Researchers can – and do – respond to the availability by slicing up their work (and their data) into minimally publishable units"

      Comment: I fully agree that some researchers do exactly this. However, again, this seems to be blaming researchers for creating this firehose problem. I think this point could be reworded to not place so much blame or be substantiated with evidence that this is a widespread practice - my experience has been very mixed in that I've worked for people who do this almost to the extreme (and have very high self-citations) and also worked for people who focus on the science and making it as high quality and robust as possible. I agree many respond to the explosion of journals and varied quality in a negative manner but the journals, not researchers, are the drivers here.

      Quote: "least important aspect of the expected contributions of scholars."

      Comment: I think it may be worth highlighting here that sometimes specific demographics (white males) actively avoid these kinds of service activities - there's a good study on this providing data in support of this. It adds an extra dimension into the argument for appropriate incentives and the importance & challenges of addressing this.

      Quote: "high quality peer review"

      Comment: Just another comment on the use of "quality". This is not defined and I think when discussing these topics it is vital to be clear what one means by "high quality". For example, a high quality peer review that is designed as quality control would be detecting gross defects and fraud, preventing such work from being published (peer review does not reliably achieve this). In contrast, a high quality peer review designed to help authors improve their work and avoid hyperbole would be very detailed and collegial, not requesting large numbers of additional experiments.

      Quote: "conferring public trust in the oversight of science"

      Comment: I'm not convinced of this. Conveying peer review as a stamp of approval or QC leads to reduced trust when regular examples of peer review failures emerge - just look at hydroxychloroquine and how peer review was used to justify that during COVID or the MMR/autism issues that are still on-going even after the work was retracted. I think this should be much more carefully worded, removed or expanded on to provide this perspective - this occurs slightly in the following sentence but it is very important to be clear on this point.

      Quote: "Researchers hold an incredible amount of market power in scholarly publishing"

      Comment: I like the next few paragraphs but, again, this seems to be blaming researchers when they in fact hold no/little power. I agree that researchers *could* use market pressure but this is entirely unrealistic when their careers depend on publishing X papers in X journal. An argument as to why science feels increasingly non-collaborative perhaps. Funders can have immediate and significant changes. Institutions adopting reward structures, such as teaching for example, would have significant impacts on researcher behaviour. Researchers are adapting to the demands the publication system creates - more journals, greater quantity and reduced quality whilst maintaining control over the assessment - eLife being removed from WoS/Scopus is a prime example of publishers (via their parent companies) preventing innovation or even rather basic improvements.

      Quote: "With preprint review, authors participate in a system that views peer review not as a gatekeeping hurdle to overcome to reach publication but as a participatory exercise to improve scholarship."

      Comment: This is framing that I really like; improving scholarship, not quality control.

      Quote: "buy"

      Comment: typo

      Quote: "adoption of preprint review can shift the inaccurate belief that all preprints lack review"

      Comment: Is this the right direction for preprints though? If we force all preprints to be reviewed and only value reviewed-preprints, then we effectively dismantle the benefits of preprints and their potential that we've been working so hard to build. A recent op-ed by Alice Fleerackers et al provided an excellent argument to this effect. More a question than a suggestion for anything to change.

      Quote: "between all of those stakeholders to work together without polarization"

      Comment: I disagree here - publishers have repeatedly shown that their only real interest is money. Working with them risks undermining all of the effort (financial, careers, reputation, time) that advocates for change put in. The OA movement should also highlight perfectly why this is such a bad route to go down (again). Publishers' grip on preprint servers is a great example - those servers are hard to use as a reader, lack APIs and access to data, are not innovative or interacting with independent services. The community should make the rules and then publishers abide by and within them. Currently the publishers make all of the rules and dominate. Indeed, this is possibly the biggest omission from this article - the total dominance of publishers across the entire ecosystem. You can't talk about change without highlighting that the publishers don't just own journals but the reference managers, the assessment systems, the databases etc. I may be an outlier on this point but for all of the people I interact with (often those at the bottom of the ladder) this is a strong feeling. Again, not a suggestion for anything to change and indeed the point of an op-ed is to stimulate thought and discussion so dissent is positive.

      Note that these annotations were made in hypothes.is and are available here, linked in-text for ease - comments are duplicated in this review.

    2. Summary of the essay

      In this essay, the author seeks to explain the ‘firehose’ problem in academic research, namely the rapid growth in the number of articles but also the seemingly concurrent decline in quality. The explanation, he concludes, lies in the ‘superstructure’ of misaligned incentives and feedback loops that primarily drive publisher and researcher behaviour, with the current publish or perish evaluation system at the core. On the publisher side, these include commercial incentives driving both higher acceptance rates in existing journals and the launch of new journals with higher acceptance rates. At the same time, publishers seek to retain reputational currency by maintaining consistency and therefore brand power of scarcer, legacy-prestige journals. The emergence of journal cascades (automatic referrals from one journal to another journal within the same publisher) and the introduction of APCs (especially for special issues) also contribute to commercial incentives driving article growth. On the researcher side, he argues that there is an apparent demand from researchers for more publishing outlets and simultaneous salami slicing by researchers because authors feel they have to distribute relatively more publications among journals that are perceived to be of lower quality (higher acceptance rates) in order to gain equivalent prestige to that of a higher impact paper. The state of peer review also impacts the firehose. The drain of PhD qualified scientists out of academia, compounded by a lack of recognition for peer review, further contributes to the firehose problem because there are insufficient reviewers in the system, especially for legitimate journals. Moreover, what peer review is done is no guarantee of quality (in highly selective journals as well as ‘predatory’ ones). One of his conclusions is that there is not just a crisis in scholarly publishing but in peer review specifically and it is this crisis that will undermine science the most. Add AI into the mix of this publish or perish culture, and he predicts the firehose will burst.

      He suggests that the solution lies in researchers taking back power themselves by writing more but ‘publishing’ less. By writing more he means outputs beyond traditional journal publications such as policy briefs, blogs, preprints, data, code and so on, and that these should count as much as peer-reviewed publications. He places special emphasis on the potential role of preprints and on open and more collegiate preprint review acting as a filter upstream of the publishing firehose. He ends with a call for more collegiality across all stakeholders to align the incentives and thus alleviate the pressure causing the firehose in the first place.

      General Comment

      I enjoyed reading the essay and think the author does a good job of exposing multiple incentives and competing interests in the system. Although discussion of perverse incentives has been raised in many articles and blog posts, the author specifically focuses on some of the key commercial drivers impacting publishing and the responses of researchers to those drivers. I found the essay compellingly written and thought provoking although it took me a while to work through the various layers of incentives. In general, I agree with the incentives and drivers he has identified and especially his call for stakeholders to avoid polarization and work together to repair the system. Although I appreciate the need to have a focused argument, I did miss a more in-depth discussion about the equally complex layers of incentives for institutions, funders and other organisations (such as Clarivate) that also feed the firehose.

      I note that my perspective comes from a position of being deeply embedded in publishing for most of my career. This will have also impacted what I took away from the essay and the focus of my comments below.

      Main comments

      1. I especially liked the idea of a ‘superstructure’ of incentives as I think that gives a sense of the size and complexity of the problem. At the same time, by focusing on publisher incentives and researchers’ response to them he has missed out important parts of the superstructure contributing to the firehose, namely the role of institutions and funders in the system. Although this is implicit, I think it would have been worth noting more, in particular:

        • He mentions institutions and the role of tenure and promotion towards the end but not the extent of the immense and immobilizing power this wields across the system (despite initiatives such as DORA and CoARA).

        • Most review panels (researchers) assessing grants for funders are also still using journal publications as a proxy for quality, even if the funder policy states that journal name and rank should not be used.

        • Many institutions and universities still rely on the number and venue of publications. Although some notable institutions are moving away from this, the impact factor/journal rank is still largely relied on. This seems especially the case in China and India, for example, which have shown a huge growth in research output. Although the author discusses the firehose, it would have been interesting to see a regional breakdown of this.

        • Libraries also often negotiate with publishers based on volume of articles – i.e., they want evidence that they are getting more articles as they renegotiate a specific contract (e.g. transformative agreements), rather than e.g. also considering the quality of service.

        • Institutions are also driven by rankings in a parallel way to researchers being assessed based on journal rank (or impact factor). How university rankings are calculated is also often opaque (apart from the Leiden rankings), but publications form a core part. This further incentivises institutions to select researchers/faculty based on the number and venue of their publications in order to promote their own position in the rankings (and obtain funding).

      2. The essay is also about power dynamics and where power in the system lies. The implication in the essay is that power lies with the publishers and this can be taken back by researchers. Publishers do have power, especially those in possession of high prestige journals and yet publishers are also subject to the power of other parts of the system, such as funder and institutional evaluation policies. Crucially, other infrastructure organisations, such as Clarivate, that provide indexing services and citation metrics also exert a strong controlling force on the system, for example:

        • Only a subset of journals are ever indexed by Clarivate. Funders and institutions also use the indexing status of a journal as a proxy of quality. A huge number of journals are thus excluded from the evaluation system (primarily in the arts and humanities but also many scholar-led journals from low and middle income countries and also new journals). This further exacerbates the firehose problem because researchers often target only indexed journals. I’d be interested to see if the firehose problem also exists in journals that are not traditionally indexed (although I appreciate this is also likely to be skewed by discipline).

        • Indexers also take on the role of arbiters of journal quality and can choose to delist or list journals accordingly. Listing or delisting has a huge impact on the submission rates to journals that can be worth millions of dollars to a publisher, but it is often unclear how quality is assessed and there seems to be a large variance in who gets listed or not.

        • Clarivate are also paid large fees by publishers to use their products, which creates a potential conflict of interest for the indexer as delisting journals from major publishers could potentially cause a substantial loss of revenue if they withdraw their fees. Also Clarivate relies on publishers to create the journals on which their products are based which may also create a conflict if Clarivate wishes to retain the in-principle support of those publishers.

        • The recent delisting of eLife, even though it is an innovator and of established quality, shows the precariousness of journal indexing.

      3. All the stakeholders in the system seem to be essentially ‘following the money’ in one way or another – it’s just that the currency for researchers, institutions, publishers and others varies. Publishers – both commercial and indeed most not-for-profit – follow the requirements of the majority of their ‘customers’ (and that’s what authors, institutions, subscribers etc are in this system) in order to ensure both sustainability and revenue growth. This may be a legacy of the commercialisation of research in the 20th Century but we should not be surprised that growth is a key objective for any company. It is likely that commercial players will continue to play an important role in science and science communication; what needs to be changed are the requirements of the customers.

      4. The root of the problem, as the author notes, is what is valued in the system, which is still largely journal publications. The author’s solution is for researchers to write more – and for value to be placed on this greater range of outputs by all stakeholders. I agree with this sentiment – I am an ardent advocate for Open Science. And yet, I also think the focus on outputs per se and not practice or services is always going to lead to the system being gamed in some way in order to increase the net worth of a specific actor in the system. Preprints and preprint review itself could be subject to such gaming if value is placed on e.g. the preprint server or the preprint-review platform as a proxy of preprint and then researcher quality.

      5. I think the only way to start to change the system is to start placing much more value on both the practices of researchers (as well as outputs) and on the services provided by publishers. Of course saying this is much easier than implementing it.

      Other comments

      1. A key argument is that higher acceptance rates actually create a perverse incentive for researchers to submit as many manuscripts as possible because they are more likely to get accepted in journals with higher acceptance rates. I disagree that higher acceptance rates per se are the main incentive for researchers to publish more. More powerful is the fact that those responsible for grants and promotion continue to use quantity of journal articles as a proxy for research quality.

      2. Higher acceptance rates are not necessarily an indicator of low quality or a bad thing if it means that null, negative and inconclusive results are also published.

      3. The author states that Journal Impact Factors might have been an effective measure of quality in the past.  I take issue with this because the JIF has, as far as I know, always been driven by relatively few outliers (papers with very high citations) and I don’t know of evidence to show that this wasn’t also true in the past. It also makes the assumption that citations = quality.

      4. The author asks at one point “Why would field specialization need a lower threshold for publication if the merits of peer review are constant?” I can see a case for lower thresholds, however, when the purpose of peer review is primarily to select for high impact, rather than rigour, of the science conducted. A similar case might be made for multidisciplinary research, where peer reviewers tend to assess an article from their discipline’s perspective and reject it because the part that is relevant to them is not interesting enough… Of course, this all points to the inherent problems with peer review (on which I agree with the author).

      5. The author puts his essay in appropriate context, drawing on a range of sources to support his argument. I particularly like that he tried to find source material that was openly available.

      6. He cites two papers by Bjoern Brembs to substantiate the claim that there is potentially poorer review in higher prestige journals than in lower ranked journals. These papers were published in 2013 and 2018 and the conclusions relied, in part, on the fact that higher ranked journals had more retractions. Apart from a potential reporting bias, given the flood of retractions across multiple journals in more recent years, I doubt that this correlation still exists.

      7. The author works out submission rates from the published acceptance rates of journals. The author acknowledges this is only approximate and discusses several factors that could inflate or deflate it. I can add a few more variables that could impact the estimate, including: 1) the number of articles a publisher/journal rejects before articles are assigned to any editor (e.g. because of plagiarism, reporting issues or other research integrity issues); 2) the extent to which articles are triaged and rejected by editors before peer review (e.g. because they are out of scope or not sufficiently interesting to peer review); 3) the number of articles rejected after peer review; and 4) the extent to which authors independently withdraw an article at any stage of the process. When publishers publish acceptance rates, they don’t make it clear what goes into the numerator or the denominator and there are no community standards around this. The author rightly notes this process is too opaque.
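      As an illustration of the numerator/denominator point in comment 7, here is a minimal sketch of how the same journal can report very different acceptance rates, and how back-calculating submissions from a published rate then swings accordingly. All counts, the class name JournalYear, and the two rate definitions are hypothetical, chosen only for illustration; they are not taken from the essay or from any publisher.

```python
from dataclasses import dataclass


@dataclass
class JournalYear:
    """Hypothetical yearly counts for one journal (all numbers are made up)."""
    published: int
    rejected_pre_editor: int    # e.g. integrity checks before any editor sees it
    rejected_by_editor: int     # triaged before peer review
    rejected_after_review: int
    withdrawn: int              # withdrawn by authors at any stage

    def total_submissions(self) -> int:
        """Everything that entered the system, whatever happened to it."""
        return (self.published + self.rejected_pre_editor
                + self.rejected_by_editor + self.rejected_after_review
                + self.withdrawn)

    def rate_all_submissions(self) -> float:
        """Acceptance rate with every submission in the denominator."""
        return self.published / self.total_submissions()

    def rate_reviewed_only(self) -> float:
        """Acceptance rate counting only manuscripts that completed peer review."""
        return self.published / (self.published + self.rejected_after_review)


def estimate_submissions(published: int, acceptance_rate: float) -> float:
    """Back-calculate submissions as published / acceptance_rate."""
    if not 0 < acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be in (0, 1]")
    return published / acceptance_rate


# Assumed example counts: 200 published out of 1050 total submissions.
journal = JournalYear(published=200, rejected_pre_editor=300,
                      rejected_by_editor=400, rejected_after_review=100,
                      withdrawn=50)

# The estimate recovers the true total only if the published rate used
# the broad denominator; the narrow definition understates submissions.
estimate_from_broad_rate = estimate_submissions(
    journal.published, journal.rate_all_submissions())
estimate_from_narrow_rate = estimate_submissions(
    journal.published, journal.rate_reviewed_only())
```

      With these made-up counts, the back-calculated number of submissions depends entirely on which definition the published rate used, which is why an undisclosed denominator makes such estimates hard to interpret.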

      Catriona J. MacCallum

      As is my practice, I do not wish to remain anonymous. Please also note that I work for a large commercial publisher and am writing this review in an independent capacity such that this review reflects my own opinions, which are not necessarily those of my employer.

    3. This is a well written and clear enough piece that may be helpful for a reader new to the topic. To people familiar with the field there is not so much which is new here. The final recommendation is not well expressed. As currently put it is, I think, wrong. But it is a provocative idea. I comment section by section below.

      The first paragraphs repeat the well-established fact that there are too many papers. Seppelt et al’s contribution is missing here. It also reproduces the disingenuous claim, by a publisher’s employee, that publishers ‘only’ respond to demand. I do not think that is true. They create demand. They encourage authors to write and submit papers, as anyone who has been emailed by MDPI recently can testify. Why repeat something which is so inaccurate?

      The section on ‘upstream of the nozzle’ is rather confusing. I think the author is trying to establish if more work is being submitted. But this cannot be deduced from the data presented. No trends are given. Rejection rates will be a poor guide if the same paper is being rejected by several journals. I was also confused by the sources used to track growth in papers – why not just use Dimensions data? The final paragraph again repeats well known facts about the proliferation of outlets and salami slicing. Thus far the article has not introduced new arguments.

      Minor points in this section:

      • there are some unsupported claims, e.g. ‘This is a practice that is often couched within the seemingly innocuous guise of field specialty journals.’

      • I also do not understand the logic of this rather long sentence: ‘The expansion of journals with higher acceptance rates alters the rational calculus for researchers - all things being equal higher acceptance rates create a perverse incentive to submit as many manuscripts as possible since the underlying probability of acceptance is simply higher than if those same publications were submitted to a journal with a lower acceptance rate, and hence higher prestige.’ I suggest it be rephrased.

      The section on peer review (Who’s testing the water) is mostly a useful review of the issues. But there are some problems which need addressing. Bizarrely, when discussing whether there are enough scientists, it fails to mention Hanson et al’s global study, despite linking to its preprint in the opening lines. Instead the author adopts a parochial North American approach and refers only to PhDs coming from the US. Trends in one country cannot explain an international publishing scene. These are not the ‘good data’ the author claims. Likewise the value of data on doctorates not going on to a post-doc hinges on how many post-docs there are. That trend is not supplied. This statement ‘Almost everyone getting a doctorate goes into a non-university position after graduation’ may be true, but no supporting data are supplied to justify it. Nor do we know what country, or countries, the author is referring to.

      The section ‘A Sip from the Spring’ makes the mistaken claim that researchers hold market power. This is not true. Researchers’ institutions, their libraries and governments are the main source of publisher income. It is here that the key proposal for improvement is made: researchers can write more and publish less. But if the problem is that there is too much poorly reviewed literature then this cannot be the solution. Removing all peer review, would mean there is even more material to read whose appearance is not slowed up by peer review at all. If peer review is becoming inadequate, evading it entirely is hardly a solution.

      This does not mean we should not release pre-prints. The author is right to advocate them, but the author is mistaken to think that this will reduce publishing pressures. The clue is in their name ‘pre-print’. Publication is intended.

      Missing from the author’s argument is recognition of the important role that communities of researchers play, and the roles that journals play in providing venues for conversation, disagreement and discussion. They provide a filter. Yes, researchers produce other material than publications, as the author states: ‘grant proposals, editorials, policy briefs, blog posts, teaching curricula and lectures, software code and documentation, dataset curation, and labnotes and codebooks.’ I would add email and WhatsApp messages to that list. But adding all that to our reading lists will not reduce the volume of things to be read. It must increase it. And it would make it harder to marshal and search all those words.

      But the idea is provocative nonetheless. Running through this paper, and occasionally made explicit, is the fact that publishers earn billions from their ‘service’ to academia. They have a strong commercial interest in our publishing more, and in competing with each other to produce a larger share of the market. If writing more, and publishing less, means we need to find ways of directing our thoughts so that they earn less money for publishers, then that could bring real change to the system.

      A minor point: the fire hose analogy is fully exploited and rather laboured in this paper. But it is a North American term and image, that does not travel so easily.

    4. A few months back, Upstream editor Martin Fenner suggested that I submit my Upstream blog post, ‘Drinking from the Firehose? Write More and Publish Less’, for peer review through MetaROR as a sort of experiment for Upstream. MetaROR, a relative newcomer to the scholarly communication community, provides the review and curate steps in the "publish-review-curate" model for meta-research.

      While I do not consider myself a meta-researcher (a scholar who conducts research on research), many of my positions on science policy have implications for the field (especially those on transparency, openness, and reproducibility). I think the main call in my blog post for reform in scholarly communication – namely, to stop publishing in traditional journals as much and start rewarding a broader swath of scholarly activities like data sharing – is particularly appealing to meta-researchers who rely on non-publication outputs for their work. So, I submitted. The article was openly reviewed, and MetaROR provided an editorial assessment. Here, I reply to the reviewers and contribute to the curation of the original post.

      The reviews are very high-quality - in fact, they are some of the most well-reasoned reviews I've received in the 20 years I've been a scholar. If MetaROR represents the future of peer-review through the publish-review-curate model, scholarly communication is about to get a whole lot better. You can read the open reviews of my blog post here. The revised version of the editorial is here.

      Like traditional peer-review, each individual reviewer provided their feedback independently of the others and the handling editor did not curate the reviews. I prefer when editors do such curation since it helps to organize the response in a way that reduces redundancy. This is one of the main benefits of the group-based peer review systems - such as PREreview's Live Review. Also, there was no easy way (or at least not an obvious one) to export the reviews in plaintext from MetaROR so I could respond point-by-point in software of my choice. Below is an attempt to organize my response roughly around the major criticisms and suggestions in the review. Because this was an opinion piece and not research, I'm not going to respond to every point anyway – nearly all of which I would accept and revise accordingly had this been a research article.

      Too Easy on the Publishers, Too Hard on Researchers

      All three reviewers expressed some dismay over how light my criticism of the publishers was in my blog piece. I do not disagree. The reviewers rightfully point out that the publishers play an outsized role in the inequity created in the scholarly communication space. However, I am choosing not to revise much here, as the essay was already too long - it would have taken a tome to articulate my criticism of the publishers. That's out of scope. However, I revised the first paragraph in the conclusion to state:

      The publishers are incentivized to avoid any other form of reform - this is the rational option that publishers choose in response to the apparent demand from researchers - as Ciavarella rightly pointed out.

      Two of the reviewers also thought I was too harsh on researchers. I don't think that I was overly harsh. All three agree with me that researchers have some market role here but disagree about the extent to which they can exert influence. One reviewer claims researchers have no market power (with which I respectfully disagree). I've clarified in the paper that: 'the power any individual researcher has here is small. Collective action is needed.' I reject that researchers are blameless for the status quo - complacency empowers the publishers. Unfortunately, it's also baked into the superstructure of the reward system that is perpetuated by publisher-controlled market forces. I also added the following sentiment along these lines when discussing market-power of researchers:

      It's free to share and read research without the need for costly, anticompetitive gatekeeping. Leveraging that freedom is an untapped source of market power.

      Focus More on Institutions and Funders and Communities

      Two of the three reviewers thought I needed to draw more attention to the roles, demands, and influence that academic institutions, publisher consortia, libraries, indexing services, scholarly societies, and grassroots research organizations have in this ecosystem. I agree with all these points - and had Clarivate's irresponsible delisting of eLife in the Web of Science happened before I wrote the original piece, I would have highlighted that as one reviewer suggested.

      No New Arguments or Analysis

      The reviewers felt that, while well-articulated, the arguments I was espousing are not novel. First, I think it is worthwhile to renew the idea that we should be more selective in choosing what to publish in journals. Focusing on quality over quantity and valuing activities beyond journal publications should be repeated often until it's common practice.

      One comment called for more data and analysis, and another wanted some additional research cited. I think that's a great idea and I hope the reviewers can do that work or perhaps the open review will inspire others to do so.

      In response to the criticism that preprints themselves both presuppose an eventual traditional publication and that they could be gamed, I revised that section accordingly:

      There is risk of gaming preprints and preprint review just as there is in traditional publishing, such as by placing value on a paper for where it appears or how it was reviewed without considering its quality or contribution to science.

      One reviewer misunderstood my point about preprints altogether:

      Removing all peer review, would mean there is even more material to read whose appearance is not slowed up by peer review at all. If peer review is becoming inadequate, evading it entirely is hardly a solution. This does not mean we should not release pre-prints. The author is right to advocate them, but the author is mistaken to think that this will reduce publishing pressures. The clue is in their name ‘pre-print’. Publication is intended.

      I am absolutely not arguing for tossing out peer review. I strongly believe peer review is valuable but currently broken. Moreover, I reject that peer review needs to happen behind the gatekeeping of publishers. I revised to clarify here and added a footnote based on this reviewer's latter observation.

      Peer-review remains a critical check for pollutants in the waters - but the prevailing model needs significant reform. The traditional opaque, uncompensated system has eroded the quality, transparency, timeliness, and appropriateness of peer review due to competing priorities and a lack of appropriately aligned incentive structures. Novel models of peer review, including publish-review-curate, preprint review, and compensated review, are needed - ideally all done transparently and with conflicts of interest declared out in the open. At the same time, not all manuscripts need review to have value and most preprints with value (even those with reviews) should not be published in journals.

      New footnote: The term 'preprint' is evolving - what was once a moniker for a non-peer-reviewed manuscript intended to eventually become reviewed and published (or more likely, rejected) now scopes in other forms including publish-review-curate and manuscripts with preprint reviews. A new labeling and metadata system is desperately needed to highlight the state of review of a particular manuscript in a record of versions. Version control systems and badging are ubiquitous in the open-source software community and could be easily adopted here.

      Volume is Volume is Volume

      Probably the most important critique among the set of reviews points out an apparent recursion in the logic of the thesis that I need to clarify: you can't solve the firehose problem by writing more, as that just adds more volume to the flow. My revision to the conclusion clarifies my intent: what I'm proposing is to stop sending so many papers to journals for publication and to choose preprints more often for reading, reviewing, and writing. At the same time, the system should maintain or increase non-publication scholarly outputs and reward those too.

      "Write-More" here is a placeholder for all the non-publication writing scholars do and should get credit for from their institutions and fields. Again, I happen to focus on writing because that's what I care about in this editorial and it would take volumes to pontificate on all the other services and activities that happen within the academy that are not properly rewarded.

      Summary

      Having my blog post peer-reviewed through MetaROR was a positive experience and I recommend the service. However, my post was still just an editorial – my opinions and thoughts – not research. Had this been a research article, the reviews as presented would have been a very good roadmap to improving the paper. For MetaROR, I have two suggestions: 1) improve the editorial assessment by organizing the key points, and 2) create a way to download all reviews in plaintext for ease of importing into an editor.

      Acknowledgments

      Special thanks are owed to the reviewers, Catriona MacCallum, Dan Brockington, and Jonny Coates, the MetaROR handling editor Ludo Waltman, and to Upstream Editor and Front Matter founder Martin Fenner for the crazy idea to peer-review a blog post.

      Disclosure

      The opinions expressed here are my own and may not represent those of my employer, my associates, or the reviewers. I have no conflicts of interest to disclose.

      This author response was previously published on Upstream.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Measurement of BOLD MR imaging has regularly found regions of the brain that show reliable suppression of BOLD responses during specific experimental testing conditions. These observations are to some degree unexplained, in comparison with the more usual association between activation of the BOLD response and excitatory activation of the neurons (most tightly linked to synaptic activity) in the same brain location. This paper finds two patients whose brains were tested with both non-invasive functional MRI and with invasive insertion of electrodes, which allowed the direct recording of neuronal activity. The electrode insertions were made within the fusiform gyrus, which is known to process information about faces, in a clinical search for the sites of intractable epilepsy in each patient. The simple observation is that the electrode location in one patient showed activation of the BOLD response and activation of neuronal firing in response to face stimuli. This is the classical association. The other patient showed an informative and different pattern of responses. In this person, the electrode location showed a suppression of the BOLD response to face stimuli and, most interestingly, an associated suppression of neuronal activity at the electrode site.

      Strengths:

      Whilst these results are not by themselves definitive, they add an important piece of evidence to a long-standing discussion about the origins of the BOLD response. The observation of decreased neuronal activation associated with negative BOLD is interesting because, at various times, exactly the opposite association has been predicted. It has been previously argued that if synaptic mechanisms of neuronal inhibition are responsible for the suppression of neuronal firing, then it would be reasonable

      Weaknesses:

      The chief weakness of the paper is that the results may be unique in a slightly awkward way. The observation of positive BOLD and neuronal activation is made at one brain site in one patient, while the complementary observation of negative BOLD and neuronal suppression actually derives from the other patient. Showing both effects in both patients would make a much stronger paper.

      We thank reviewer #1 for their positive evaluation of our paper. Obviously, we agree with the reviewer that the paper would be much stronger if BOTH effects – spike increase and decrease – would be found in BOTH patients in their corresponding fMRI regions (lateral and medial fusiform gyrus) (also in the same hemisphere). Nevertheless, we clearly acknowledge this limitation in the (revised) version of the manuscript (p.8: Material and Methods section).

      Note that with respect to the fMRI data, our results are not surprising, as we indicate in the manuscript: BOLD increases to faces (relative to nonface objects) are typically found in the LatFG and BOLD decreases in the medialFG. In the revised version, we have added a reference to an early neuroimaging paper that describes this dissociation clearly:

      Pelphrey, K. A., Mack, P. B., Song, A., Güzeldere, G., & McCarthy, G. Faces evoke spatially differentiated patterns of BOLD activation and deactivation. Neuroreport 14, 955–959 (2003).

      This pattern of increase/decrease in fMRI can be appreciated in both patients in Figure 2, although one has to consider both the transverse and coronal slices to appreciate it.

      Regarding electrophysiological data, in the current paper, one could think that P1 shows only increases to faces, and P2 would show only decreases (irrespective of the region). However, that is not the case since 11% of P1’s face-selective units are decreases (89% are increases) and 4% of P2’s face-selective units are increases. This has now been made clearer in the revised manuscript (p.5).

      As the reviewer is certainly aware, the number and positions of the electrodes are based on strict clinical criteria, and we will probably never encounter a situation with two neighboring (macro-micro hybrid) electrodes, one with microelectrodes ending up in the lateral MidFG, the other in the medial MidFG, in the same patient. If there is no clinical value for the patient, this cannot be done.

      The only thing we can do is to strengthen these results in the future by collecting data on additional patients with an electrode either in the lateral or the medial FG, together with fMRI. But these are the only two patients we have been able to record so far with electrodes falling unambiguously in such contrasted regions and with large (and comparable) measures.

      While we acknowledge that the results may be unique because of the use of 2 contrasted patients only (and this is why the paper is a short report), the data is compelling in these 2 cases, and we are confident that it will be replicated in larger cohorts in the future.

      Finally, information regarding ethics approval has been provided in the paper.

      Reviewer #2 (Public review):

      Summary:

      This is a short and straightforward paper describing BOLD fMRI and depth electrode measurements from two regions of the fusiform gyrus that show either higher or lower BOLD responses to faces vs. objects (which I will call face-positive and face-negative regions). In these regions, which were studied separately in two patients undergoing epilepsy surgery, spiking activity increased for faces relative to objects in the face-positive region and decreased for faces relative to objects in the face-negative region. Interestingly, about 30% of neurons in the face-negative region did not respond to objects and decreased their responses below baseline in response to faces (absolute suppression).

      Strengths:

      These patient data are valuable, with many recording sessions and neurons from human face-selective regions, and the methods used for comparing face and object responses in both fMRI and electrode recordings were robust and well-established. The finding of absolute suppression could clarify the nature of face selectivity in human fusiform gyrus since previous fMRI studies of the face-negative region could not distinguish whether face < object responses came from absolute suppression, or just relatively lower but still positive responses to faces vs. objects.

      Weaknesses:

      The authors claim that the results tell us about both 1) face-selectivity in the fusiform gyrus, and 2) the physiological basis of the BOLD signal. However, I would like to see more of the data that supports the first claim, and I am not sure the second claim is supported.

      (1) The authors report that ~30% of neurons showed absolute suppression, but those data are not shown separately from the neurons that only show relative reductions. It is difficult to evaluate the absolute suppression claim from the short assertion in the text alone (lines 105-106), although this is a critical claim in the paper.

      We thank reviewer #2 for their positive evaluation of our paper. We understand the reviewer’s point, and we partly agree. Where we respectfully disagree is that the finding of absolute suppression is critical for the claim of the paper: finding an identical contrast between the two regions in terms of RELATIVE increase/decrease of face-selective activity in fMRI and spiking activity is already novel and informative. Where we agree with the reviewer is that the absolute suppression could be better documented: it wasn’t, due to space constraints (brief report). We provide below an example of a neuron showing absolute suppression to faces (P2), as also requested in the recommendations to authors. In the frequency domain, there is only a face-selective response (1.2 Hz and harmonics) but no significant response at 6 Hz (common general visual response). In the time domain, relative to face onset, the response drops below baseline level. This means that the neuron has baseline (non-periodic) spontaneous spiking activity that is actively suppressed when a face appears.

      Author response image 1.

      (2) I am not sure how much light the results shed on the physiological basis of the BOLD signal. The authors write that the results reveal "that BOLD decreases can be due to relative, but also absolute, spike suppression in the human brain" (line 120). But I think to make this claim, you would need a region that exclusively had neurons showing absolute suppression, not a region with a mix of neurons, some showing absolute suppression and some showing relative suppression, as here. The responses of both groups of neurons contribute to the measured BOLD signal, so it seems impossible to tell from these data how absolute suppression per se drives the BOLD response.

      It is a fact that we find both kinds of responses in the same region. We cannot tell with this technique if neurons showing relative vs. absolute suppression of responses are spatially segregated for instance (e.g., forming two separate sub-regions) or are intermingled. And we cannot tell from our data how absolute suppression per se drives the BOLD response. In our view, this does not diminish the interest and originality of the study, but the statement "that BOLD decreases can be due to relative, but also absolute, spike suppression in the human brain" has been rephrased in the revised manuscript: "that BOLD decreases can be due to relative, or absolute (or a combination of both), spike suppression in the human brain".

      Reviewer #3 (Public review):

      In this paper, the authors conduct two experiments: an fMRI experiment and intracranial recordings of neurons in two patients, P1 and P2. In both experiments, they employ an SSVEP paradigm in which they show images at a fast rate (e.g., 6 Hz), with face images appearing at a slower rate (e.g., 1.2 Hz) among a variety of object images. In the first patient, they record from neurons in a region of the mid-fusiform gyrus that is face-selective, and in the second patient, they record from neurons in a more medial region that is not face-selective (it responds more strongly to objects than to faces). The results show similar selectivity in the electrophysiology data and the fMRI data: the location with a higher fMRI response to faces also contains face-selective neurons, and the location with a preference for non-faces also contains non-face-preferring neurons.

      Strengths:

      The data is important in that it shows that there is a relationship between category selectivity measured from electrophysiology data and category selectivity measured from fMRI. The data is unique as it contains many single- and multi-unit recordings (245 units) from the human fusiform gyrus, which, as the authors point out, is a hominoid-specific gyrus.

      Weaknesses:

      My major concerns are two-fold:

      (i) There is a paucity of data; thus, more information (results and methods) is warranted. In particular, there is no comparison between the fMRI data and the SEEG data.

      We thank reviewer #3 for their positive evaluation of our paper. If the reviewer means paucity of data presentation, we agree and we provide more presentation below, although the methods and results information appear complete to us. The comparison between fMRI and SEEG is there, but can only be indirect (i.e., the data were collected at different times and cannot be related on a trial-by-trial basis, for instance). In addition, our manuscript aims at providing a short empirical contribution to further our understanding of the relationship between neural responses and BOLD signal, not to provide a model of neurovascular coupling.

      (ii) One main claim of the paper is that there is evidence for suppressed responses to faces in the non-face selective region. That is, the reduction in activation to faces in the non-face selective region is interpreted as a suppression in the neural response and consequently the reduction in fMRI signal is interpreted as suppression. However, the SSVEP paradigm has no baseline (it alternates between faces and objects) and therefore it cannot distinguish between lower firing rate to faces vs suppression of response to faces.

      We understand the concern of the reviewer, but we respectfully disagree that our paradigm cannot distinguish between lower firing rate to faces vs. suppression of response to faces. Indeed, since the stimuli are presented periodically (6 Hz), we can objectively distinguish stimulus-related activity from spontaneous neuronal firing. The baseline corresponds to spikes that are non-periodic, i.e., unrelated to the (common face and object) stimulation. For a subset of neurons, even this non-periodic baseline activity is suppressed, above and beyond the suppression of the 6 Hz response illustrated in Figure 2. We mention it in the manuscript, but we agree that we do not present illustrations of such a decrease in the time domain for single units (SU), which we did not initially consider necessary (please see below for such a presentation).

      (1) Additional data: the paper has 2 figures: figure 1 which shows the experimental design and figure 2 which presents data, the latter shows one example neuron raster plot from each patient and group average neural data from each patient. In this reader's opinion this is insufficient data to support the conclusions of the paper. The paper will be more impactful if the researchers would report the data more comprehensively.

      We respond to the more specific requests for additional evidence below, but the reviewer should be aware that this is a short report, which reaches the word limit. In our view, the group average neural data should be sufficient to support the conclusions, and the example neurons are there for illustration. And while we cannot provide the raster plots for a large number of neurons, the anonymized data is made available at:

      (a) There is no direct comparison between the fMRI data and the SEEG data, except for a comparison of the location of the electrodes relative to the statistical parametric map generated from a contrast (Fig 2a,d). It will be helpful to build a model linking between the neural responses to the voxel response in the same location - i.e., estimate from the electrophysiology data the fMRI data (e.g., Logothetis & Wandell, 2004).

      As mentioned above, the comparison between fMRI and SEEG is indirect (i.e., the data were collected at different times and cannot be related on a trial-by-trial basis, for instance) and would not allow such a model to be built.

      (b) More comprehensive analyses of the SSVEP neural data: It will be helpful to show the results of the frequency analyses of the SSVEP data for all neurons to show that there are significant visual responses and significant face responses. It will be also useful to compare and quantify the magnitude of the face responses compared to the visual responses.

      The data has been analyzed comprehensively, but we would not be able to show all neurons with such significant visual responses and face-selective responses.

      (c) The neuron shown in E shows cyclical responses tied to the onset of the stimuli, is this the visual response?

      Correct, it’s the visual response at 6 Hz.

      If so, why is there an increase in the firing rate of the neuron before the face stimulus is shown in time 0?

      Because the stimulation is continuous. What is displayed at 0 is the onset of the face stimulus, with each face stimulus being preceded by 4 images of nonface objects.

      The neuron's data seems different from the average response across neurons. This raises a concern about interpreting the average response across neurons in panel F, which seems different from the single neuron responses.

      The reviewer is correct, and we apologize for the confusion. This is because the average data in panel F has been notch-filtered to remove the 6 Hz response (and its harmonics), as indicated in the methods (p.11): 'a FFT notch filter (filter width = 0.05 Hz) was then applied on the 70 s single or multi-units time-series to remove the general visual response at 6 Hz and two additional harmonics (i.e., 12 and 18 Hz)'.

      Here is the same data without the notch-filter (the 6Hz periodic response is clearly visible):

      Author response image 2.

      For sake of clarity, we prefer presenting the notch-filtered data in the paper, but the revised version makes it clear in the figure caption that the average data has been notch-filtered.
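      To make the filtering step quoted above concrete, here is a minimal sketch of an FFT notch filter. It is an illustration under stated assumptions, not the authors' code: the function name `fft_notch`, the 1 kHz sampling rate, the synthetic sine-wave signal, and the reading of "filter width = 0.05 Hz" as a total width centered on each notch frequency are all assumptions made for the example.

```python
import numpy as np

def fft_notch(signal, fs, notch_freqs, width=0.05):
    """Zero every FFT bin within +/- width/2 Hz of each notch frequency.

    Interpreting `width` as the total notch width is an assumption.
    """
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    for f0 in notch_freqs:
        spectrum[np.abs(freqs - f0) <= width / 2] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

fs = 1000.0                                  # assumed sampling rate (Hz)
t = np.arange(int(70 * fs)) / fs             # one 70 s sequence

# Synthetic "general visual" response at 6 Hz plus two harmonics.
general = (np.sin(2 * np.pi * 6 * t)
           + 0.5 * np.sin(2 * np.pi * 12 * t)
           + 0.25 * np.sin(2 * np.pi * 18 * t))
# Synthetic face-selective component at 1.2 Hz.
face = 0.3 * np.sin(2 * np.pi * 1.2 * t)

filtered = fft_notch(general + face, fs, [6.0, 12.0, 18.0])
```

      In this toy case the 6, 12 and 18 Hz components are removed and the 1.2 Hz component is left untouched, which is why a notch-filtered average no longer shows the cyclical 6 Hz response seen in a single-neuron raster.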

      (d) Related to (c) it would be useful to show raster plots of all neurons and quantify if the neural responses within a region are homogeneous or heterogeneous. This would add data relating the single neuron response to the population responses measured from fMRI. See also Nir 2009.

      We agree with the reviewer that this is interesting, but again we do not think that it is necessary for the point made in the present paper. Responses in these regions appear rather heterogeneous, and we are currently working on a longer paper with additional SEEG data (other patients tested for shorter sessions) to define and quantify the face-selective neurons in the MidFusiform gyrus with this approach (without relating it to the fMRI contrast as reported here).

      (e) When reporting group average data (e.g., Fig 2C,F) it is necessary to show standard deviation of the response across neurons.

      We agree with the reviewer and have modified Figure 2 accordingly in the revised manuscript.

      (f) Is it possible to estimate the latency of the neural responses to face and object images from the phase data? If so, this will add important information on the timing of neural responses in the human fusiform gyrus to face and object images.

      The fast periodic paradigm to measure neural face-selectivity has been used in tens of studies since its original reports:

      In this paradigm, the face-selective response spreads to several harmonics (1.2 Hz, 2.4 Hz, 3.6 Hz, etc.) (which are summed for quantifying the total face-selective amplitude). This is illustrated below by the averaged single units’ SNR spectra across all recording sessions for both participants.

      Author response image 3.

      There is no unique phase-value, each harmonic being associated with a phase-value, so that the timing cannot be unambiguously extracted from phase values. Instead, the onset latency is computed directly from the time-domain responses, which is more straightforward and reliable than using the phase. Note that the present paper is not about the specific time-courses of the different types of neurons, which would require a more comprehensive report, but which is not necessary to support the point made in the present paper about the SEEG-fMRI sign relationship.
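      The harmonic summation mentioned in this response can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the function name, the number of harmonics, the synthetic signal, and the choice to skip harmonics that coincide with the 6 Hz general response are assumptions.

```python
import numpy as np

def summed_harmonic_amplitude(signal, fs, base=1.2, general=6.0, n_harmonics=8):
    """Sum FFT amplitudes at harmonics of `base`, skipping multiples of `general`."""
    n = len(signal)
    amp = 2.0 * np.abs(np.fft.rfft(signal)) / n      # single-sided amplitude
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    ratio = round(general / base)                    # 5 for 6 Hz / 1.2 Hz
    total = 0.0
    for k in range(1, n_harmonics + 1):
        if k % ratio == 0:
            continue                                 # 6 Hz, 12 Hz, ...: not face-specific
        total += amp[np.argmin(np.abs(freqs - k * base))]
    return total

fs = 200.0                                           # assumed sampling rate (Hz)
t = np.arange(int(70 * fs)) / fs                     # 70 s sequence
signal = (1.0 * np.sin(2 * np.pi * 1.2 * t)          # face-selective harmonics
          + 0.5 * np.sin(2 * np.pi * 2.4 * t)
          + 0.25 * np.sin(2 * np.pi * 3.6 * t)
          + 2.0 * np.sin(2 * np.pi * 6.0 * t))       # general visual response

total = summed_harmonic_amplitude(signal, fs)
```

      Each harmonic carries its own phase, so the sum of amplitudes quantifies the response magnitude but does not yield a single phase value, consistent with the point made above about latency.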

      (g) Related to (e): In total, the authors recorded data from 245 units (some single units and some multiunits) and they found that, in both the face-selective and non-face-selective regions, most of the recorded neurons exhibited face-selectivity, which this reader found confusing. They write: "Among all visually responsive neurons, we found a very high proportion of face-selective neurons (p < 0.05) in both activated and deactivated MidFG regions (P1: 98.1%; N = 51/52; P2: 86.6%; N = 110/127)". Is the face selectivity in P1 an increase in response to faces and in P2 a reduction in response to faces, or is it an increase in response to faces in both?

      Face-selectivity is defined as a DIFFERENTIAL response to faces compared to objects, not necessarily a larger response to faces. So yes, face-selectivity in P1 is an increase in response to faces and P2 a reduction in response to faces.

      Additional methods

      (a) it is unclear if the SSVEP analyses of neural responses were done on the spikes or the raw electrical signal. If the former, how is the SSVEP frequency analysis done on discrete data like action potentials?

      The FFT is applied directly on spike trains using Matlab’s discrete Fourier Transform function. This function is suitable to be applied to spike trains in the same way as to any sampled digital signal (here, the microwires signal was sampled at 30 kHz, see Methods).

      In complementary analyses, we also attempted to apply the FFT on spike trains that had been temporally smoothed by convolving them with a 20 ms square window (Le Cam et al., 2023, cited in the paper). This did not change the outcome of the frequency analyses in the frequency range we are interested in. We have also added one sentence with information in the methods section about spike detection (p.10).
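      As an illustration of applying the FFT directly to a discrete spike train, the sketch below builds a toy binary spike train and inspects its spectrum. It uses NumPy rather than Matlab, and everything about the toy neuron is an assumption made for the example: the 1200 Hz bin rate (the recordings themselves were sampled at 30 kHz), the fixed 50 ms latency, and the rule that the neuron fires once per object image and stays silent for faces.

```python
import numpy as np

FS = 1200            # assumed bin rate in Hz, chosen so one image = 200 samples
DURATION_S = 70      # sequence length quoted in the methods
STIM_HZ = 6.0        # general stimulation rate
FACE_EVERY = 5       # every 5th image is a face, i.e. 1.2 Hz
LATENCY = 60         # hypothetical response latency in samples (50 ms)

n = FS * DURATION_S
period = int(FS / STIM_HZ)
spikes = np.zeros(n)
for i in range(int(DURATION_S * STIM_HZ)):
    is_face = (i % FACE_EVERY) == FACE_EVERY - 1   # 4 objects, then 1 face
    if not is_face:                                # toy neuron: no spike for faces
        spikes[i * period + LATENCY] = 1.0

def amplitude_at(signal, freq_hz, fs=FS):
    """Unnormalized FFT magnitude at the bin closest to freq_hz."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return spectrum[np.argmin(np.abs(freqs - freq_hz))]

# Same train smoothed with a 20 ms square window (24 samples at 1200 Hz).
window = np.ones(24) / 24
smoothed = np.convolve(spikes, window, mode="same")
```

      In this toy example the spectrum has energy at 1.2 Hz and at 6 Hz and essentially none at an unrelated frequency such as 1.0 Hz, for both the raw and the smoothed train. The binary train is treated like any other sampled signal, and smoothing with a short window leaves the low-frequency peaks nearly unchanged.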

      (b) it is unclear why the onset time was shifted by 33ms; one can measure the phase of the response relative to the cycle onset and use that to estimate the delay between the onset of a stimulus and the onset of the response. Adding phase information will be useful.

      The onset time was shifted by 33 ms because the stimuli are presented with a sinewave contrast modulation (i.e., at 0 ms, the stimulus has 0% contrast). 100% contrast is reached at half a stimulation cycle, which is 83.33 ms here, but a response is likely triggered before reaching 100% contrast. To estimate the delay between the start of the sinewave (0% contrast) and the triggering of a neural response, we tested 7 SEEG participants with the same images presented in FPVS sequences either as a sinewave contrast modulation (black line) or as a squarewave (i.e., abrupt) contrast modulation (red line). The 33 ms value is based on these LFP data obtained in response to sinewave and squarewave stimulation with the same paradigm. This delay corresponds to 4 screen refresh frames (120 Hz refresh rate = 8.33 ms per frame) and 35% of the full contrast, as illustrated below (please see also Retter, T. L., & Rossion, B. (2016). Uncovering the neural magnitude and spatio-temporal dynamics of natural image categorization in a fast visual stream. Neuropsychologia, 91, 9–28).

      Author response image 4.
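      The arithmetic behind the 33 ms shift can be checked with a short calculation. The sketch assumes that the sinewave contrast modulation follows a raised cosine, 0.5 * (1 - cos(2*pi*f*t)), which is 0% at cycle onset and 100% at half a cycle; the exact waveform used in the experiment is not stated in the response, so this form is an assumption.

```python
import math

STIM_RATE_HZ = 6.0     # image presentation rate
REFRESH_HZ = 120.0     # screen refresh rate
SHIFT_FRAMES = 4       # onset shift expressed in refresh frames

def sinewave_contrast(t_s, rate_hz=STIM_RATE_HZ):
    """Contrast in [0, 1] at time t_s after cycle onset (assumed raised cosine)."""
    return 0.5 * (1.0 - math.cos(2.0 * math.pi * rate_hz * t_s))

frame_ms = 1000.0 / REFRESH_HZ                  # about 8.33 ms per frame
shift_ms = SHIFT_FRAMES * frame_ms              # about 33.33 ms
half_cycle_ms = 1000.0 / (2.0 * STIM_RATE_HZ)   # about 83.33 ms
contrast_at_shift = sinewave_contrast(shift_ms / 1000.0)   # about 0.35
```

      Under this assumption, four frames at 120 Hz give about 33.3 ms, at which point the contrast is about 35% of its maximum, matching the values quoted above.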

      (2) Interpretation of suppression:

      The SSVEP paradigm alternates between 2 conditions: faces and objects and has no baseline; In other words, responses to faces are measured relative to the baseline response to objects so that any region that contains neurons that have a lower firing rate to faces than objects is bound to show a lower response in the SSVEP signal. Therefore, because the experiment does not have a true baseline (e.g. blank screen, with no visual stimulation) this experimental design cannot distinguish between lower firing rate to faces vs suppression of response to faces.

      The strongest evidence put forward for suppression is the response of non-visual neurons that was also reduced when patients looked at faces, but since these are non-visual neurons, it is unclear how to interpret the responses to faces.

      We understand this point, but how does the reviewer know that these are non-visual neurons? Because these neurons are located in the visual cortex, they are likely to be visual neurons that are not responsive to non-face objects. In any case, as the reviewer writes, we think it’s strong evidence for suppression.

      We thank all three reviewers for their positive evaluation of our paper and their constructive comments.

    1. Author response:

      Evidence, reproducibility and clarity

      Reviewer 1:

      In this manuscript, the role of the insulin receptor and the insulin-like growth factor receptor was investigated in podocytes. Mice in which both receptors were deleted developed glomerular dysfunction, proteinuria and glomerulosclerosis over several months. Because of concerns about incomplete KO, the authors generated podocyte cell lines in which both receptors were deleted. Loss of both receptors was highly deleterious, with greater than 50% cell death. To elucidate the mechanism, the authors performed global proteomics and found that spliceosome proteins are downregulated. They confirmed this using long-read sequencing. These results suggest a novel role for these pathways in podocytes.

      Thank you

      This is primarily a descriptive study and no technical concerns are raised. The mechanism of how insulin and IGF1 signaling are linked to the spliceosome is not addressed.

      We do not think the paper is descriptive as we used non-biased phospho and total proteomics in the DKO cells to uncover the alterations in the spliceosome (that have not been previously described) that were detrimental. However, we are happy to look further into the underlying mechanism.

      We would propose:

      (1) Stimulating/inhibiting insulin/IGF signalling pathways in the Wild-type and DKO knockout cells and check expression levels and/or phosphorylation status of splice factors (including those in Figure 3E) and those revealed by phospho-proteomic data; a variety of inhibitors of insulin/IGF1 pathways could also be used along the pathways that are shown in Fig 2.

      (2) Looking at the RNA-seq data bioinformatically in more detail. The introns/exons that move up or down are targets of the splice factors involved. The binding sequences of most splice factors are known, so it should be possible to ask bioinformatically which splice factor binding sites are seen most frequently in the sequences around the splice sites of the exons and introns that change in the DKO. To uncover splice factors/RNA-binding proteins (RBPs) that are involved in insulin signaling, we will use a software package named MATT, which was specifically designed to look for RNA-binding motifs (PMID 30010778). In brief, using the long-read sequencing data, we will test 250 nt sequences flanking the splice sites of all regulated splicing events (intronic and exonic) against all RNA-binding proteins in the CISBP-RNA database (PMID 23846655) using MATT. This will result in a list of RBPs potentially involved in insulin signaling. We will validate these by activating insulin signaling (similar to Figures 2B, C) and probing whether the RBPs are activated (e.g., phosphorylated or changed in expression), or we will manipulate expression of the candidate RBPs and measure how they affect insulin signaling.

      (3) Examining the phospho and total proteomic data for IGF1R and Insulin receptor knockout alone podocytes (which we have already generated) and analysing these in more detail and include this data set to elucidate the relative importance of both receptors to spliceosome function.

      The phenotype of the mouse is only superficially addressed. The main issues are that the completeness of the mouse KO is never assessed nor is the completeness of the KO in cell lines. The absence of this data is a significant weakness.

      We apologise for not making this clear, but we did assess the level of receptor knockdown in the animal and cell models. The in vivo model showed variable and incomplete levels of insulin receptor and IGF1 receptor knockdown in podocytes (shown in Supplementary Figure 1B). This is why we made the in vitro floxed podocyte cell lines, in which we could robustly knock down both the insulin receptor and the IGF1 receptor (shown in Figure 2A).

      The mouse experiments would be improved if the serum creatinines were measured to provide some idea how severe the kidney injury is.

      We can address this:

      We have further urinary albumin:creatinine ratio (uACR) data at 12, 16 and 20 weeks. We also have more blood tests of renal function that can be added. There is variability in creatinine levels, which is not uncommon in transgenic mouse models (probably partly due to variability in receptor knockdown with the Cre-lox system). This is part of the rationale for developing the robust double receptor knockout cell models, in which we knocked out both receptors by >80%.

      An attempt to rescue the phenotype by overexpression of SF3B4 would also be useful. If this didn't work, an explanation in the text would suffice.

      We would consider overexpressing SF3B4 in the wild-type and DKO cells and assessing the effects on the spliceosome if deemed necessary. However, we think it is unlikely to rescue the phenotype, as so many other spliceosome components are downregulated in the DKO cells.

      As insulin and IGF are regulators of metabolism, some assessment of metabolic parameters would be an optional add-on.

      We have some data on this and can add it to the manuscript. However, it is not extensive, as metabolism was not a major driver of this work.

      Lastly, the authors should caveat the cell experiments by discussing the ramifications of studying the 50% of the cells that survive vs the ones that died.

      Thank you, we appreciate this. It was the rationale for studying the cells after 2 days of differentiation, before significant cell loss, in order to avoid the issue of studying only the 50% of cells that survive.

      Reviewer 2:

      In this manuscript, submitted to Review Commons (journal agnostic), Coward and colleagues report on the role of the insulin/IGF axis in podocyte gene transcription. They knocked out both the insulin receptor and IGF1R in mice. Dual KO mice manifested a severe phenotype, with albuminuria, glomerulosclerosis, renal failure and death at 4-24 weeks.

      Long-read RNA sequencing was used to assess splicing events. Podocyte transcripts manifesting intron retention were identified. Dual knock-out podocytes manifested more transcripts with intron retention (18%) compared with wild-type controls (14%), with an overlap between experiments of ~30%.

      Transcript productivity was also assessed using FLAIR-mark-intron-retention software. Intron retention was seen in 18% of ciDKO podocyte transcripts compared to 14% of wild-type podocyte transcripts (P=0.004), with an overlap between experiments of ~30% (indicating the variability of results with this method). Interestingly, ciDKO podocytes showed downregulation of proteins involved in spliceosome function and RNA processing, as suggested by LC/MS and confirmed by Western blot.

      Pladienolide (a spliceosome inhibitor) was cytotoxic to HeLa cells and to mouse podocytes but no toxicity was seen in murine glomerular endothelial cells.

      Specific comments.

      The manuscript is generally clear and well-written. Mouse work was approved in advance. The six figures are generally well-designed, bars/superimposed dot-plots.

      Thank you

      Evaluation.

      Methods are generally well described. It would be helpful to say that tissue scoring was performed by an investigator masked to sample identity.

      We did this and will add this information to the methods/figure legend.

      Specific comments.

      (1) Data are presented as mean/SEM. In general, mean/SD or median/IQR are preferred to allow the reader to evaluate the spread of the data. There may be exceptions where only SEM is reasonable.

      Graphs can be changed to SD rather than SEM.

      (2) It would be useful for the reader to be told the number of overlapping genes (with similar expression between mouse groups) and the results of a statistical test comparing WT and KO mice. The overlap of intron retention events between experimental repeats was about 30% in both knock-out podocytes. This seems low and I am curious to know whether this is typical for this method; a reference could be helpful.

      This is an excellent question. We had 30% overlap because the parameters used for analysis were very stringent. We suspect we could get more than 30% overlap by being less stringent while still identifying events that would be considered similar, and we can do this if requested. Our methods were based on FLAIR analysis (PMID: 32188845).

      (3) Please explain "adjusted p value of 0.01." It is not clear how was it adjusted. The number of differentially-expressed proteins between the two cell types was 4842.

      We used the Benjamini-Hochberg method to adjust our data. We think the reviewer is referring to the transcriptomic data and not the proteomic data.
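      For readers unfamiliar with the adjustment, the Benjamini-Hochberg procedure can be sketched in a few lines. This is a generic illustration with made-up p-values, not the analysis code used for the manuscript; the function name is hypothetical.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                          # ascending p-values
    scaled = p[order] * m / np.arange(1, m + 1)    # p_(i) * m / i
    # Enforce monotonicity: each adjusted value is the minimum of the
    # scaled values at its own rank and all higher ranks.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

example = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

      A feature would then be called significant at "adjusted p < 0.01" if its adjusted value falls below that threshold, which controls the expected proportion of false discoveries among the called features.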

      Minor comments

      Page numbers in the text would help the reviewer communicate more effectively with the author.

      We will do this

      Reviewer 3:

      These investigators have previously shown important roles for either insulin receptor (IR) or insulin-like growth factor receptor (IGF1R) in glomerular podocyte function. They now have studied mice with deletion of both receptors and find significant podocyte dysfunction. They then made a podocyte cell line with inducible deletion of both receptors and find abnormalities in transcriptional efficiency with decreased expression of spliceosome proteins and increased transcripts with impaired splicing or premature termination.

      The studies appear to be performed well and the manuscript is clearly written.

      Thank you

      Referees cross-commenting

      I am in agreement with Reviewer 1 that the studies are overly descriptive and do not provide sufficient mechanism and the lack of more investigation of the in vivo model is a significant weakness.

      Please see our responses to reviewer 1 above.

      Significance

      Reviewer 1:

      With the GLP1 agonists providing renal protection, there is great interest in understanding the role of insulin and other incretins in kidney cell biology. It is already known that Insulin and IGFR signaling play important roles in other cells of the kidney. So, there is great interest in understanding these pathways in podocytes. The major advance is that these two pathways appear to have a role in RNA metabolism, the major limitations are the lack of information regarding the completeness of the KO's. If, for example, they can determine that in the mice, the KO is complete, that the GFR is relatively normal, then the phenotype they describe is relatively mild.

      Thank you. The receptor KO in the mice is unlikely to be complete (please see comments above and Supplementary Figure 1B). There are many examples of KO models targeting other tissues showing that complete KO of these receptors is difficult to achieve, particularly for the IGF1 receptor. In the brain (which also consists of terminally differentiated cells; PMID: 28595357), barely 50% IGF1R knockdown was achieved in the target cells. In ovarian granulosa cells (PMID: 28407051), several tissue-specific drivers were tried but none achieved better than 80% knockdown. That paper states that 10% of IGF1R is sufficient for function in these cells, so the authors conclude that their knockdown animals are probably still responding to IGF1. Finally, in our recent IGF1R podocyte knockdown model we found that Cre levels were important for excision of a single floxed gene (PMID: 38706850); hence we were not surprised that excising two floxed genes (insulin receptor and IGF1 receptor) was challenging. This is the rationale for making the double receptor knockout cell lines to understand the biology in more detail.

      Reviewer 2:

      The manuscript is generally clear and well-written. Mouse work was approved in advance. The figures are generally well-designed, bars/superimposed dot-plots.

      Evaluation.

      Methods are generally well described. It would be helpful to say that tissue scoring was performed by an investigator masked to sample identity.

      Thank you we will do this.

      Reviewer 3:

      There are a number of potential issues and questions with these studies.

      (1) For the in vivo studies, the only information given is for mice at 24 weeks of age. There needs to be a full time course of when the albuminuria was first seen and the rate of development. Also, GFR was not measured. Since the podocin-Cre utilized was not inducible, there should be a determination of whether there was a developmental defect in glomeruli or podocytes. Were there any differences in either prenatal or postnatal development or in the number of glomeruli?

      Thank you, we will add further phenotyping data. We do not think there was a major developmental phenotype, as albuminuria did not become significantly different until several months of age. We could have used a doxycycline-inducible model, but we know the excision efficiency is much lower than in the podocin-Cre driven model (Supplementary Figure 1). This would likely give a very mild (if any) phenotype and not reveal the biology adequately.

      (2) Although the in vitro studies are of interest, there are no studies to determine if this is the underlying mechanism for the in vivo abnormalities seen in the mice. Cultured podocytes may not necessarily reflect what is occurring in podocytes in vivo.

      Thank you for this. We are happy to employ immunohistochemistry (IHC) and immunofluorescence (IF) using spliceosome antibodies on tissue sections from DKO and control mice to examine spliceosome changes. However, as the DKO results in podocyte loss, there may not be many DKO podocytes still present in the tissue sections. This will be taken into consideration.

      (3) Given that both receptors are deleted in the podocyte cell line, it is not clear if the spliceosome defect requires deletion of both receptors or if there is redundancy in the effect. The studies need to be repeated in podocyte cell lines with either IR or IGFR single deletions.

      Thank you. We have full total and phospho-proteomic data sets from single insulin receptor and IGF1 receptor knockout cell lines that we will investigate for this point.

      (4) There are no studies investigating signaling mechanisms mediating the spliceosome abnormalities.

      Thank you. As outlined above in our response to reviewer 1 (point 1), we are very happy to investigate insulin/IGF signalling pathways in more detail.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Manuscript number: RC-2025-02946

      Corresponding author(s): Margaret, Frame

      Roza, Masalmeh

      1. General Statements [optional]

      We thank the reviewers for recognizing the significance of our work and for their constructive feedback and suggestions, most of which we have implemented in our revised manuscript.

      2. Point-by-point description of the revisions

      Reviewer #1

      Evidence, reproducibility and clarity

      Review of Masalmeh et al. Title: "FAK modulates glioblastoma stem cell energetics..."

      Previous studies have implicated FAK and the related tyrosine kinase PYK2 in glioblastoma growth, cell migration, and invasion. Herein, using a murine stem cell model of glioblastoma, the authors used CRISPR to inactivate FAK, selected and cloned FAK-null cells, and re-expressed murine FAK in the FAK-null cells by lentiviral transduction (termed FAK Rx). FAK-/- cells were shown to possess epithelial characteristics, whereas FAK Rx cells expressed mesenchymal markers and showed increased cell migration/invasion in vitro. Comparisons between FAK-/- and FAK Rx cells showed that FAK re-expression increased mitochondrial respiration and amino acid uptake. This was associated with FAK Rx cells exhibiting filamentous mitochondrial morphology (potentially an OXPHOS phenotype) and decreased levels of MTFR1L S235 phosphorylation (implicated in fragmented mitochondrial morphology). The mitochondrial and epithelial cell morphology of FAK-/- cells was reversed by treatment with Rho-kinase inhibitors, which also increased mitochondrial metabolism and cell viability. Last, FAK-dependent glioblastoma tumor growth was shown by comparing FAK-/- and FAK Rx cells in implantation studies.

      The studies by Masalmeh provide interesting findings associating FAK expression with changes in mitochondrial morphology, energy metabolism, and glutamate uptake. According to the authors' model, FAK expression is supporting a glioblastoma stem cell like phenotype in vitro and tumor growth in vivo. What remains unclear is the mechanistic connection to cell changes and whether or not these are dependent on intrinsic FAK activity or, as the Frame group has previously published, potentially FAK nuclear localization. The associations with MTFR1L phosphorylation and effects by Rho kinase inhibition are likely indirect and remind this reviewer of long-ago studies with FAK-null fibroblasts that exhibit epithelial characteristics, still express PYK2, and exhibit elevated RhoA GTPase activity. Some of these phenotypes were linked to changes in RhoGEF and RhoGAP signaling with FAK and/or Pyk2. At a minimum, it would be informative to know whether Pyk2 signaling is relevant for observed phenotypes and whether the authors can further support their associations with FAK-targeted or FAK-Pyk2-targeted inhibitors or PROTACs.

      Some questions that would enhance potential impact. 1. Cell generation. Please describe the analysis of FAK-/- clones in more detail. The "low viability" phenotype needs further explanation with regard to clonal expansion and growth characteristics?

      Response:

      • We included a better description and a supplementary figure in our revised manuscript to indicate that we have examined several FAK -/- clones and confirmed that our observations were not due to clonal variation; multiple clones displayed similar morphological changes (Figure S1D). We also show that the elongated mesenchymal-like morphology was observed at 48 h after nucleofecting the cells with the FAK‑expressing vector, before beginning G418 selection to enrich for cells expressing FAK (Figure S1C). We also included experiments to acutely modulate FAK signalling (detaching and seeding cells on fibronectin) (Figure S2D, E, F and Figure S3) to exclude the possibility that the profound effects are due to protocols/selection we used for generating FAK-deleted cells.
      • Regarding the term “low viability”, we have clarified in the text that there is no significant difference in cell number (Figure S1A) or ‘cell viability’ when it is assessed by trypan blue exclusion (a non-mitochondria-dependent read-out) (Figure S1B) between FAK-expressing FAK Rx and FAK-/- cells cultured for three days under normal conditions. Therefore, we agree the term ‘cell viability’ in this context could be confusing and have replaced “cell viability” with “metabolic activity as measured by Alamar Blue” in Figure 1D and Figure 5B, and in the corresponding text of the original manuscript. This wording more accurately reflects the data.

      Figure 1F: need further support of MET change upon FAK KO and EMT reversion.

      Response: We have added a heatmap (Figure S1E) illustrating the changes in protein expression of core-enriched EMT/MET gene products (by proteomics) after FAK gene deletion (EMT genes as defined in Howe et al., 2018); this strengthens the conclusion that the MET reversion morphological phenotype is accompanied by recognised MET protein changes.

      Fig. 2: Need further support if FAK effects impact glycolysis or oxidative phosphorylation in particular as implicated by the stem cell model.

      Response: We show that FAK impacts both glycolysis (Figure 2A, 2E, and 2F) and mitochondrial oxidative phosphorylation on the basis of the oxygen consumption rate (OCR) (Figure 2B, and 2D), showing both are contributing pathways to FAK-dependent energy production. We have clarified this in the text.

      Is there a combinatorial potential between FAKi and chemotherapies used for glioblastoma? Need to build upon past studies.

      Response: Yes, previous studies suggest that inhibiting FAK can sensitize GBM cells to chemotherapy (Golubovskaya et al., 2012; Ortiz-Rivera et al., 2023). We have included a paragraph in the discussion section to make sure this is clearer. Although it is not the subject of this study, we appreciate it is useful context.

      The notation of changes in glucose transporter expression should be followed up with regard to the potential that FAK-expressing cells may have different uptake of carbon sources and other amino acids. Altered uptake could be one potential explanation for increased glycolysis and glutamine flux.

      Response: We agree with the reviewer that glucose uptake could be contributing, and we include data showing that two glucose transporters are indeed FAK-regulated, namely glucose transporter 1 (GLUT1, encoded by the Slc2a1 gene) and glucose transporter 3 (GLUT3, encoded by the Slc2a3 gene) (shown in Figure S2B and C).

      It would be helpful to support the confocal microscopy of mitos with EM.

      Response:

      We are concerned, based on our experience, that electron microscopy (EM) may introduce artefacts during sample preparation. In contrast, immunofluorescence sample preparation is less susceptible to artefacts. The SORA system we used is not a conventional point-scanning confocal microscope, but a super-resolution module based on a spinning disk confocal platform (CSU-W1; Yokogawa) that uses optical pixel reassignment with confocal detection. This method enhances resolution in all dimensions, with resolution in our samples measured at 120 nm. This has been instructive in defining a new level of changes in mitochondrial morphology upon FAK gene deletion.

      Lack of FAK expression with increased MTFR1 phosphorylation is difficult to interpret.

      Response: We do not directly show that this phosphorylation event is causal in our experiments; however, we think it important to document this change, since phosphorylation of MTFR1L has been causally linked in other systems to the mitochondrial morphology we observed (Tilokani et al., 2022).

      Need to have better support between loss of FAK and the increase in Rho signaling. Use of Rho kinase inhibitors is very limited and the context to FAK (and or Pyk2) remains unclear. Past studies have linked integrin adhesion to ECM as a linkage between FAK activation and the transient inhibition of RhoA GTP binding. Is integrin signaling and FAK involved in the cell and metabolism phenotypes in this new model?

      Response: To better support the antagonistic effect of FAK on Rho-kinase (ROCK) signalling, we included a new experiment in which the integrin-FAK signalling pathway was disrupted by treating FAK WT cells with Accutase, an agent that causes detachment from the substratum, and growing the cells in suspension in laminin-free medium. We present ROCK activity data, as judged by phosphorylated MLC2 at serine 19 (pMLC2 S19), and relate this to FAK phosphorylation at Y397 (a surrogate for FAK activity), which is suppressed after integrin disengagement. These measurements were compared with conditions in which integrin-FAK signalling is activated by growing the cells on laminin-coated surfaces. We observed a time-dependent decrease in pFAK(Y397) levels (normalised to total FAK) in suspended cells compared to those spread on laminin, while pMLC2(S19) levels increased in a reciprocal manner over time in detached cells relative to spread cells (Figure S4A and B). There is therefore an inverse relationship between integrin-FAK signalling and ROCK-MLC2 activity, consistent with findings from the FAK gene deletion experiments. The detachment experiment does not rely on gene-deletion cell clones.

      Significance

      The studies by Masalmeh provide interesting findings associating FAK expression with changes in mitochondrial morphology, energy metabolism, and glutamate uptake. According to the authors' model, FAK expression is supporting a glioblastoma stem cell like phenotype in vitro and tumor growth in vivo. What remains unclear is the mechanistic connection to cell changes and whether or not these are dependent on intrinsic FAK activity or, as the Frame group has previously published, potentially FAK nuclear localization. The associations with MTFR1L phosphorylation and effects by Rho kinase inhibition are likely indirect and remind this reviewer of long-ago studies with FAK-null fibroblasts that exhibit epithelial characteristics, still express PYK2, and exhibit elevated RhoA GTPase activity. Some of these phenotypes were linked to changes in RhoGEF and RhoGAP signaling with FAK and/or Pyk2. At a minimum, it would be informative to know whether Pyk2 signaling is relevant for observed phenotypes and whether the authors can further support their associations with FAK-targeted or FAK-Pyk2-targeted inhibitors or PROTACs.

      Response:

      Deleting the gene encoding FAK in mouse embryonic fibroblasts leads to elevated Pyk2 expression (Sieg, 2000). However, in the GBM stem cell model we used here, Pyk2 was not expressed (determined by both transcriptomics and proteomics). We have included Figure S1F to show that Pyk2 expression was undetectable in FAK -/- and FAK Rx cells at the RNA level. We conclude that there is no compensatory increase in Pyk2 upon FAK loss in these cells. In the transformed neural stem cell model of GBM, we do not consistently or robustly detect nuclear FAK.

      Review #2

      Masalmeh and colleagues employ a neural stem/progenitor cell-based glioma model (NPE cells) to investigate the role of Focal Adhesion Kinase (FAK) in GBM, with a focus on potential links between the regulation of morphological/adhesive and metabolic GBM cell properties. For this, the authors employ wt cells alongside newly generated FAK-KO and -reexpressing cells, as well as pharmacological interventions to probe the relevance of specific signaling pathways. The authors' main claims are that FAK crucially modulates glioma cell morphology, cell-cell and cell-substrate interactions and motility, as well as their metabolism, and that these effects translate to changes to relevant in vivo properties such as invasion and tumor growth.

      My main issues are with the model chosen by the authors.

      As per the methods section, generation of FAK-KO and -"Rx" NPE cells entailed protracted selection/expansion processes, which may have resulted in inadvertent selection for cellular/molecular properties unrelated to the desired one (loss or gain of FAK expression) and which may have had cascading effects on NPE cells. The authors nonetheless repeatedly claim the parameters they quantify, such as mitochondrial or cytoskeletal properties or metabolic features, to have directly resulted from FAK loss or reintroduction. Examples of such causal inferences are to be found in lines 123, 134/135, 165, 181. Such causal claims are, in my view, unsupported.

      Acute perturbation of FAK expression/activity, genetically or pharmacologically, followed by a rapid assessment of the processes under investigation, would be needed to begin to assess causality, even if acute genetic perturbations may be technically challenging as sufficient gene expression reduction or restoration to physiologically relevant levels may be hard to achieve.

      Response:

      We would like to first comment on the model we used here, which we think will clarify the validity of our approach. The model is a transformed stem cell model of GBM that was published in Gangoso et al. (Cell, 2021) and is now used regularly in the GBM field. As mentioned in the response to Reviewer 1, we have added text (pages 4 and 5 in the revised manuscript) and a new supplementary figure (Figure S1D) clarifying that the morphological changes we observed were consistent across multiple FAK -/- clones, showing this was not due to any inter-clonal variability. We also added images showing that the morphological changes were apparent at 48 h after nucleofecting FAK -/- cells with the FAK‑expressing vector specifically (not the empty vector), prior to starting G418 selection to enrich for FAK‑expressing cells (Figure S1C), addressing the concern that clonal variation and selection were the cause of the FAK-dependent phenotypes we observed. We believe that our model provides a well-controlled, clean genetic cancer cell system of a type that is commonly used in cancer cell biology, allowing us to attribute phenotypes to individual proteins.

      We have also carried out a more acute treatment by using the FAK inhibitor VS4718 to perturb FAK kinase activity and assessed the effects on glycolysis and glutamine oxidation after 48h treatment (Figure S2D, E and F). We found that treating the transformed neural stem cells (parental population) with FAK inhibitor (300nM VS4718) decreases glucose incorporation into glycolysis intermediates and glutamine incorporation into TCA cycle intermediates, consistent with a role for FAK’s kinase activity in maintaining glycolysis and glutamine oxidation.

      The employed pharmacological modulation of ROCK activity is the only approach that, given the presumably acute nature of the treatment, may have allowed the authors to probe the proposed functional links. The methods section of the manuscript does not however comprise details as to the duration of these treatments, which leaves open the possibility of long-term treatment having been carried out (data shown in Figure 5B refers to 72hr treatment).

      Response:

      We have added the duration of the treatment to the Methods section and Figure Legends, to clarify that cells were treated with ROCK inhibitors for 24h before assessing the effects on mitochondria (Figure 4C, D, S4C and D) and glutamine oxidation (Figure 5A, and S5). For metabolic activity by AlamarBlue assay, cells were treated with ROCK inhibitors for 72h (Figure 5B).

      Even in the case of ROCK inhibitor experiments, it is however unclear if and how the effects on cell morphology and adhesion, mitochondrial organization and metabolic activity may be connected to each other and, if at all, to FAK expression.

      Given the above uncertainties due to the nature of the model and experimental approaches, it is hard to assess the reliability and thus the relevance of the findings.

      Response:

      FAK suppresses ROCK activity (as judged by pMLC2 S19, Figure 4A and B). Treating FAK -/- cells with two different ROCK inhibitors restored mesenchymal-like cell morphology, mitochondrial morphology and glutamine oxidation. As mentioned above, to strengthen our evidence for the antagonistic role of FAK in ROCK-MLC2 signalling, we have now introduced an experiment in which integrin-FAK signalling was disrupted by treatment with a detachment agent (Accutase) and subsequent maintenance of the cells in suspension in laminin-free medium. We assessed pMLC2 S19 levels (a measure of ROCK activity) and related them to FAK phosphorylation, which is suppressed after integrin disengagement. These results were evaluated relative to spread wild-type cells growing on laminin, where integrin-FAK signalling was active (Figure S4A and B). We observed an inverse relationship between integrin-FAK signalling and ROCK-MLC2 activity, in keeping with our conclusions (Figure 4A and B).

      Experimental support for the ability of cell-substrate interaction modulation to concomitantly impact cellular metabolism and motility/invasion would be significant both in terms of advancing our understanding of glioma cell biology and of its translational potential, but the evidence being provided is at best compatible with the proposed model.

      Response: We carried out a new experiment to support the ability of cell-substrate interaction modulation to impact metabolism; specifically, we inhibited cell-substrate interactions by plating the cells on Poly-2-hydroxyethyl methacrylate (Poly 2-HEMA)-coated dishes. This suppressed FAK phosphorylation at Y397, as expected, with concomitant reduction in glutamine utilisation in the TCA cycle (Figure S3A, B and C).

      My background/expertise is in developmental and adult neurogenesis, in vivo modelling of gliomagenesis and cell fate control/reprogramming, with a focus on molecular mechanisms of differentiation and quantitative aspects of lineage dynamics; molecular details of the control of cellular metabolism, cell-cell adhesion and cytoskeletal dynamics are not core expertise of mine.

      We appreciate that this reviewer’s expertise is not necessarily in the cancer cell biology and genetic intervention aspects of our study. We hope that the explanations we have provided satisfy the reviewer that our conclusions are valid.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      In this study, Ma et al. aimed to determine previously uncharacterized contributions of tissue autofluorescence, detector afterpulse, and background noise on fluorescence lifetime measurement interpretations. They introduce a computational framework they named "Fluorescence Lifetime Simulation for Biological Applications (FLiSimBA)" to model experimental limitations in Fluorescence Lifetime Imaging Microscopy (FLIM) and determine parameters for achieving multiplexed imaging of dynamic biosensors using lifetime and intensity. By quantitatively defining sensor photon effects on signal-to-noise in either fitting or averaging methods of determining lifetime, the authors contradict any claims of FLIM sensor expression insensitivity to fluorescence lifetime and highlight how these artifacts occur differently depending on the analysis method. Finally, the authors quantify how statistically meaningful experiments using multiplexed imaging could be achieved. 

      A major strength of the study is the effort to present results in a clear and understandable way given that most researchers do not think about these factors on a day-to-day basis. The model code is available and written in Matlab, which should make it readily accessible, although a version in other common languages such as Python might help with dissemination in the community. One potential weakness is that the model uses parameters that are determined in a specific way by the authors, and it is not clear how vastly other biological tissue and microscope setups may differ from the values used by the authors.

      Overall, the authors achieved their aims of demonstrating how common factors (autofluorescence, background, and sensor expression) will affect lifetime measurements and they present a clear strategy for understanding how sensor expression may confound results if not properly considered. This work should bring to awareness an issue that new users of lifetime biosensors may not be aware of and that experts, while aware, have not quantitatively determined the conditions where these issues arise. This work will also point to future directions for improving experiments using fluorescence lifetime biosensors and the development of new sensors with more favorable properties.

      We appreciate the comments and helpful suggestions. We now also include FLiSimBA simulation code in Python in addition to Matlab to make it more accessible to the community.

      One advantage of FLiSimBA is that the simulation package is flexible and adaptable, allowing users to input parameters based on the specific sensors, hardware, and autofluorescence measurements for their biological and optical systems. In this manuscript, we used as examples parameters based on a FRET-based sensor, measured autofluorescence from mouse tissue, and the measured dark count/afterpulse of our specific GaAsP PMT. In Discussion and Materials and methods, we now emphasize this advantage and further clarify how these parameters can be adapted to diverse tissues, imaging systems, and sensors based on individual experiments. We further explain that these input parameters will not affect the conclusions of our study, but the specific input parameters would alter the quantitative thresholds.
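      As a rough illustration of how such user-supplied inputs could enter a simulation, the following Python sketch draws one photon-arrival histogram from three sources. It is not the FLiSimBA code: the function names, the decay constants, the photon counts, the single-exponential autofluorescence, and the treatment of afterpulse and background as time-independent counts are all assumptions made for this example, and convolution with the instrument response is omitted.

```python
import numpy as np

N_CHANNELS = 256
DT = 12.5 / N_CHANNELS                 # channel width in ns for a 12.5 ns laser cycle
t = (np.arange(N_CHANNELS) + 1) * DT   # channel times in ns

def decay_pdf(taus, fractions):
    """Normalised multi-exponential decay evaluated on the histogram channels."""
    shape = sum(f * np.exp(-t / tau) for tau, f in zip(taus, fractions))
    return shape / shape.sum()

def simulate_histogram(rng, n_sensor, n_autof, mean_background,
                       sensor_pdf, autof_pdf):
    """Draw one histogram from sensor, autofluorescence, and background photons."""
    sensor = rng.multinomial(n_sensor, sensor_pdf)
    autof = rng.multinomial(n_autof, autof_pdf)
    # Assumption: afterpulse and background are spread evenly over all channels.
    background = rng.poisson(mean_background / N_CHANNELS, size=N_CHANNELS)
    return sensor + autof + background

rng = np.random.default_rng(0)
sensor_pdf = decay_pdf([2.14, 0.69], [0.5, 0.5])  # assumed sensor decay constants (ns)
autof_pdf = decay_pdf([1.5], [1.0])               # hypothetical autofluorescence decay
hist = simulate_histogram(rng, 100_000, 5_000, 2_000, sensor_pdf, autof_pdf)
```

      Swapping in a different sensor, tissue, or detector then amounts to changing the inputs to `decay_pdf` and `simulate_histogram`, which is the kind of adaptability described above.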

      Reviewer #2 (Public review): 

      Summary: 

      By using simulations of common signal artefacts introduced by acquisition hardware and the sample itself, the authors are able to demonstrate methods to estimate their influence on the estimated lifetime, and lifetime proportions, when using signal fitting for fluorescence lifetime imaging. 

      Strengths: 

      They consider a range of effects such as after-pulsing and background signal, and present a range of situations that are relevant to many experimental situations. 

      Weaknesses: 

      A weakness is that they do not present enough detail on the fitting method that they used to estimate lifetimes and proportions. The method used will influence the results significantly. They seem to only use the "empirical lifetime" which is not a state of the art algorithm. The method used to deconvolve two multiplexed exponential signals is not given. 

      We appreciate the comments and constructive feedback. Our revision based on the reviewer’s suggestions has made our manuscript clearer and more user friendly. We originally described the detail of the fitting methods in Materials and methods. Given the importance of these methodological details for evaluating the conclusions of this study, we have moved the description of the fitting method from Materials and methods to Results. In addition, we provide further clarification and more details of the rationale of using these different methods of lifetime estimates in Discussion to aid users in choosing the best metric for evaluating fluorescence lifetime data.

      More specifically, we modified our writing to highlight the following.

      (1) In Results, we describe that lifetime histograms were fitted to Equation 3 with the Gauss-Newton nonlinear least-square fitting algorithm and the fitted P<sub>1</sub> was used as lifetime estimation.

      (2) In Results, we clarify that our simulation of multiplexed imaging was modeled with two sensors, each displaying a single exponential decay, but the two sensors have different decay constants. We also describe that Equation 3 with the Gauss-Newton nonlinear least-square fitting algorithm was used to deconvolve the two multiplexed exponential signals (Fig. 8).
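      The fitting step described in (1) and (2) can be sketched in Python as follows. This is a minimal illustration rather than the FLiSimBA implementation: Equation 3 is not reproduced in this response, so the model below is an assumed double exponential without instrument-response convolution, F(t) = F0 [P1 exp(-t/τ1) + (1 - P1) exp(-t/τ2)], with τ1 = 2.14 ns and τ2 = 0.69 ns assumed and held fixed. The function names and starting values are hypothetical.

```python
import numpy as np

TAU1, TAU2 = 2.14, 0.69  # ns; assumed fixed decay constants

def model(t, f0, p1):
    """Double exponential with amplitude f0 and slow-component fraction p1."""
    return f0 * (p1 * np.exp(-t / TAU1) + (1.0 - p1) * np.exp(-t / TAU2))

def jacobian(t, f0, p1):
    """Partial derivatives of the model with respect to (f0, p1)."""
    e1, e2 = np.exp(-t / TAU1), np.exp(-t / TAU2)
    return np.column_stack([p1 * e1 + (1.0 - p1) * e2, f0 * (e1 - e2)])

def gauss_newton_fit(t, counts, f0=None, p1=0.5, n_iter=50, tol=1e-10):
    """Fit (f0, p1) by plain Gauss-Newton iterations on the residuals."""
    f0 = float(counts.max()) if f0 is None else f0
    for _ in range(n_iter):
        r = counts - model(t, f0, p1)
        J = jacobian(t, f0, p1)
        # Least-squares solution of J @ step = r gives the Gauss-Newton update.
        step, *_ = np.linalg.lstsq(J, r, rcond=None)
        f0, p1 = f0 + step[0], p1 + step[1]
        if np.max(np.abs(step)) < tol:
            break
    return f0, p1

# Noise-free synthetic histogram: 256 channels over a 12.5 ns laser cycle.
t = np.arange(256) * (12.5 / 256)
counts = model(t, 1000.0, 0.6)
f0_fit, p1_fit = gauss_newton_fit(t, counts)
```

      For two multiplexed sensors that each decay as a single exponential, the same two-parameter form applies, with P1 read as the fraction of photons attributed to the first sensor.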

      Reviewer #3 (Public review): 

      Summary: 

      This study presents a useful computational tool, termed FLiSimBA. The MATLAB-based FLiSimBA simulations allow users to examine the effects of various noise factors (such as autofluorescence, afterpulse of the photomultiplier tube detector, and other background signals) and varying sensor expression levels. Under the conditions explored, the simulations unveiled how these factors affect the observed lifetime measurements, thereby providing useful guidelines for experimental designs. Further simulations with two distinct fluorophores uncovered conditions in which two different lifetime signals could be distinguished, indicating multiplexed dynamic imaging may be possible. 

      Strengths: 

      The simulations and their analyses were done systematically and rigorously. FLiSimBA can be useful for guiding and validating fluorescence lifetime imaging studies. The simulations could define useful parameters such as the minimum number of photons required to detect a specific lifetime, how sensor protein expression level may affect the lifetime data, the conditions under which the lifetime would be insensitive to the sensor expression levels, and whether certain multiplexing could be feasible.

      Weaknesses: 

      The analyses have relied on a key premise that the fluorescence lifetime in the system can be described as two-component discrete exponential decay. This means that the experimenter should ensure that this is the right model for their fluorophores a priori and should keep in mind that the fluorescence lifetime of the fluorophores may not be perfectly described by a two-component discrete exponential (for which alternative algorithms have been implemented: e.g., Steinbach, P. J. Anal. Biochem. 427, 102-105, (2012)). In this regard, I also couldn't find how good the fits were for each simulation and experimental data to the given fitting equation (Equation 2, for example, for Figure 2C data).

      We thank the reviewer for the constructive feedback. We agree that FLiSimBA users should ensure that the right decay equations are used to describe the fluorescent sensors. In this study, we used a FRET-based PKA sensor, FLIM-AKAR, to provide a proof-of-principle demonstration of the capability of FLiSimBA. The donor fluorophore of FLIM-AKAR, truncated monomeric enhanced GFP, displays a single exponential decay. FLIM-AKAR, a FRET-based sensor, displays a double exponential decay. The time constants of the two exponential components were determined and reported previously (Chen et al., Neuron (2017)). Thus, a double exponential decay equation with known τ<sub>1</sub> and τ<sub>2</sub> was used for both simulation and fitting. The goodness of fit is now provided in Supplementary Fig. 1 for both simulated and experimental data. In addition to referencing our prior study characterizing the double exponential decay model of FLIM-AKAR in Materials and methods, we have emphasized in Discussion the versatility of FLiSimBA to adapt to different sensors, tissues, and analysis methods, and the importance of using the right mathematical models to describe the fluorescence decay of specific sensors.

      Also, in Figure 2C, the 'sensor only' simulation without accounting for autofluorescence (as seen in Sensor + autoF) or afterpulse and background fluorescence (as seen in Final simulated data) seems to recapitulate the experimental data reasonably well. So, at least in this particular case where experimental data is limited by its broad spread with limited data points, being able to incorporate the additional noise factors into the simulation tool didn't seem to matter too much.  

      In the original Fig 2C, the sensor fluorescence was much higher than the contributions from autofluorescence, afterpulse, and background signals, resulting in minimal effects of these other factors, as the reviewer noted. This original figure was based on photon counts from single neurons expressing FLIM-AKAR. For the rest of the manuscript, photon counts were based on whole fields of view (FOV). Since the FOV includes cells that do not express fluorescent sensors, the influence of autofluorescence, dark currents, and background is much more pronounced, as shown in Fig. 2B. 

      Both approaches – using photon counts from the whole FOV or from individual neurons – have their justifications. Photon counts from the whole FOV simulate data from fluorescence lifetime photometry (FLiP), whereas photon counts from individual neurons simulate data from fluorescence lifetime imaging microscopy (FLIM). However, the choice of approach does not affect the conclusions of the manuscript, as a range of photon count values are simulated. To maintain consistency throughout the manuscript, we have revised the photon counts in this figure (now Supplementary Fig. 1C) to match those from the whole FOV.

      Additionally, we have made some modifications in our analyses of Supplementary Fig. 1C and Fig. 2B, detailed in the “FLIM analysis” section of Materials and methods. For instance, to minimize system artifact interference at the histogram edges, we now use a narrower time range (1.8 to 11.5 ns) for fitting and empirical lifetime calculation.

      Reviewer #1 (Recommendations for the authors): 

      (1) The authors report how autofluorescence was measured from "imaged brain slices from mice at postnatal 15 to 19 days of age without sensor expression." However, it remains unclear how many acute slices and animals were used (for example, were all 15um x 15um FOV from a single slice) and if mouse age affects autofluorescence quantification. Furthermore, would in vivo measurements have different autofluorescence conditions given that blood flow would be active? It would help if the authors more clearly explained how reliable their autofluorescence measurement is by clarifying how they obtained it, whether this would vary across brain areas, and whether in vitro vs in vivo conditions would affect autofluorescence. 

      We have added description in Materials and methods that for autofluorescence ‘Fluorescence decay histograms from 19 images of two brain slices from a single mouse were averaged.’ We have added in Discussion that users should carefully ‘measure autofluorescence that matches the age, brain region, and data collection conditions (e.g., ex vivo or in vivo) of their tissue…’, and emphasize that FLiSimBA offers customization of inputs, and it is important for users to adapt the inputs such as autofluorescence to their experimental conditions. We also clarify in Discussion that the change of input parameters such as autofluorescence across age and brain region would not affect the general insights from this study, but will affect quantitative values.

      (2) Does sensor expression level issues arise more with in-utero electroporation compared to AAV-based delivery of biosensors? A brief comment on this in the discussion may help as most users in the field today may be using AAV strategies to deliver biosensors.

      In our experience, in-utero electroporation results in higher sensor expression than AAV-based delivery, and so poses less concern for expression-level dependence. However, both delivery methods can result in expression level dependence, especially with a sensor that is not bright. We have added in Discussion ‘For a sensor with medium brightness delivered via in utero electroporation, adeno-associated virus, or as a knock-in gene, the brightness may not always fall within the expression level-independent regime.’

      (3) Figure 1. Should the x-axis on the top figures be "Time (ns)" instead of "Lifetime (ns)"?

      Similarly in Figure 8A&B, wouldn't it make more sense to have the x-axis be Time not Lifetime?

      The x-axis labels in Fig. 1 and Fig. 8A-8B have been changed to ‘Time (ns)’.   

      (4) Figure 2b: why is the empirical lifetime close to 3.5ns? Shouldn't it be somewhere between 2.14 and 0.69?

      In our empirical lifetime calculation, we did not set the peak channel to have a time of 0.0488 ns (i.e. the laser cycle 12.5 ns divided by 256 time channels). Rather, we set the first time channel within a defined calculation range (i.e. 1.8 ns in Supplementary Fig. 1B) to have a time of 0.0488 ns. Thus, the empirical lifetime exceeds 2.14 ns and depends on the time range of the histogram used for calculation.

      For Fig. 2B and Supplementary Fig. 1C, we have now adjusted the range to 1.8-11.5 ns to eliminate FLIM artifacts at the histogram edges in our experimental data, resulting in an empirical lifetime around 2.255 ns. In contrast, the range for calculating the empirical lifetime of simulated data in the rest of the study (e.g. Fig. 4D) is 0.489-11.5 ns, yielding a larger lifetime of ~3.35 ns. 

      We have clarified these details and our rationale in Materials and methods.
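      The dependence of the empirical lifetime on the calculation range can be illustrated with a short Python sketch. It assumes that the empirical lifetime is the photon-weighted mean arrival time, with the first channel inside the chosen range assigned one channel width (12.5 ns / 256, about 0.0488 ns) as described above. The histogram is a hypothetical noise-free single exponential with an onset at 1.8 ns, not the measured FLIM-AKAR data, and the function name is illustrative.

```python
import numpy as np

N_CHANNELS = 256
DT = 12.5 / N_CHANNELS  # channel width in ns (about 0.0488 ns)

def empirical_lifetime(hist, t_start, t_end, dt=DT):
    """Photon-weighted mean arrival time within [t_start, t_end].

    Time is counted from the start of the range: the first channel inside
    the range is assigned dt, the second 2*dt, and so on.
    """
    centers = (np.arange(hist.size) + 1) * dt       # absolute channel times
    in_range = (centers >= t_start) & (centers <= t_end)
    counts = hist[in_range]
    rel_time = (np.arange(counts.size) + 1) * dt    # time relative to range start
    return float(np.sum(counts * rel_time) / np.sum(counts))

# Hypothetical histogram: no photons before 1.8 ns, then a 2 ns exponential decay.
t_abs = (np.arange(N_CHANNELS) + 1) * DT
hist = np.where(t_abs >= 1.8, np.exp(-(t_abs - 1.8) / 2.0), 0.0)

narrow = empirical_lifetime(hist, 1.8, 11.5)    # range starts at the decay onset
wide = empirical_lifetime(hist, 0.489, 11.5)    # range starts before the onset
```

      In this toy case `wide` exceeds `narrow` by the time spanned by the empty channels between the two range starts, which is why the same decay can report different empirical lifetimes under different calculation ranges.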

      (5) Figure 2b: how come the afterpulse+background contributes more to the empirical lifetime than the autofluorescence (shorter lifetime). This was unclear in the results text why autofluorescence photons did not alter empirical lifetime as much as did the afterpulse/background.

      With a histogram range from 1.8 ns to 11.5 ns used in Fig. 2B, the empirical lifetimes for FLIM-AKAR sensor fluorescence, autofluorescence, and background/afterpulse are 2-2.3 ns, around 1.69 ns, and around 4.90 ns, respectively. Because the background/afterpulse value differs more from that of FLIM-AKAR sensor fluorescence, afterpulse+background has a larger influence on the empirical lifetime than autofluorescence. We have added an explanation of this in Results.
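      This reasoning can be checked with a small Python sketch. It assumes that the empirical lifetime is a mean photon arrival time, so that the value for a mixed histogram is the photon-weighted average of the component values. The component lifetimes are the ones quoted above (2.2 ns is picked from the 2-2.3 ns range); the photon counts are hypothetical and are set equal for the two contaminants only to isolate the effect of the lifetime difference.

```python
def mixed_empirical_lifetime(components):
    """Photon-weighted mean of component empirical lifetimes.

    components: list of (photon_count, empirical_lifetime_ns) pairs.
    """
    total = sum(n for n, _ in components)
    return sum(n * tau for n, tau in components) / total

SENSOR = (100_000, 2.2)     # hypothetical count; lifetime within the quoted range
AUTOF = (5_000, 1.69)       # hypothetical count; quoted autofluorescence value
BACKGROUND = (5_000, 4.90)  # hypothetical count; quoted background/afterpulse value

sensor_only = mixed_empirical_lifetime([SENSOR])
shift_autof = mixed_empirical_lifetime([SENSOR, AUTOF]) - sensor_only
shift_background = mixed_empirical_lifetime([SENSOR, BACKGROUND]) - sensor_only
```

      With equal contaminating photon counts, the shift scales with the distance between the contaminant's empirical lifetime and the sensor's, so the background/afterpulse term moves the result further (and upward) than autofluorescence does (downward).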

      (6) One overall suggestion for an improvement that could help active users of lifetime biosensors understand the consequences would be to show either a real or simulated example of a "typical experiment" conducted using FLIM-AKAR and how an incorrect interpretation could be drawn as a consequence of these artifacts. For example, do these confounds affect experiments involving comparisons across animals more than within-subject experiments such as washing a drug onto the brain slice, and the baseline period is used to normalize the change in signal? I think this type of direct discussion will help biosensor users more deeply grasp how these factors play out in common experiments being conducted.

      We have added the following in Discussion, ‘…While this issue is less problematic when the same sample is compared over short periods (e.g. minutes), it can lead to misinterpretation when comparison is made across chronic time periods or between samples with different sensor expression levels. For example, apparent changes in fluorescence lifetime observed over days, across cell types, or subcellular compartments may actually reflect variations in sensor expression levels rather than true differences in biological signals (Fig. 6). Therefore, considering biologically realistic factors in FLiSimBA is essential, as it qualitatively impacts the conclusions.’

      Reviewer #2 (Recommendations for the authors): 

      The paper would be improved with more detail on the fitting methods, and the use of state-of-the-art methods. Consult for example the introduction of this paper where many methods are listed: https://www.mdpi.com/1424-8220/22/19/7293

      We have moved the description of the Gauss-Newton nonlinear least-squares fitting algorithm from Materials and methods to Results to enhance clarity. We appreciate the reviewer’s suggestion to combine FLiSimBA with various analysis methods. However, the primary focus of our manuscript is to call attention to how specific contributing factors in biological experiments influence FLIM data, and to provide a tool that rigorously considers these factors to simulate FLIM data, which can then be used for fitting. Therefore, we did not expand the scope of our manuscript. Instead, we have added in the Discussion that ‘FLiSimBA can be used to test multiple fitting methods and lifetime metrics as an exciting future direction for identifying the best analysis method for specific experimental conditions’, citing relevant references.

      I would also improve the content of the GitHub repository as it is very hard to identify the source code used for simulation and fitting.

      We have reorganized and relabeled our GitHub repository and now have three folders labeled as ‘Simulation_inMatlab’, ‘DataAnalysis_inMatlab’, and ‘SimulationAnalysis_inPython’. We also updated the clarification of the contents of each folder in the README file.

      Reviewer #3 (Recommendations for the authors): 

      (1) P. 10 "For example, to detect a P1 change of 0.006 or a lifetime change of 5 ps with one sample measurement in each comparison group, approximately 300,000 photons are needed." If I am reading the graphs in Figures 3B and C, this sentence is talking about the red line. However, the intersection of 0.006 in the MDD of P1 in 3B and red is not 3E5 photons. And the intersection of 0.005 ns and red in 3C is not 3E5 photons either. Are you sure you are talking about n=1? Maybe the values are correct for the blue curve with n=5.

      Thank you for catching our error. We have corrected the text to ‘with five sample measurements’.

      (2) Figure 2 (B) legend: It would be helpful to specify what is being compared in the legend. For example, consider revising "* p < 0.05 vs sensor only; n.s. not significant vs sensor + autoF; # p < 0.05 vs sensor + autoF. Two-way ANOVA with Šídák's multiple comparisons test" to "* p < 0.05 for sensor + autoF (cyan) vs sensor only; n.s. not significant for final simulated data (purple) vs sensor + autoF; # p < 0.05 for final simulated data (purple) vs sensor + autoF. Two-way ANOVA with Šídák's multiple comparisons test".

      We’ve made the change and thanks for the suggestion to make it clearer.

      (3) Figure 2 (c) Can you please show the same Two-way ANOVA test values for Experimental vs. Sensor only and for Experimental vs. Sensor + autoF? Currently, the value (n.s.) is marked only for Experimental vs. Final simulation. Given that the experimental data are sparse (compared to the simulations), it seems likely that there may be no significant difference among the 3 different simulations regarding how well they match the experimental data. Also, can you specify the P1 and P2 of the experimental data  used to generate the simulated data on this panel? Also, what is the reason why P1=0.5 was used for panels A and B, instead of the value matching the experimental value?

      As the reviewer suggested, we have included statistical tests in the figure (now Supplementary Fig. 1C). Please see our response to the Public Review of Reviewer 3’s comments as well as our changes in Materials and Methods on other changes and their rationale for this figure. We have now specified the P<sub>1</sub> value of the experimental data used to generate the simulated data on this panel both in Figure Legends and Materials and Methods. Based on the suggestion, we have now used the same P<sub>1</sub> value in Fig. 2B.

    1. Be clear about the consequences of using AI to generate pornographic images. Tell students that they may see apps to create nude pictures advertised on platforms like TikTok. Though they may be curious or think it's funny (because the pictures aren't "real"), using AI to generate nude pictures of someone is harassment and illegal. It doesn't just harm the victim—law enforcement could get involved. Victims should tell a trusted adult, report to authorities, and can also report the incident to CyberTipline.org.

      This part really stands out as a necessary and urgent conversation. With how normalized AI tools have become on platforms like TikTok, I can see how some students might not fully grasp the seriousness of using them to generate explicit content. It’s not a harmless joke; it’s a form of harassment with real legal and emotional consequences. As someone who spends a lot of time online, I think we all need to take more responsibility in calling out this kind of behavior and making sure people know where to get help.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #3 (Public Review)

      Summary:

      In this paper the authors examined the effects of strip cropping, a relatively new agricultural technique of alternating crops in small strips of several meters wide, on ground beetle diversity. The results show an increase in species diversity (i.e. abundance and species richness) of the ground beetle communities compared to monocultures.

      Strengths:

      The article is well written; it has an easily readable tone of voice without too much jargon or overly complicated sentence structure. Moreover, as far as reviewing the models in depth without raw data and R scripts allows, the statistical work done by the authors looks good. They have thought out well how to handle heterogeneous, unbalanced and taxonomically unspecific yet spatially and temporally correlated field data. The models applied and the model checks performed are appropriate for the data at hand. Combining RDA and PCA axes together is a nice touch. Moreover, after the first round of reviews, the authors have done a great job at rewriting the paper to make it less overstated, more relevant to the data at hand and more solid in the findings. Many of the weaknesses noted in the first review have been dealt with. The overall structure of the paper is good, with a clear introduction, hypotheses, results section and discussion.

      We are grateful for this positive feedback. We are glad that our extensive revision, following thorough review by three reviewers, has paid off in addressing earlier weaknesses of our manuscript.

      Weaknesses:

      The weaknesses that remain are mainly due to a difficult dataset and choices that could have stressed certain aspects more, like the relationship between strip cropping and intercropping. The mechanistic understanding of strip cropping is what is at stake here. Does strip cropping behave similarly to intercropping, a technique which has been proven to be beneficial to biodiversity because of added effects due to increased resource efficiency and greater plant species richness?

      Unfortunately, the authors do not go into this in the introduction or otherwise and simply state that they consider strip cropping a form of intercropping.

      We agree with the reviewer that a mechanistic understanding on how intercropping and strip cropping differ would be very interesting. However, we also feel that this topic is somewhat beyond the scope of the current manuscript. We are already planning work to elucidate mechanisms that may explain the pest and suppressive effects of strip cropping.

      I also do not like the exclusive focus on percentages, as these are dimensionless. I think more could have been done to show underlying structure in the data, even after rarefaction.

      While we generally agree with this point raised by the reviewer, for our heterogeneous dataset it was difficult to come up with meaningful units with dimensions. Therefore, we believe that percentages are the most suitable approach to present readers a fair comparison of the treatments.

      A further weakness is a limited embedding into the larger scientific discourses other than providing references. But this may be a matter of style and/or taste.

      We believe our manuscript to be well-embedded within the relevant scientific discourse, but as indicated by reviewer 3 this might indeed be a matter of style/taste. Without exact examples it is difficult for us to judge this point.

      Reviewer #3 (Recommendations for the authors): 

      Suggestion for title: "Strip cropping shows promising preliminary increases in ground beetle community diversity compared to monocultures"

      We agree that the title could indeed be nuanced. We incorporated the suggested title, except for the word “preliminary”, as we felt that this is slightly misplaced for a 4-year study conducted at 4 locations.

      line 26: the word previous may be confusing to readers, as it suggests previous research on beetles or insects. I think it would be better to use for instance "related" or "productivity focused research"

      We agree that this wording might be confusing, and changed it to “other studies showed”.

      Line 84-85: this is vague. can you make explicit what you are trying to answer here?

      We made “biodiversity metric changes” more explicit, and changed the sentence accordingly.

      Line 88-89: I think this would fit better with the first question in line 83-84, so I suggest placing it upwards. Also, I think you mean abundant instead of common. Common suggests commonness in the entire population. Abundant suggests found often in this study. While these definitions may very much overlap, they are distinctly different.

      We have moved this sentence up and changed “common” to “abundant”. To make the result section more in line with this section, we also moved the section on the relationship between crop configuration and abundant genera up.  

      Line 146: defining rareness of species should be in the methods section. Also "following" would be better than "according"

      We now added a sentence on how we examine habitat preferences and rarity in the methods section (line 316-317). We also changed “according to” to “following”.

      Line 291: it is called being "flush" with the soil surface. This expression is not much used by non-native speakers, but is regularly encountered in studies on pitfalls, so the authors could decide to change the sentence using the proper English vernacular.

      Suggestion incorporated.

      Line 322-327, this method could do with a reference

      This method is a relatively standard calculation to calculate relative changes and to center variation around zero. Nevertheless, we added a reference to a paper that used the same method.

      Line: 333-335. I would still like to see a reference for this method.

      This methodology has not been described in literature to the best of our knowledge. As we compared two crops within strip cropping with their respective monoculture references, we compare one strip cropping field with two monocultural fields. Here we took a conservative approach by comparing the strip crop field with the monoculture with the highest richness and activity density, to see if strip cropped fields outperformed monocultures with diverse ground beetle communities.

      Line 364-366. references?

      We have added references for these R packages.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The authors claim that they can use a combination of repetitive transcranial magnetic stimulation (intermittent theta burst-iTBS) and transcranial alternating current stimulation (gamma tACS) to cause slight improvements in memory in a face/name/profession task.

      Strengths:

      The idea of stimulating the human brain non-invasively is very attractive because, if it worked, it could lead to a host of interesting applications. The current study aims to evaluate one such exciting application.

      Weaknesses:

      (1) It is highly unclear what, if anything, transpires in the brain with non-invasive stimulation. To cite one example of many, a rigorous study in rats and human cadavers, compellingly showed that traditional parameters of transcranial electrical stimulation lead to no change in brain activity due to the attenuation by the soft tissue and skull (Mihály Vöröslakos et al Nature Communications 2018): https://www.nature.com/articles/s41467-018-02928-3. It would be very useful to demonstrate via invasive neurophysiological recordings that the parameters used in the current study do indeed lead to any kind of change in brain activity. Of course, this particular study uses a different non-invasive stimulation protocol.

      Thank you for raising the important issue regarding the actual neurophysiological effects of non-invasive brain stimulation. Unfortunately, invasive neurophysiological recordings in humans for this type of study are not feasible due to ethical constraints, while studies on cadavers or rodents would not fully resolve our question. Indeed, the authors of the cited study (Mihály Vöröslakos et al., Nature Communications, 2018) highlight the impossibility of drawing definitive conclusions about the exact voltage required in the in-vivo human brain due to significant differences between rats and humans, as well as the in-vivo human brain and cadavers due to alterations in electrical conductivity that occur in postmortem tissue. Huang and colleagues addressed the difficulties in reaching direct evidence of non-invasive brain stimulation (NIBS) effects in a review published in Clinical Neurophysiology in 2017. They conclude that the use of EEG to assess brain response to TMS has great potential for a less indirect demonstration of plasticity mechanisms induced by NIBS in humans.

      To address this challenge, we conducted Experiments 3 and 4, which respectively examined the neurophysiological and connectivity changes induced by the stimulation in a non-invasive manner using TMS-EEG and fMRI. The observed changes in brain oscillatory activity (increased gamma oscillatory activity), cortical excitability (enhanced posteromedial parietal cortex reactivity), and brain connectivity (strengthened connections between the precuneus and hippocampi) provided evidence of the effects of our non-invasive brain stimulation protocol, further supporting the behavioral data.

      Additionally, we carefully considered the issue of stimulation distribution and, in response, performed a biophysical modeling analysis and E-field calculation using the parameters employed in our study (see Supplementary Materials).

      We acknowledge that further exploration of this aspect would be highly valuable, and we agree that it is worth discussing both as a technical limitation and as a potential direction for future research. We therefore modified the discussion accordingly (main text, lines 280-289).

      “Although we studied TMS and tACS propagation through the E-field modeling and observed an increase in the precuneus gamma oscillatory activity, excitability and connectivity with the hippocampi, we cannot exclude that our results might reflect the consequences of stimulating more superficial parietal regions other than the precuneus nor report direct evidence of microscopic changes in the brain after the stimulation. Invasive neurophysiological recordings in humans for this type of study are not feasible due to ethical constraints. Studies on cadavers or rodents would not fully resolve our question due to significant differences between them (i.e. rodents do not have an anatomical correspondence while cadavers have alterations in electrical conductivity occurring in postmortem tissue). However, further exploration of this aspect in future studies would help in the understanding of γtACS+iTBS effects.”

      (2) If there is any brain activity triggered by the current stimulation parameters, then it is extremely difficult to understand how this activity can lead to enhancing memory. The brain is complex. There are hundreds of neuronal types. Each neuron receives precise input from about 10,000 other neurons with highly tuned synaptic strengths. Let us assume that the current protocol does lead to enhancing (or inhibiting) simultaneously the activity of millions of neurons. It is unclear whether there is any activity at all in the brain triggered by this protocol, it is also unclear whether such activity would be excitatory, or inhibitory. It is also unclear how many neurons, let alone what types of neurons would change their activity. How is it possible that this can lead to memory enhancement? This seems like using a hammer to knock on my laptop and hope that the laptop will output a new Mozart-like sonata.

      Thank you for your comment. As you correctly point out, we still do not have precise knowledge of which neurons—and to what extent—are activated during non-invasive brain stimulation in humans. However, this challenge is not limited to brain stimulation but applies to many other therapeutic interventions, including psychiatric medications, without limiting their use.

      Nevertheless, a substantial body of research has investigated the mechanisms underlying the efficacy of TMS and tACS in producing behavioral after-effects, primarily through its ability to induce long-term potentiation (Bliss & Collingridge, The Journal of Physiology, 1993a; Ridding & Rothwell, Nature Reviews Neuroscience, 2007; Huang et al., Clinical Neurophysiology, 2017; Koch et al., Neuroimage 2018; Koch et al., Brain 2022; Jannati et al., Neuropsychopharmacology, 2023; Wischnewski et al., Trends in Cognitive Science, 2023; Griffiths et al., Trends in Neuroscience, 2023).

      We acknowledge that we took this important aspect for granted. We consequently expanded the introduction accordingly (main text, lines 48-60).

      “Repetitive transcranial magnetic stimulation (rTMS) and transcranial alternating current stimulation (tACS) are two forms of NIBS widely used to enhance memory performances (Grover et al., 2022; Koch et al., 2018; Wang et al., 2014). rTMS, based on the principle of Faraday, induces depolarization of cortical neuronal assemblies and leads to after-effects that have been linked to changes in synaptic plasticity involving mechanisms of long-term potentiation (LTP) (Huang et al., 2017; Jannati et al., 2023). On the other hand, tACS causes rhythmic fluctuations in neuronal membrane potentials, which can bias spike timing, leading to an entrainment of the neural activity (Wischnewski et al., 2023). In particular, the induction of gamma oscillatory activity has been proposed to play an important role in a type of LTP known as spike timing-dependent plasticity, which depends on a precise temporal delay between the firing of a presynaptic and a postsynaptic neuron (Griffiths and Jensen, 2023). Both LTP and gamma oscillations have a strong link with memory processes such as encoding (Bliss and Collingridge, 1993; Griffiths and Jensen, 2023; Rossi et al., 2001), pointing to rTMS and tACS as good candidates for memory enhancement.”

      (3) Even if there is any kind of brain activation, it is unclear why the authors seem to be so sure that the precuneus is responsible. Are there neurophysiological data demonstrating that the current protocol only activates neurons in the precuneus? Of note, the non-invasive measurements shown in Figure 3 are very weak (Figure 3A top and bottom look very similar, and Figure 3C left and right look almost identical). Even if one were to accept the weak alleged differences in Figure 3, there is no indication in this figure that there is anything specific to the precuneus, rather a whole brain pattern. This would be the kind of minimally rigorous type of evidence required to make such claims. In a less convincing fashion, one could look at different positions of the stimulation apparatus. This would not be particularly compelling in terms of making a statement about the precuneus. But at least it would show that the position does matter, and over what range of distances it matters, if it matters.

      Thank you for your feedback. Our assumption that the precuneus plays a key role in the observed effects is based on several factors:

      (1) The non-invasive stimulation protocol was applied to an individually identified precuneus for each participant. Given existing evidence on TMS propagation, we can reasonably assume that the precuneus was at least a mediator of the observed effects (Ridding & Rothwell, Nature Reviews Neuroscience 2007). For further details about target identification and TMS and tACS propagation, please refer to the MRI data acquisition section in the main text and Biophysical modeling and E-field calculation section in the supplementary materials.

      (2) To investigate the effects of the neuromodulation protocol on cortical responses, we conducted a whole-brain analysis using multiple paired t-tests comparing each data point between different experimental conditions. To minimize the type I error rate, data were permuted with the Monte Carlo approach and significant p-values were corrected with the false discovery rate method (see the Methods section for details). The results identified the posterior-medial parietal areas as the only regions showing significant differences across conditions.

      (3) To control for potential generalized effects, we included a control condition in which TMS-EEG recordings were performed over the left parietal cortex (adjacent to the precuneus). This condition did not yield any significant results, reinforcing the cortical specificity of the observed effects.
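
      The statistical procedure named in point (2) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes the Monte Carlo permutation is implemented as random sign-flipping of within-subject differences (equivalent to swapping condition labels in a paired design) and that the false discovery rate correction is the Benjamini-Hochberg procedure. The function names, array shapes, and simulated data are hypothetical.

```python
import numpy as np


def paired_permutation_pvalues(cond_a, cond_b, n_perm=2000, seed=0):
    """Two-sided Monte Carlo p-values for paired t statistics.

    cond_a, cond_b: arrays of shape (n_subjects, n_points), one column per
    data point (e.g. a channel-time or channel-frequency sample).
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(cond_a, dtype=float) - np.asarray(cond_b, dtype=float)
    n_subj = diff.shape[0]

    def t_stat(d):
        return d.mean(axis=0) / (d.std(axis=0, ddof=1) / np.sqrt(n_subj))

    observed = np.abs(t_stat(diff))
    exceed = np.zeros(diff.shape[1])
    for _ in range(n_perm):
        # Randomly swap condition labels within each subject.
        signs = rng.choice([-1.0, 1.0], size=(n_subj, 1))
        exceed += np.abs(t_stat(diff * signs)) >= observed
    # The +1 terms count the observed labelling as one permutation.
    return (exceed + 1) / (n_perm + 1)


def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of tests rejected at FDR level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank passing its threshold
        reject[order[:k + 1]] = True
    return reject


# Simulated example: 20 subjects, 50 data points, effect in the first 5 only.
rng = np.random.default_rng(1)
baseline = rng.normal(size=(20, 50))
post = baseline + rng.normal(scale=0.5, size=(20, 50))
post[:, :5] += 1.5
significant = benjamini_hochberg(paired_permutation_pvalues(post, baseline))
print(np.nonzero(significant)[0])
```

      Data points that survive the correction are the ones reported as significant; in a whole-brain analysis, the spatial pattern of surviving points is what supports a claim of regional specificity.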

      However, as stated in the Discussion, we do not claim that precuneus activity alone accounts for the observed effects. As shown in Experiment 4, stimulation led to connectivity changes between the precuneus and hippocampus, a network widely recognized as a key contributor to long-term memory formation (Bliss & Collingridge, Nature 1993). These connectivity changes suggest that precuneus stimulation triggered a ripple effect extending beyond the stimulation site, engaging the broader precuneus-hippocampus network.

      Regarding Figure 3A, it represents the overall expression of oscillatory activity detected by TMS-EEG. Since each frequency band has a different optimal scaling, the figure reflects a graphical compromise. A more detailed representation of the significant results is provided in Figure 3B. The effect sizes for gamma oscillatory activity in the delta T1 and T2 conditions were 0.52 and 0.50, respectively, which correspond to a medium effect based on Cohen’s d interpretation.

      We added a paragraph in the discussion to improve the clarity of the manuscript regarding this important aspect (lines 193-198).

      “Given the existing evidence on TMS propagation and the computation of the biophysical model with the E-field, we can reasonably assume that the individually identified PC was a mediator of the observed effects (Ridding and Rothwell, 2007). Moreover, we observed specific cortical changes in the posteromedial parietal areas, as evidenced by the whole-brain analysis conducted on TMS-EEG data and the absence of effect on the lateral posterior parietal cortex used as a control condition.”

      (4) In the absence of any neurophysiological documentation of a direct impact on the brain, an argument in this type of study is that the behavioral results show that there must be some kind of effect. I agree with this argument. This is also the argument for placebo effects, which can be extremely powerful and useful even if the mechanism is unrelated to what is studied. Then let us dig into the behavioral results.

      Hoping to have already addressed your concern regarding the neurophysiological impact of the stimulation on the brain, we would like to emphasize that the behavioral results were obtained controlling for placebo effects. This was achieved by having participants perform the task under different stimulation conditions, including a sham condition.

      4a. There does not seem to be any effect on the STMB task, therefore we can ignore this.

      4b. The FNAT task is minimally described in the supplementary material. There are no experimental details to understand what was done. What was the size of the images? How long were the images presented for? Were there any repetitions of the images? For how long did the participants study the images? Presumably, all the names and occupations are different? What were the genders of the faces? What is chance level performance? Presumably, the same participant saw different faces across the different stimulation conditions. If not, then there can be memory effects across different conditions that are even more complex to study. If yes, then it would be useful to show that the difficulty is the same across the different stimuli.

      Thank you for pointing out the gaps in the description of the FNAT task. We added the required information to the supplementary information (lines 93-101).

      “Each picture's face size was 19x15cm. In the learning phase, faces were shown along with names and occupations for 8 seconds each (totaling approximately 2 minutes). During immediate recall, the faces were displayed alone for 8 seconds. In the delayed recall and recognition phase, pictures were presented until the subject provided answers. We used a different set of stimuli for each stimulation condition, resulting in a total of 3 parallel task forms balanced across conditions and session order. All parallel forms comprised 6 male and 6 female faces; for each sex, there were 2 young adults (around 30 years old), 2 middle-aged adults (around 50 years old), and 2 elderly adults (around 70 years old). Before the experiments, we conducted a pilot study to ensure no differences existed between the parallel forms of the task.”

      The chance level in the immediate and delayed recall is not quantifiable since the participants had to freely recall the name and the occupation without a multiple choice. In the recognition, the chance level was around 33% (since the possible answers were 3).

      4c. Although not stated clearly, if I understand FNAT correctly, the task is based on just 12 presentations. Each point in Figure 2A represents a different participant. Unfortunately, there is no way of linking the performance of individual participants across the conditions with the information provided. Lines joining performance for each participant would be useful in this regard. Because there are only 12 faces, the results are quantized in multiples of 100/12 % in Figure 3A. While I do not doubt that the authors did their homework in terms of the statistical analyses, it is difficult to get too excited about these 12 measurements. For example, take Figure 3A immediate condition TOTAL, arguably the largest effect in the whole paper. It seems that on average, the participants may remember one more face/name/occupation.

      Thank you for the suggestion. We added graphs showing lines linking the performance of individual participants across conditions to improve clarity, please see Fig.2 revised. We apologize for the lack of clarity in the description of the FNAT. As you correctly pointed out, we used the percentage based on the single association between face, name and occupation (12 in total). However, each association consisted of three items, resulting in a total of 36 items to learn and associate – we added a paragraph to make it more explicit in the manuscript (lines 425-430).

      “We considered a correct association when a subject was able to recall all the information for each item (i.e. face, name and occupation), resulting in a total of 36 items to learn and associate. To further investigate the effect on FNAT we also computed a partial recall score accounting for those items where subjects correctly matched only names with faces (FNAT NAME) and only occupations with faces (FNAT OCCUPATION). See supplementary information for score details.”

      In the example you mentioned, participants were, on average, able to correctly recall and associate three more items compared to the other conditions. While this difference may not seem striking at first glance, it is important to consider that we assessed memory performance after a single, three-minute stimulation session. Similar effects are typically observed only after multiple stimulation sessions (Koch et al., NeuroImage, 2018; Grover et al., Nature Neuroscience, 2022). Moreover, memory performance changes are often measured by a limited set of stimuli due to methodological constraints related to memory capacity. For example, the Rey Auditory Verbal Learning task, which requires learning and recalling 15 words, is a typical test used to detect memory changes (Koch et al., Neuroimage, 2018; Benussi et al., Brain stimulation 2021; Benussi et al., Annals of Neurology, 2022).

      4d. Block effects. If I understand correctly, the experiments were conducted in blocks. This is always problematic. Here is one example study that articulated the big problems in block designs (Li et al TPAMI 2021): https://ieeexplore.ieee.org/document/9264220

      Thank you for the interesting reference. According to this paper, in a block design, EEG or fMRI recordings are performed in response to different stimuli of a given class presented in succession. If this is the case, it does not correspond to our experimental design where both TMS-EEG and fMRI were conducted in resting state on different days according to the different stimulation conditions.

      4e. Even if we ignore the lack of experimental descriptions, problems with lack of evidence of brain activity, the minimalistic study of 12 faces, problems with the block design, etc. at the end of the day, the results are extremely weak. In FNAT, some results are statistically significant, some are not. The interpretation of all of this is extremely complex. Continuing with Figure 3A, it seems that the author claims that iTBS+gtACS > iTBS+sham-tACS, but iTBS+gtACS ~ sham+sham. I am struggling to interpret such a result. When separating results by name and occupation, the results are even more perplexing. There is only one condition that is statistically significant in Figure 3A NAME and none in the occupation condition.

      Thank you again for your feedback. Hoping to have thoroughly addressed your initial concerns in our previous responses, we now move on to your observations regarding the behavioral results, assuming you were referring to Figure 2A. The main finding of this study is the improvement in long-term memory performance, specifically the ability to correctly recall the association between face, name, and occupation (total FNAT), which was significantly enhanced in both Experiments 1 and 2. However, we also aimed to explore the individual contributions of name and occupation separately to gain a deeper understanding of the results. Our analysis revealed that the improvement in total FNAT was primarily driven by an increase in name recall rather than occupation recall. We understand that this may have caused some confusion. We consequently modified the manuscript (lines 97-99; 107-111; 425-430) to make it clearer and moved the graph relative to FNAT NAME and OCCUPATION from Fig. 2 in the main text to Fig. S4 in the supplementary information.

      “Dual iTBS+γtACS increased the performances in recalling the association between face, name and occupation (FNAT accuracy) both for the immediate (F<sub>2,38</sub>=7.18; p =0.002; η<sup>2</sup><sub>p</sub>=0.274) and the delayed (F<sub>2,38</sub>=5.86; p =0.006; η<sup>2</sup><sub>p</sub>=0.236) recall performances (Fig. 2, panel A).”

      “The in-depth analysis of the FNAT accuracy investigating the specific contribution of face-name and face-occupation recall revealed that dual iTBS+γtACS increased the performances in the association between face and name (FNAT NAME) delayed recall (F<sub>2,38</sub>=3.46; p=0.042; η<sup>2</sup><sub>p</sub>=0.154; iTBS+γtACS vs. sham-iTBS+sham-tACS: 42.9±21.5 % vs. 33.8±19 %; p=0.048 Bonferroni corrected) (Fig. S4, supplementary information).”

      “We considered a correct association when a subject was able to recall all the information for each item (i.e. face, name and occupation), resulting in a total of 36 items to learn and associate. To further investigate the effect on FNAT we also computed a partial recall score accounting for those items where subjects correctly matched only names with faces (FNAT NAME) and only occupations with faces (FNAT OCCUPATION). See supplementary information for score details.”

      Regarding the stimulation conditions, your concerns about the performance pattern (iTBS+gtACS > iTBS+sham-tACS, but iTBS+gtACS ~ sham+sham) are understandable. However, this new protocol was developed precisely in response to the variability observed in behavioral outcomes following non-invasive brain stimulation, particularly when used to modulate memory functions (Corp et al., 2020; Pabst et al., 2022). As discussed in the manuscript, it is intended as a boost to conventional non-invasive brain stimulation protocols, leveraging the mechanisms outlined in the Discussion section.

      (5) In sum, it would be amazing to be able to use non-invasive stimulation for any kind of therapeutic purpose as the authors imagine. More work needs to be done to convince ourselves that this kind of approach is viable. The evidence provided in this study is weak.

      We hope our response will be carefully considered, fostering a constructive exchange and leading to a reassessment of your evaluation.

      Reviewer #2 (Public review):

      Summary:

      The manuscript "Dual transcranial electromagnetic stimulation of the precuneus-hippocampus network boosts human long-term memory" by Borghi and colleagues provides evidence that the combination of intermittent theta burst TMS stimulation and gamma transcranial alternating current stimulation (γtACS) targeting the precuneus increases long-term associative memory in healthy subjects compared to iTBS alone and sham conditions. Using a rich dataset of TMS-EEG and resting-state functional connectivity (rs-FC) maps and structural MRI data, the authors also provide evidence that dual stimulation increased gamma oscillations and functional connectivity between the precuneus and hippocampus. Enhanced memory performance was linked to increased gamma oscillatory activity and connectivity through white matter tracts.

      Strengths:

      The combination of personalized repetitive TMS (iTBS) and gamma tACS is a novel approach to targeting the precuneus, and thereby, connected memory-related regions to enhance long-term associative memory. The authors leverage an existing neural mechanism engaged in memory binding, theta-gamma coupling, by applying TMS at theta burst patterns and tACS at gamma frequencies to enhance gamma oscillations. The authors conducted a thorough study that suggests that simultaneous iTBS and gamma tACS could be a powerful approach for enhancing long-term associative memory. The paper was well-written, clear, and concise.

      Weaknesses:

      (1) The study did not include a condition where γtACS was applied alone. This was likely because previous work indicated that a single 3-minute γtACS did not produce significant effects, but this limits the ability to isolate the specific contribution of γtACS in the context of this target and memory function.

      Thank you for your comments. As you pointed out, we did not include a condition where γtACS was applied alone. This decision was based on the findings of Guerra et al. (Brain Stimulation 2018), who investigated the same protocol and reported no aftereffects. Given the substantial burden of the experimental design on patients and our primary goal of demonstrating an enhancement of effects compared to the standalone iTBS protocol, we decided to leave out this condition. However, you raise an important aspect that should be further discussed, so we modified the limitations section accordingly (lines 290-297).

      “We did not assess the effects of γtACS alone. This decision was based on the findings of Guerra et al. (Guerra et al., 2018), who investigated the same protocol and reported no aftereffects. Given the substantial burden of the experimental design on patients and our primary goal of demonstrating an enhancement of effects compared to the standalone iTBS protocol, we decided to leave out this condition. While examining the effects of γtACS alone could help isolate its specific contribution to this target and memory function, extensive research has shown that achieving a cognitive enhancement aftereffect with tACS alone typically requires around 20–25 minutes of stimulation (Grover et al., 2023).”

      (2) The authors applied stimulation for 3 minutes, which seems to be based on prior tACS protocols. It would be helpful to present some rationale for both the duration and timing relative to the learning phase of the memory task. Would you expect additional stimulation prior to recall to benefit long-term associative memory?

      Thank you for your comment and for raising this interesting point. As you correctly noted, the protocol we used has a duration of three minutes, a choice based on previous studies demonstrating its greater neurophysiological efficacy compared with single stimulation. Specifically, these studies have shown that the combined stimulation enhanced gamma-band oscillations and increased cortical plasticity (Guerra et al., Brain Stimulation 2018; Maiella et al., Scientific Reports 2022). Given that the precuneus (Brodt et al., Science 2018; Schott et al., Human Brain Mapping 2018), gamma oscillations (Osipova et al., Journal of Neuroscience 2006; Deprés et al., Neurobiology of Aging 2017; Griffiths et al., Trends in Neurosciences 2023), and cortical plasticity (Brodt et al., Science 2018) are all associated with memory formation and encoding processes, we decided to apply the co-stimulation immediately before the encoding phase to enhance its efficacy. We added this paragraph to the manuscript rationale (lines 48-60).

      “Repetitive transcranial magnetic stimulation (rTMS) and transcranial alternating current stimulation (tACS) are two forms of NIBS widely used to enhance memory performances (Grover et al., 2022; Koch et al., 2018; Wang et al., 2014). rTMS, based on the principle of Faraday, induces depolarization of cortical neuronal assemblies and leads to after-effects that have been linked to changes in synaptic plasticity involving mechanisms of long-term potentiation (LTP) (Huang et al., 2017; Jannati et al., 2023). On the other hand, tACS causes rhythmic fluctuations in neuronal membrane potentials, which can bias spike timing, leading to an entrainment of the neural activity (Wischnewski et al., 2023). In particular, the induction of gamma oscillatory activity has been proposed to play an important role in a type of LTP known as spike timing-dependent plasticity, which depends on a precise temporal delay between the firing of a presynaptic and a postsynaptic neuron (Griffiths and Jensen, 2023). Both LTP and gamma oscillations have a strong link with memory processes such as encoding (Bliss and Collingridge, 1993; Griffiths and Jensen, 2023; Rossi et al., 2001), pointing to rTMS and tACS as good candidates for memory enhancement.”

      Regarding the question of whether stimulation could also benefit recall, the answer is yes. We can speculate that repeating the stimulation before recall might provide an additional boost. This is supported by evidence showing that both the precuneus and gamma oscillations are involved in recall processes (Flanagin et al., Cerebral Cortex 2023; Griffiths et al., Trends in Neurosciences 2023). Furthermore, previous research suggests that reinstating the same brain state as during encoding can enhance recall performance (Javadi et al., The Journal of Neuroscience 2017). We added this consideration to the discussion (lines 305-311).

      “Future studies should further investigate the effects of stimulation on distinct memory processes. In particular, stimulation could be applied before retrieval (Rossi et al., 2001), to better elucidate its specific contribution to the observed enhancements in memory performance. Additionally, it would be worth examining whether repeated stimulation - administered both before encoding and before retrieval - could produce a boosting effect. This is especially relevant in light of findings showing that matching the brain state between retrieval and encoding can significantly enhance memory performance (Javadi et al., 2017).”

      (3) How was the burst frequency of theta iTBS and gamma frequency of tACS chosen? Were these also personalized to subjects' endogenous theta and gamma oscillations? If not, were increases in gamma oscillations specific to patients' endogenous gamma oscillation frequencies or the tACS frequency?

      The stimulation protocol was chosen based on previous studies (Guerra et al., Brain Stimulation 2018; Maiella et al., Scientific Reports 2022). The gamma tACS sinusoidal waveform was set at 70 Hz, while iTBS consisted of ten bursts of three pulses at 50 Hz lasting 2 s, repeated every 10 s with an 8 s pause between consecutive trains, for a total of 600 pulses over 190 s (see iTBS+γtACS neuromodulation protocol section). In particular, the theta iTBS was inspired by protocols used in animal models to elicit LTP in the hippocampus (Huang et al., Neuron 2005). Consequently, neither the theta burst frequency of iTBS nor the gamma frequency of tACS was personalized. The increase in gamma oscillations was measured relative to the patient’s baseline and did not correspond to the administered tACS frequency.
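The timing arithmetic of the iTBS protocol described above can be checked with a short sketch. This is only an illustration, not the authors' stimulator configuration: the 5 Hz burst repetition rate is an assumption inferred from "ten bursts ... lasting 2 s", and all constant and function names are hypothetical.

```python
# Illustrative sketch of the iTBS pulse schedule described in the response.
# Assumption: bursts repeat at 5 Hz (inferred from "ten bursts lasting 2 s").

PULSES_PER_BURST = 3      # three pulses per burst
INTRA_BURST_HZ = 50       # pulses within a burst are delivered at 50 Hz
BURSTS_PER_TRAIN = 10     # ten bursts per 2 s train
BURST_RATE_HZ = 5         # assumed theta-range burst repetition rate
TRAIN_PERIOD_S = 10       # 2 s train followed by an 8 s pause
TOTAL_PULSES = 600        # total pulses reported for the protocol


def itbs_pulse_times_ms():
    """Return the onset time (in ms) of every pulse in the protocol."""
    pulses_per_train = PULSES_PER_BURST * BURSTS_PER_TRAIN
    n_trains = TOTAL_PULSES // pulses_per_train
    burst_interval_ms = 1000 // BURST_RATE_HZ    # 200 ms between burst onsets
    pulse_interval_ms = 1000 // INTRA_BURST_HZ   # 20 ms between pulses
    times = []
    for train in range(n_trains):
        train_onset_ms = train * TRAIN_PERIOD_S * 1000
        for burst in range(BURSTS_PER_TRAIN):
            for pulse in range(PULSES_PER_BURST):
                times.append(
                    train_onset_ms
                    + burst * burst_interval_ms
                    + pulse * pulse_interval_ms
                )
    return times


times = itbs_pulse_times_ms()
n_trains = TOTAL_PULSES // (PULSES_PER_BURST * BURSTS_PER_TRAIN)
last_train_onset_s = (n_trains - 1) * TRAIN_PERIOD_S
```

Under these assumptions the schedule contains 20 trains of 30 pulses each, and the last train starts at 190 s, which is consistent with the duration reported for the protocol.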

      (4) The authors do a thorough job of analyzing the increase in gamma oscillations in the precuneus through TMS-EEG; however, the authors may also analyze whether theta oscillations were also enhanced through this protocol due to the iTBS potentially targeting theta oscillations. This may also be more robust than gamma oscillations increases since gamma oscillations detected on the scalp are very low amplitude and susceptible to noise and may reflect activity from multiple overlapping sources, making precise localization difficult without advanced techniques.

      Thank you for the suggestion. We analyzed theta oscillations, finding no changes.

      (5) Figure 4: Why are connectivity values pre-stimulation for the iTBS and sham tACS stimulation condition so much higher than the dual stimulation? We would expect baseline values to be more similar.

      We acknowledge that the pre-stimulation connectivity values for the iTBS and sham tACS conditions appear higher than those for the dual stimulation condition. However, as noted in our statistical analyses, there were no significant differences at baseline between conditions (p-FDR= 0.3514), suggesting that any apparent discrepancy is due to natural variability rather than systematic bias. One potential explanation for these differences is individual variability in baseline connectivity measures, which can fluctuate due to factors such as intrinsic neural dynamics, participant state, or measurement noise. Despite these variations, our statistical approach ensures that any observed post-stimulation effects are not confounded by pre-existing differences.

      (6) Figure 2: How are total association scores significantly different between stimulation conditions, but individual name and occupation associations are not? Further clarification of how the total FNAT score is calculated would be helpful.

      We apologize for any lack of clarity. The total FNAT score reflects the ability to correctly recall all the information associated with a person—specifically, the correct pairing of the face, name, and occupation. Participants received one point for each triplet they accurately recalled. The scores were then converted into percentages, as detailed in the Face-Name Associative Task Construction and Scoring section in the supplementary materials.

      Total FNAT was the primary outcome measure. However, we also analyzed name and occupation recall separately to better understand their partial contributions. Our analysis revealed that the improvement in total FNAT was primarily driven by an increase in name recall rather than occupation recall.

      We acknowledge that this distinction may have caused some confusion. To improve clarity, we revised the manuscript accordingly (lines 97-98; 107-111; 425-430).

      “Dual iTBS+γtACS increased the performances in recalling the association between face, name and occupation (FNAT accuracy) both for the immediate (F<sub>2,38</sub>=7.18; p=0.002; η<sup>2</sup><sub>p</sub>=0.274) and the delayed (F<sub>2,38</sub>=5.86; p=0.006; η<sup>2</sup><sub>p</sub>=0.236) recall performances (Fig. 2, panel A).”

      “The in-depth analysis of the FNAT accuracy investigating the specific contribution of face-name and face-occupation recall revealed that dual iTBS+γtACS increased the performances in the association between face and name (FNAT NAME) delayed recall (F<sub>2,38</sub>=3.46; p=0.042; η<sup>2</sup><sub>p</sub>=0.154; iTBS+γtACS vs. sham-iTBS+sham-tACS: 42.9±21.5 % vs. 33.8±19 %; p=0.048 Bonferroni corrected) (Fig. S4, supplementary information).”

      “We considered a correct association when a subject was able to recall all the information for each item (i.e. face, name and occupation), resulting in a total of 36 items to learn and associate. To further investigate the effect on FNAT we also computed a partial recall score accounting for those items where subjects correctly matched only names with faces (FNAT NAME) and only occupations with faces (FNAT OCCUPATION). See supplementary information for score details.”

      We also moved the data regarding the specific contribution of name and occupation recall to the supplementary information (Fig. S4) and further specified how we computed the score (lines 102-104).

      “The score was computed by deriving an accuracy percentage index dividing by 12 and multiplying by 100 the correct association sum. The partial recall scores were computed in the same way only considering the sum of face-name (NAME) and face-occupation (OCCUPATION) correctly recollected.”
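The scoring rule quoted above can be written out as a minimal sketch. It assumes 12 faces, each with one name and one occupation (matching the divisor of 12), and it assumes that the partial NAME and OCCUPATION scores count every face for which that attribute was recalled correctly. The response does not fully specify the second point, so it is an interpretation. The function name and the example participant are hypothetical.

```python
# Illustrative sketch of the FNAT accuracy scores; not the authors' code.
N_FACES = 12  # divisor stated in the quoted scoring rule


def fnat_scores(responses):
    """Compute percentage scores from per-face recall outcomes.

    responses: list of (name_correct, occupation_correct) booleans,
    one tuple per face.
    """
    if len(responses) != N_FACES:
        raise ValueError("expected one response per face")
    # Total FNAT: a point only when both name and occupation are correct.
    total = sum(1 for name_ok, occ_ok in responses if name_ok and occ_ok)
    # Partial scores (assumed interpretation): count each attribute on its own.
    name = sum(1 for name_ok, _ in responses if name_ok)
    occupation = sum(1 for _, occ_ok in responses if occ_ok)
    return {
        "total": total / N_FACES * 100,
        "name": name / N_FACES * 100,
        "occupation": occupation / N_FACES * 100,
    }


# Hypothetical participant: 5 full triplets, 2 names only,
# 1 occupation only, 4 faces with nothing recalled.
example = (
    [(True, True)] * 5
    + [(True, False)] * 2
    + [(False, True)] * 1
    + [(False, False)] * 4
)
scores = fnat_scores(example)
```

For this made-up participant the total score is 5/12 of 100, while the NAME and OCCUPATION scores are 7/12 and 6/12 of 100. This shows how the partial scores can sit above the total score.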

      Reviewer #3 (Public review):

      Summary:

      Borghi and colleagues present results from 4 experiments aimed at investigating the effects of dual γtACS and iTBS stimulation of the precuneus on behavioral and neural markers of memory formation. In their first experiment (n = 20), they found that a 3-minute offline (i.e., prior to task completion) stimulation that combines both techniques leads to superior memory recall performance in an associative memory task immediately after learning associations between pictures of faces, names, and occupation, as well as after a 15-minute delay, compared to iTBS alone (+ tACS sham) or no stimulation (sham for both iTBS and tACS). Performance in a second task probing short-term memory was unaffected by the stimulation condition. In a second experiment (n = 10), they show that these effects persist over 24 hours and up to a full week after initial stimulation. A third (n = 14) and fourth (n = 16) experiment were conducted to investigate the neural effects of the stimulation protocol. The authors report that, once again, only combined iTBS and γtACS increase gamma oscillatory activity and neural excitability (as measured by concurrent TMS-EEG) specific to the stimulated area at the precuneus compared to a control region, as well as precuneus-hippocampus functional connectivity (measured by resting-state MRI), which seemed to be associated with structural white matter integrity of the bilateral middle longitudinal fasciculus (measured by DTI).

      Strengths:

      Combining non-invasive brain stimulation techniques is a novel, potentially very powerful method to maximize the effects of these kinds of interventions that are usually well-tolerated and thus accepted by patients and healthy participants. It is also very impressive that the stimulation-induced improvements in memory performance resulted from a short (3 min) intervention protocol. If the effects reported here turn out to be as clinically meaningful and generalizable across populations as implied, this approach could represent a promising avenue for the treatment of impaired memory functions in many conditions.

      Methodologically, this study is expertly done! I don't see any serious issues with the technical setup in any of the experiments (with the only caveat that I am not an expert in fMRI functional connectivity measures and DTI). It is also very commendable that the authors conceptually replicated the behavioral effects of experiment 1 in experiment 2 and then conducted two additional experiments to probe the neural mechanisms associated with these effects. This certainly increases the value of the study and the confidence in the results considerably.

      The authors used a within-subject approach in their experiments, which increases statistical power and allows for stronger inferences about the tested effects. They also individualized stimulation locations and intensities, which should further optimize the signal-to-noise ratio.

      Weaknesses:

      I want to state clearly that I think the strengths of this study far outweigh the concerns I have. I still list some points that I think should be clarified by the authors or taken into account by readers when interpreting the presented findings.

      I think one of the major weaknesses of this study is the overall low sample size in all of the experiments (between n = 10 and n = 20). This is, as I mentioned when discussing the strengths of the study, partly mitigated by the within-subject design and individualized stimulation parameters. The authors mention that they performed a power analysis but this analysis seemed to be based on electrophysiological readouts similar to those obtained in experiment 3. It is thus unclear whether the other experiments were sufficiently powered to reliably detect the behavioral effects of interest. That being said, the authors do report significant effects, so they were per definition powered to find those. However, the effect sizes reported for their main findings are all relatively large and it is known that significant findings from small samples may represent inflated effect sizes, which may hamper the generalizability of the current results. Ideally, the authors would replicate their main findings in a larger sample. Alternatively, I think running a sensitivity analysis to estimate the smallest effect the authors could have detected with a power of 80% could be very informative for readers to contextualize the findings. At the very least, however, I think it would be necessary to address this point as a potential limitation in the discussion of the paper.

      Thank you for the observation. As you mentioned, our power analysis was based on our previous study investigating the same neuromodulation protocol with a corresponding experimental design. The relatively small sample could be considered a possible limitation of the study, which we will add to the discussion. A fundamental future step will be to replicate these results in a larger population. However, to strengthen our results, we performed the sensitivity analysis you suggested.

      In detail, we performed a sensitivity analysis for repeated-measures ANOVA with α=0.05 and power (1-β)=0.80 with no sphericity correction. For experiment 1, a sensitivity analysis with 1 group and 3 measurements showed a minimal detectable effect size of f=0.524 with 20 participants. In our paper, the ANOVA on total FNAT immediate performance revealed an effect size of η<sup>2</sup>=0.274 corresponding to f=0.614; the ANOVA on FNAT delayed performance revealed an effect size of η<sup>2</sup>=0.236 corresponding to f=0.556. For experiment 2, a sensitivity analysis for total FNAT immediate performance (1 group and 3 measurements) showed a minimal detectable effect size of f=0.797 with 10 participants. In our paper, the ANOVA on total FNAT immediate performance revealed an effect size of η<sup>2</sup>=0.448 corresponding to f=0.901. The sensitivity analysis for total FNAT delayed performance (1 group and 6 measurements) showed a minimal detectable effect size of f=0.378 with 10 participants. In our paper, the ANOVA on total FNAT delayed performance revealed an effect size of η<sup>2</sup>=0.484 corresponding to f=0.968. Thus, in every case the observed effect size exceeded the minimal detectable effect size, indicating that both experiments were sufficiently powered to detect the reported effects. We have now added this information to the statistical analysis and results sections of the manuscript (lines 99-100; 127-128; 130-131; 543-545), and we thank the reviewer for the suggestion.

      “The sensitivity analysis showed a minimal detectable effect size of η<sup>2</sup>=0.215 with 20 participants.”

      “The sensitivity analysis showed a minimal detectable effect size of η<sup>2</sup>=0.388 with 10 participants.”

      “The sensitivity analysis showed a minimal detectable effect size of η<sup>2</sup>=0.125 with 10 participants.”

      “Since we do not have an a priori effect size for experiment 1 and 2, we performed a sensitivity power analysis to ensure that these experiments were able to detect the minimum effect size with 80% power and alpha level of 0.05.”
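The correspondence between the η<sup>2</sup> values and the Cohen's f values quoted in this response follows the standard conversion f = sqrt(η² / (1 − η²)) and its inverse η² = f² / (1 + f²). The sketch below only re-derives the numbers reported above. It does not reproduce the sensitivity analysis itself, and the function and dictionary names are hypothetical.

```python
# Illustrative check of the eta-squared <-> Cohen's f conversions.
import math


def eta2_to_f(eta2):
    """Convert (partial) eta squared to Cohen's f."""
    return math.sqrt(eta2 / (1.0 - eta2))


def f_to_eta2(f):
    """Convert Cohen's f back to (partial) eta squared."""
    return f * f / (1.0 + f * f)


# Observed effect sizes reported in the response: eta squared -> f
observed = {0.274: 0.614, 0.236: 0.556, 0.448: 0.901, 0.484: 0.968}

# Minimal detectable effect sizes from the sensitivity analysis: f -> eta squared
detectable = {0.524: 0.215, 0.797: 0.388, 0.378: 0.125}

observed_ok = all(abs(eta2_to_f(e) - f) < 1e-3 for e, f in observed.items())
detectable_ok = all(abs(f_to_eta2(f) - e) < 1e-3 for f, e in detectable.items())
```

Both checks pass to three decimal places, so the η<sup>2</sup> values quoted in the manuscript excerpts are the same thresholds as the f values given in the response, expressed on a different scale.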

      It seems that the statistical analysis approach differed slightly between studies. In experiment 1, the authors followed up significant effects of their ANOVAs by Bonferroni-adjusted post-hoc tests whereas it seems that in experiment 2, those post-hoc tests were "exploratory", which may suggest those were uncorrected. In experiment 3, the authors use one-tailed t-tests to follow up their ANOVAs. Given some of the reported p-values, these choices suggest that some of the comparisons might have failed to reach significance if properly corrected. This is not a critical issue per se, as the important test in all these cases is the initial ANOVA but non-significant (corrected) post-hoc tests might be another indicator of an underpowered experiment. My assumptions here might be wrong, but even then, I would ask the authors to be more transparent about the reasons for their choices or provide additional justification. Finally, the authors sometimes report exact p-values whereas other times they simply say p < .05. I would ask them to be consistent and recommend using exact p-values for every result where p >= .001.

      Thank you again for the suggestions. Your observations are correct: we used slightly different statistical approaches depending on our hypotheses. Here are the details:

      In experiment 1, we used a repeated-measures ANOVA with one factor, “stimulation condition” (iTBS+γtACS; iTBS+sham-tACS; sham-iTBS+sham-tACS). Following the significant effect of this factor, we performed post-hoc analyses with Bonferroni correction.

      In experiment 2, we used a repeated-measures ANOVA with two factors, “stimulation condition” and “time”. As expected, we observed a significant effect of condition, confirming the result of experiment 1, but not of time. This means that the neuromodulatory effect was present regardless of the time point. However, to explore whether the effects of stimulation condition were present at each time point, we performed exploratory t-tests with no correction for multiple comparisons, since this was only an exploratory analysis.

      In experiment 3, we used the same approach as in experiment 1. However, since we had a specific hypothesis on the direction of the effect already observed in our previous study, i.e. an increase in spectral power (Maiella et al., Scientific Reports 2022), our tests were one-tailed.
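To make the difference between these follow-up strategies concrete, the sketch below shows how a Bonferroni adjustment and a one-tailed conversion change a set of pairwise p-values. The p-values are invented for illustration and are not taken from the study. The function names are hypothetical, and the one-tailed conversion assumes a symmetric test statistic such as Student's t.

```python
# Illustrative sketch of the two follow-up choices; not the authors' analysis.

def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def one_tailed_from_two_tailed(p_two, effect_in_predicted_direction):
    """Convert a two-tailed p-value to one-tailed for a symmetric statistic."""
    half = p_two / 2
    return half if effect_in_predicted_direction else 1 - half


# Made-up two-tailed p-values for the three pairwise comparisons
# among three stimulation conditions.
raw = [0.010, 0.020, 0.300]
adjusted = bonferroni(raw)

# A made-up two-tailed p-value whose effect lies in the predicted direction.
one_tailed = one_tailed_from_two_tailed(0.08, True)
```

With three comparisons, the second invented p-value moves from 0.020 to about 0.06 after correction, and a two-tailed 0.08 becomes 0.04 under a directional hypothesis. This is why reporting which procedure was used for each experiment matters for interpreting borderline results.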

      For the p-values, we corrected the manuscript reporting the exact values for every result.

      While the authors went to great lengths trying to probe the neural changes likely associated with the memory improvement after stimulation, it is impossible from their data to causally relate the findings from experiments 3 and 4 to the behavioral effects in experiments 1 and 2. This is acknowledged by the authors and there are good methodological reasons for why TMS-EEG and fMRI had to be collected in separate experiments, but it is still worth pointing out to readers that this limits inferences about how exactly dual iTBS and γtACS of the precuneus modulate learning and memory.

      Thank you for your comment. We fully agree with your observation, which is why this aspect has been considered in the study's limitations. To address your concern, we added this sentence to the discussion of limitations (lines 299-301).

      “Consequently, these findings do not allow precise inferences regarding the specific mechanisms by which dual iTBS and γtACS of the precuneus modulate learning and memory.”

      There were no stimulation-related performance differences in the short-term memory task used in experiments 1 and 2. The authors argue that this demonstrates that the intervention specifically targeted long-term associative memory formation. While this is certainly possible, the STM task was a spatial memory task, whereas the LTM task relied (primarily) on verbal material. It is thus also possible that the stimulation effects were specific to a stimulus domain instead of memory type. In other words, could it be possible that the stimulation might have affected STM performance if the task taxed verbal STM instead? This is of course impossible to know without an additional experiment, but the authors could mention this possibility when discussing their findings regarding the lack of change in the STM task.

      Thank you for your interesting observation. We argue that the intervention primarily targeted long-term associative memory formation, as our findings demonstrated effects only on FNAT. However, as you correctly pointed out, we cannot exclude the possibility that the stimulation may also influence short-term verbal associative memory. We added this aspect to the discussion of the absence of significant findings in the STM task (lines 205-210).

      “Visual short-term associative memory, measured by STBM performance, was not modulated by any experimental condition. Even if we cannot exclude the possibility that the stimulation could have influenced short-term verbal associative memory, we expected this result since short-term associative memory is known to rely on a distinct frontoparietal network while FNAT, used to investigate long-term associative memory, has already been associated with the neural activity of the PC and the hippocampus (Parra et al., 2014; Rentz et al., 2011).”

      While the authors discuss the potential neural mechanisms by which the combined stimulation conditions might have helped memory formation, the psychological processes are somewhat neglected. For example, do the authors think the stimulation primarily improves the encoding of new information or does it also improve consolidation processes? Interestingly, the beneficial effect of dual iTBS and γtACS on recall performance was very stable across all time points tested in experiments 1 and 2, as was the performance in the other conditions. Do the authors have any explanation as to why there seems to be no further forgetting of information over time in either condition when even at immediate recall, accuracy is below 50%? Further, participants started learning the associations of the FNAT immediately after the stimulation protocol was administered. What would happen if learning started with a delay? In other words, do the authors think there is an ideal time window post-stimulation in which memory formation is enhanced? If so, this might limit the usability of this procedure in real-life applications.

      Thank you for your comment and for raising these important points.

      We hypothesized that co-stimulation would enhance encoding processes. Previous studies have shown that co-stimulation can enhance gamma-band oscillations and increase cortical plasticity (Guerra et al., Brain Stimulation 2018; Maiella et al., Scientific Reports 2022). Given that the precuneus (Brodt et al., Science 2018; Schott et al., Human Brain Mapping 2018), gamma oscillations (Osipova et al., Journal of Neuroscience 2006; Deprés et al., Neurobiology of Aging 2017; Griffiths et al., Trends in Neurosciences 2023), and cortical plasticity (Brodt et al., Science 2018) have all been associated with encoding processes, we decided to apply co-stimulation before the encoding phase in order to boost it. We expanded the introduction to specify the link between neural mechanisms and the psychological process of encoding (lines 55-60).

      “In particular, the induction of gamma oscillatory activity has been proposed to play an important role in a type of LTP known as spike timing-dependent plasticity, which depends on a precise temporal delay between the firing of a presynaptic and a postsynaptic neuron (Griffiths and Jensen, 2023). Both LTP and gamma oscillations have a strong link with memory processes such as encoding (Bliss and Collingridge, 1993; Griffiths and Jensen, 2023; Rossi et al., 2001), pointing to rTMS and tACS as good candidates for memory enhancement.”

      We applied the co-stimulation immediately before the learning phase to maximize its potential effects. While we observed a significant increase in gamma oscillatory activity lasting up to 20 minutes, we cannot determine whether the behavioral effects we observed would have been the same with a co-stimulation applied 20 minutes before learning. Based on existing literature, a reduction in the efficacy of co-stimulation over time could be expected (Huang et al., Neuron 2005; Thut et al., Brain Topography 2009). However, we hypothesize that multiple stimulation sessions might provide an additional boost, helping to sustain the effects over time (Thut et al., Brain Topography 2009; Koch et al., Neuroimage 2018; Koch et al., Brain 2022).

      Regarding the absence of further forgetting in both stimulation conditions, we think that the clinical and demographic characteristics of the sample (i.e. young and healthy subjects) explain the near absence of forgetting after one week.

      Reviewer #1 (Recommendations for the authors):

      To address the concerns, the authors should:

      (1) Include invasive neuronal recordings (e.g., in rats or monkeys if not possible in humans) demonstrating that the current stimulation protocol leads to direct changes in brain activity.

      We understand the first reviewer's interest in the neurophysiological correlates of the stimulation protocol; however, we are skeptical about this request, as we think it goes beyond the aims of the study. As already mentioned in the response to the reviewer, invasive neurophysiological recordings in humans for this type of study are not feasible due to ethical constraints. At the same time, studies on cadavers or rodents would not fully resolve the question. Indeed, the authors of the study cited by the reviewer (Mihály Vöröslakos et al., Nature Communications, 2018) highlight the impossibility of drawing definitive conclusions about the exact voltage required in the in-vivo human brain, due to significant differences between rats and humans as well as between the in-vivo human brain and cadavers, in which electrical conductivity is altered in postmortem tissue. Huang and colleagues addressed the difficulties in obtaining direct evidence of non-invasive brain stimulation (NIBS) effects in a review published in Clinical Neurophysiology in 2017. They conclude that the use of EEG to assess the brain response to TMS has great potential for a less indirect demonstration of plasticity mechanisms induced by NIBS in humans.

      It is exactly to meet the need to investigate the changes in brain activity after the stimulation protocol that we conducted Experiments 3 and 4. These experiments respectively examined the neurophysiological and connectivity changes induced by the stimulation in a non-invasive manner using TMS-EEG and fMRI. The observed changes in brain oscillatory activity (increased gamma oscillatory activity), cortical excitability (enhanced posteromedial parietal cortex reactivity), and brain connectivity (strengthened connections between the precuneus and hippocampi) provided evidence of the effects of our non-invasive brain stimulation protocol, further supporting the behavioral data.

      Additionally, we carefully considered the issue of stimulation distribution and, in response, performed a biophysical modeling analysis and E-field calculation using the parameters employed in our study (see Supplementary Materials).

      Acknowledging the reviewer's point of view, we modified the manuscript accordingly, discussing this aspect both as a technical limitation and as a potential direction for future research (main text, lines 280-289).

      “Although we studied TMS and tACS propagation through the E-field modeling and observed an increase in the precuneus gamma oscillatory activity, excitability and connectivity with the hippocampi, we cannot exclude that our results might reflect the consequences of stimulating more superficial parietal regions other than the precuneus nor report direct evidence of microscopic changes in the brain after the stimulation. Invasive neurophysiological recordings in humans for this type of study are not feasible due to ethical constraints. Studies on cadavers or rodents would not fully resolve our question due to significant differences between them (i.e. rodents do not have an anatomical correspondence while cadavers have alterations in electrical conductivity occurring in postmortem tissue). However, further exploration of this aspect in future studies would help in the understanding of γtACS+iTBS effects.”

      (2) Address all the technical questions about the experimental design.

      We addressed all the technical questions about the experimental design.

      (3) Repeat the experiments with randomized trial order and without a block design.

      The experiments were conducted with randomized trial order and we did not use a block design.

      (4) Add many more faces to the study. It is extremely difficult to draw any conclusion from merely 12 faces. Ideally, there would be lots of other relevant memory experiments where the authors show compelling positive results.

      We understand your perplexity about drawing conclusions from 12 faces; however, this is not the case. As we explained in the response to the reviewer, the task we implemented did not rely on the recall of merely 12 faces. Instead, participants had to correctly learn, associate and recall 12 faces, 12 names and 12 occupations for a total of 36 items. To improve the clarity of the manuscript, we added a paragraph to make this aspect more explicit (lines 425-430).

      “We considered a correct association when a subject was able to recall all the information for each item (i.e. face, name and occupation), resulting in a total of 36 items to learn and associate. To further investigate the effect on FNAT we also computed a partial recall score accounting for those items where subjects correctly matched only names with faces (FNAT NAME) and only occupations with faces (FNAT OCCUPATION). See supplementary information for score details.”

      The behavioral changes we observed are similar to those that are typically observed after multiple stimulation sessions (Koch et al., NeuroImage, 2018; Grover et al., Nature Neuroscience, 2022; Benussi et al., Annals of Neurology, 2022). Moreover, memory performance changes are often measured with a limited set of stimuli due to methodological constraints related to memory capacity. For example, the Rey Auditory Verbal Learning Test, which requires participants to learn and recall 15 words, is a typical test used to detect memory changes (Koch et al., NeuroImage, 2018; Benussi et al., Brain Stimulation, 2021; Benussi et al., Annals of Neurology, 2022).

      (5) Provide a clear explanation of the apparent randomness of which results are statistically significant or not in Figure 3. But perhaps with many more experiments, a lot more memory evaluations, many more stimuli, and addressing all the other technical concerns, either the results will disappear or there will be a more interpretable pattern of results.

      We provided explanations for all the concerns shown by the reviewer.

      Reviewer #2 (Recommendations for the authors):

      Minor comments:

      (1) Figure 4: Why are connectivity values pre-stimulation for the iTBS and sham tACS stimulation condition so much higher than the dual stimulation? We would expect baseline values to be more similar.

      We acknowledge that the pre-stimulation connectivity values for the iTBS and sham tACS conditions appear higher than those for the dual stimulation condition. However, as noted in our statistical analyses, there were no significant differences at baseline between conditions (p-FDR= 0.3514), suggesting that any apparent discrepancy is due to natural variability rather than systematic bias. One potential explanation for these differences is individual variability in baseline connectivity measures, which can fluctuate due to factors such as intrinsic neural dynamics, participant state, or measurement noise. Despite these variations, our statistical approach ensures that any observed post-stimulation effects are not confounded by pre-existing differences.

      (2) Figure 2: How are total association scores significantly different between stimulation conditions, but individual name and occupation associations are not? Further clarification of how the total FNAT score is calculated would be helpful.

      We apologize for any lack of clarity. The total FNAT score reflects the ability to correctly recall all the information associated with a person—specifically, the correct pairing of the face, name, and occupation. Participants received one point for each triplet they accurately recalled. The scores were then converted into percentages, as detailed in the Face-Name Associative Task Construction and Scoring section in the supplementary materials.

      Total FNAT was the primary outcome measure. However, we also analyzed name and occupation recall separately to better understand their partial contributions. Our analysis revealed that the improvement in total FNAT was primarily driven by an increase in name recall rather than occupation recall.

      We acknowledge that this distinction may have caused some confusion. To improve clarity, we revised the manuscript accordingly (lines 97-98; 107-111; 425-430).

      “Dual iTBS+γtACS increased the performances in recalling the association between face, name and occupation (FNAT accuracy) both for the immediate (F<sub>2,38</sub>=7.18; p=0.002; η<sup>2</sup><sub>p</sub>=0.274) and the delayed (F<sub>2,38</sub>=5.86; p =0.006; η<sup>2</sup><sub>p</sub>=0.236) recall performances (Fig. 2, panel A).”

      “The in-depth analysis of the FNAT accuracy investigating the specific contribution of face-name and face-occupation recall revealed that dual iTBS+γtACS increased the performances in the association between face and name (FNAT NAME) delayed recall (F<sub>2,38</sub> =3.46; p =0.042; η<sup>2</sup>p =0.154; iTBS+γtACS vs. sham-iTBS+sham-tACS: 42.9±21.5 % vs. 33.8±19 %; p=0.048 Bonferroni corrected) (Fig. S4, supplementary information).”

      “We considered a correct association when a subject was able to recall all the information for each item (i.e. face, name and occupation), resulting in a total of 36 items to learn and associate. To further investigate the effect on FNAT we also computed a partial recall score accounting for those items where subjects correctly matched only names with faces (FNAT NAME) and only occupations with faces (FNAT OCCUPATION). See supplementary information for score details.”

      We also moved the data regarding the specific contribution of name and occupation recall to the supplementary information (Fig. S4) and further specified how we computed the score (lines 102-104).

      “The score was computed by deriving an accuracy percentage index dividing by 12 and multiplying by 100 the correct association sum. The partial recall scores were computed in the same way only considering the sum of face-name (NAME) and face-occupation (OCCUPATION) correctly recollected.”
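
      The scoring rule quoted above can be illustrated with a minimal sketch. It assumes that each of the 12 faces is coded as a pair of booleans (name correct, occupation correct) and that the partial scores count every correct face-name or face-occupation match regardless of the other field; the function name, the input layout and that reading of the partial scores are illustrative assumptions, not the authors' scoring script.

```python
from typing import Dict, Iterable, Tuple

N_ITEMS = 12  # number of face-name-occupation triplets in the task


def fnat_scores(recall: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Return FNAT accuracy percentages (hypothetical helper).

    Each entry of `recall` is (name_correct, occupation_correct) for one face.
    """
    items = list(recall)
    if len(items) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} items, got {len(items)}")

    # Total score: one point only when the whole triplet is recalled.
    total = sum(1 for name_ok, occ_ok in items if name_ok and occ_ok)
    # Partial scores (assumed reading): count each correct pairing on its own.
    name = sum(1 for name_ok, _ in items if name_ok)
    occupation = sum(1 for _, occ_ok in items if occ_ok)

    def to_percent(points: int) -> float:
        # "dividing by 12 and multiplying by 100 the correct association sum"
        return points / N_ITEMS * 100

    return {
        "total": to_percent(total),
        "name": to_percent(name),
        "occupation": to_percent(occupation),
    }


if __name__ == "__main__":
    example = [(True, True)] * 6 + [(True, False)] * 3 + [(False, False)] * 3
    print(fnat_scores(example))
```

      Under this reading the total score can never exceed either partial score, so a condition effect on the total score has to be carried by at least one of the two partial scores.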

      Reviewer #3 (Recommendations for the authors):

      A very small detail, in the caption for Figure 2A, OCCUPATION is described as being shown on the 'left' but it should be 'right'.

      We corrected this error.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Phytopathogens, including fungal pathogens such as F. graminearum, remain a major threat to agriculture and food security. Several agriculturally relevant fungicides, including the potent Quinofumelin, have been discovered to date, yet the mechanisms of their action and specific targets within the cell remain unclear. This paper sets out to contribute to addressing these outstanding questions.

      We appreciate the reviewer's accurate summary of our manuscript.

      Strengths:

      The paper is generally well-written and provides convincing data to support their claims for the impact of Quinofumelin on fungal growth, the target of the drug, and the potential mechanism. Critically the authors identify an important pyrimidine pathway dihydroorotate dehydrogenase (DHODH) gene FgDHODHII in the pathway or mechanism of the drug from the prominent plant pathogen F. graminearum, confirming it as the target for Quinofumelin. The evidence is supported by transcriptomic, metabolomic as well as MST, SPR, molecular docking/structural biology analyses.

      We appreciate the reviewer's recognition of the strengths of our manuscript.

      Weaknesses:

      Whilst the study adds to our knowledge about this drug, it is worth stating that a previous report (although in a different organism) by Higashimura et al., 2022 https://pmc.ncbi.nlm.nih.gov/articles/PMC9716045/ had already identified DHODH as the target for Quinofumelin, and hence this knowledge is not new. The authors may therefore want to tone down the claim that they discovered this mechanism and give sufficient credit to the previous authors' work at the start of the write-up, in the introduction section, rather than in passing as they did with reference 25. Other specific recommendations to improve the text are provided in the recommendations for authors section below.

      We appreciate the reviewer's suggestion. In the revised manuscript, we have incorporated the reference in the introduction section and expanded the discussion of previous work on quinofumelin by Higashimura et al., 2022 in the discussion section to more effectively contextualize their contributions. Moreover, we have made revisions and provided responses in accordance with the recommendations.

      Reviewer #2 (Public review):

      Summary:

      In the current study, the authors aim to identify the mode of action/molecular mechanism of a characterized fungicide, quinofumelin, and its biological impact on transcriptomics and metabolomics in Fusarium graminearum and other Fusarium species. Two sets of data were generated comparing quinofumelin treatment with a no-treatment group, and differentially abundant transcripts and metabolites were identified. The authors further focused on the uridine/uracil biosynthesis pathway, considering the significant up- and down-regulation observed in final metabolites and some of the genes in the pathway. Using a deletion mutant of one of the genes and in vitro biochemical assays, the authors concluded that quinofumelin binds to the dihydroorotate dehydrogenase.

      We appreciate the reviewer's accurate summary of our manuscript.

      Strengths:

      Omics datasets were leveraged to understand the physiological impact of quinofumelin, showing the intracellular impact of the fungicide. The characterization of FgDHODHII deletion strains with supplemented metabolites clearly showed the impact of the enzyme on fungal growth.

      We appreciate the reviewer's recognition of the strengths of our manuscript.

      Weaknesses:

      Some interpretation of results is not accurate and some experiments lack controls. The comparison between quinofumelin-treated deletion strains, in the presence of different metabolites didn't suggest the fungicide is FgDHODHII specific. A wild type is required in this experiment.

      Potential Impact: Confirming the target of quinofumelin may help understand its resistance mechanism and support the further development of other inhibitory molecules against the target.

      The manuscript would benefit more in explaining the study rationale if more background on previous characterization of this fungicide on Fusarium is given.

      We appreciate the reviewer's suggestion. Without quinofumelin treatment, mycelial growth remains normal and does not require restoration. In the presence of quinofumelin, supplementation with downstream metabolites of the de novo pyrimidine biosynthesis pathway can restore the mycelial growth that is inhibited by quinofumelin. The wild-type control group is illustrated in Figure 4. Figure 5b depicts the phenotypes of the deletion mutants. With respect to the relationship among quinofumelin, FgDHODHII, and other metabolites, quinofumelin specifically targets the key enzyme FgDHODHII in the de novo pyrimidine biosynthesis pathway, disrupting the conversion of dihydroorotate to orotate, which consequently inhibits the synthesis of downstream metabolites including uracil. In our previous study, quinofumelin not only exhibited excellent antifungal activity against the mycelial growth and spore germination of F. graminearum, but also inhibited the biosynthesis of deoxynivalenol (DON). We have added this part to the introduction section.

      Reviewer #3 (Public review):

      Summary:

      The manuscript shows the mechanism of action of quinofumelin, a novel fungicide, against the fungus Fusarium graminearum. Through omics analysis, phenotypic analysis, and in silico approaches, the role of quinofumelin in targeting DHODH is uncovered.

      We appreciate the reviewer's accurate summary of our manuscript.

      Strengths:

      The phenotypic analysis and mutant generation are nice data and add to the role of metabolites in bypassing pyrimidine biosynthesis.

      We appreciate the reviewer's recognition of the strengths of our manuscript.

      Weaknesses:

      The role of DHODH in this class of fungicides has been known and this data does not add any further significance to the field. The work of Higashimura et al is not appreciated well enough as they already showed the role of quinofumelin upon DHODH II.

      There is no mention of the other fungicide within this class ipflufenoquin, as there is ample data on this molecule.

      We sincerely appreciate the reviewer's insightful comment regarding the work of Higashimura et al. We agree that their investigation into the role of quinofumelin in DHODH II inhibition provides critical foundational insights for this field. In the revised manuscript, we have incorporated the reference in the introduction section and expanded the discussion of their work in the discussion section to more effectively contextualize their contributions. Information regarding the mechanism of action of ipflufenoquin against filamentous fungi was added to the discussion section.

      Reviewer #1 (Recommendations for the authors):

      (1) Given that the DHODH gene had been identified as a target earlier, could the authors perform blast experiments with this gene instead and let us know the percentage similarity between the FgDHODHII gene and the Pyricularia oryzae class II DHODH gene in the report by Higashimura et al., 2022.

      A BLAST search revealed that the percentage similarity between the FgDHODHII gene and the class II DHODH gene of P. oryzae was 55.41%. We have added the description ‘Additionally, the amino acid sequence of the FgDHODHII exhibits 55.41% similarity to that of DHODHII from Pyricularia oryzae, as previously reported (Higashimura et al., 2022)’ to the Results section.

      (2) Abstract:

      The authors started abbreviating new terms e.g. DEG, DMP, etc but then all of a sudden stopped and introduced UMP with no full meaning of the abbreviation. Please give the full meaning of all abbreviations in the text, UMP, STC, RM, etc.

      We have provided the full meaning for all abbreviations as requested.

      (3) Introduction section:

      The introduction talks very little about the work of other groups on quinofumelin. Perhaps add this information and reference it, including the work of Higashimura et al., 2022, which has done quite significant work on this topic but is not even mentioned in the background.

      We have added the work of other groups on quinofumelin to the Introduction section.

      (4) General statements:

      Please show a model of the pyrimidine pathway that quinofumelin attacks to make it easier for the reader to understand the context. They could just copy this from KEGG.

      We have added the model (Fig. 7).

      (5) Line 186:

      The authors did a great job of demonstrating interactions with the Quinofumelin and went to lengths to perform MST, SPR, molecular docking, and structural biology analyses yet in the end provide no details about the specific amino acid residues involved in the interaction. I would suggest that site-directed mutagenesis studies be performed on FgDHODHII to identify specific amino acid residues that interact with Quinofumelin and show that their disruption weakens Quinofumelin interaction with FgDHODHII.

      Thank you for this insightful suggestion. We fully agree with the importance of elucidating the interaction mechanism. At present, we are conducting site-directed mutagenesis studies based on interaction sites from docking results and the mutation sites of FgDHODHII from the resistant mutants; however, due to the limitations in the accuracy of existing predictive models, this work remains ongoing. Additionally, we are undertaking co-crystallization experiments of FgDHODHII with quinofumelin to directly and precisely reveal their interaction pattern.

      (6) Line 76:

      What is the reference or evidence for the statement 'In addition, quinofumelin exhibits no cross-resistance to currently extensively used fungicides, indicating its unique action target against phytopathogenic fungi.'?

      If two fungicides share the same mechanism of action, they will exhibit cross resistance. Previous studies have demonstrated that quinofumelin retains effective antifungal activity against fungal strains resistant to commercial fungicides, indicating that quinofumelin does not exhibit cross-resistance with other commercially available fungicides and possesses a novel mechanism of action. Additionally, we have added the relevant inference.

      (7) Line 80-82:

      Again, considering the work of previous authors, this target is not newly discovered. Please consider toning down this statement 'This newly discovered selective target for antimicrobial agents provides a valuable resource for the design and development of targeted pesticides.'

      We have rewritten this sentence.

      (8) Line 138: If the authors have identified DHODH in experimental groups (I assume in F. graminearum), what was the exact locus tag or gene name in F. graminearum? Why not simply continue with the gene you identified? What is the point of running BLAST again to find the DHODH gene if it already came up in your transcriptomic or metabolic studies? This unfortunately doesn't make sense but could be explained better.

      The information of FgDHODHII (gene ID: FGSG_09678) has been added. We have revised this part.

      Reviewer #2 (Recommendations for the authors):

      (1) Line 40:

      Please add a reference.

      We have added the reference.

      (2) Line 47:

      Please add a reference.

      We have added the reference.

      (3) Line 50:

      The lack of target diversity in existing fungicides doesn't necessarily serve as a reason for discovering new targets being more challenging than identifying new fungicides within existing categories, please consider adjusting the argument here. Instead, the authors can consider reasons for the lack of new targets in the field.

      We have revised the description.

      (4) Line 63:

      Please cite your source with the new technology.

      We have added the reference.

      (5) Line 68:

      What are you referring to for "targeted medicine", do you have a reference?

      We have revised the description and the reference.

      (6) Line 74:

      One of the papers referred to "quinoxyfen", what are the similarities and differences between the two? Please elaborate for the readership.

      Quinoxyfen, similar to quinofumelin, contains a quinoline ring structure. It inhibits mycelial growth by disrupting the MAP kinase signaling pathway in fungi (https://www.frac.info). In addition, quinoxyfen still exhibits excellent antifungal activity against the quinofumelin-resistant mutants (findings from our group), indicating that the mechanisms of action of quinofumelin and quinoxyfen differ.

      (7) Line 84:

      Please introduce why RNA-Seq was designed in the study first. What were the groups compared? How was the experiment set up? Without this background, it is hard to know why and how you did the experiment.

      According to your suggestions, we have added the description to the Results section. In addition, the experimental process was described in the Materials and methods section as follows: A total of 20 mL of YEPD medium containing 1 mL of conidia suspension (1×10<sup>5</sup> conidia/mL) was incubated with shaking (175 rpm) at 25°C. After 24 h, quinofumelin was added to the medium at a concentration of 1 μg/mL, while an equal amount of dimethyl sulfoxide was added as the control (CK). The incubation continued for another 48 h, followed by filtration and collection of hyphae. Gene expression was then quantified, and differences between groups were analyzed with DESeq2 on the quantified expression.

      (8) Figures:

      The figure labeling is missing (Figures 1, 2, 3, etc.). Please re-order your figures to match the text.

      The figures have been inserted.

      (9) Line 97:

      "Volcano plot" is a common plot to visualize DEGs, you can directly refer to the name.

      We have revised the description.

      (10) Figure 1d, 1e:

      Can you separate down- and up-regulated genes here? Does the count refer to gene number?

      The expression information for down- and up-regulated genes is presented in Figure 1a and 1b. However, these bubble plots do not distinguish down- and up-regulated genes. Instead, they only display the significant enrichment of differentially expressed genes in specific metabolic pathways. To more clearly represent the data, we have added the detailed counts of down- and up-regulated genes for each metabolic pathway in Supplementary Table S1 and S2. Here, the term "count" refers to differentially expressed genes that fall within a certain pathway.

      (11) Line 111:

      Again, no reasoning or description of why and how the experiment was done here.

      Based on the results of KEGG enrichment analysis, DEMs are associated with pathways such as thiamine metabolism, tryptophan metabolism, nitrogen metabolism, amino acid sugar and nucleotide sugar metabolism, pantothenic acid and CoA biosynthesis, and nucleotide sugar production compounds synthesis. To specifically investigate the metabolic pathways involved in the mechanism of action of quinofumelin, we performed further metabolomic experiments. We have added this description according to the reviewer’s suggestions.

      (12) Figure 2a:

      It seems many more metabolites were reduced than increased. Is this expected? Due to the antifungal activity of this compound, how sick is the fungus upon treatment? A physiological study on F. graminearum (in a dose-dependent manner) should be done prior to the omics study. Why do you think there's a stark difference between positive and negative modes in terms of number of metabolites down- and up-regulated?

      Quinofumelin demonstrates exceptional antifungal activity against Fusarium graminearum. The results indicate that the number of reduced metabolites significantly exceeds the number of increased metabolites upon quinofumelin treatment. Mycelial growth is markedly inhibited under quinofumelin exposure. Prior to conducting omics studies, we performed a series of physiological and biochemical experiments (refer to Qian Xiu's dissertation https://paper.njau.edu.cn/openfile?dbid=72&objid=50_49_57_56_49_49&flag=free). Upon quinofumelin treatment, the number of down-regulated metabolites notably surpasses that of up-regulated metabolites compared to the control group. Based on the findings from the down-regulated metabolites, we conducted experiments by exogenously supplementing these metabolites under quinofumelin treatment to investigate whether mycelial growth could be restored. The results revealed that only the exogenous addition of uracil can restore mycelial growth impaired by quinofumelin.

      Quinofumelin exhibits excellent antifungal activity against F. graminearum. At a concentration of 1 μg/mL, quinofumelin inhibits mycelial growth by up to 90%. This inhibitory effect indicates that the physiological processes of F. graminearum are significantly disrupted by quinofumelin. Consequently, there is a marked difference in down- and up-regulated metabolites between the quinofumelin-treated group and the untreated control group. The detailed results are presented in Figures 1 and 2.

      (13) Figure 2e:

      This is a good analysis. To help represent the data more clearly, the authors can consider representing the expression using fold change with a p-value for each gene.

      To more clearly represent the data, we have incorporated the information on significant differences in metabolites in the de novo pyrimidine biosynthesis pathway, as affected by quinofumelin, in accordance with the reviewer’s suggestions.

      (14) Line 142:

      Please indicate fold change and p-value for statistical significance. Did you validate this by RT-qPCR?

      We validated the expression level of the DHODH gene under quinofumelin treatment using RT-qPCR. The results indicated that, upon treatment with the EC50 and EC90 concentrations of quinofumelin, the expression of the DHODH gene was significantly reduced by 11.91% and 33.77%, respectively (P<0.05). The corresponding results have been shown in Figure S4.
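
      Since the reviewer asked for fold changes, the reported percentage reductions can be restated on that scale. The sketch below assumes that "reduced by X%" means the treated expression is (1 − X/100) of the control level; the function name is hypothetical and the calculation is a unit conversion of the figures above, not a re-analysis of the RT-qPCR data.

```python
import math
from typing import Tuple


def reduction_to_fold_change(percent_reduction: float) -> Tuple[float, float]:
    """Convert a percentage reduction into (relative expression, log2 fold change).

    Assumes the reduction is expressed relative to the untreated control.
    """
    relative = 1.0 - percent_reduction / 100.0
    if relative <= 0.0:
        raise ValueError("a reduction of 100% or more has no finite log2 fold change")
    return relative, math.log2(relative)


if __name__ == "__main__":
    # Reductions reported for the EC50 and EC90 treatments in the response above.
    for label, reduction in (("EC50", 11.91), ("EC90", 33.77)):
        relative, log2_fc = reduction_to_fold_change(reduction)
        print(f"{label}: relative expression {relative:.4f}, log2FC {log2_fc:.3f}")
```

      On this scale both changes are smaller than a two-fold decrease (log2 fold change of −1), which is worth keeping in mind when comparing them with thresholds commonly used to call differentially expressed genes.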

      (15) Line 145:

      It looks like uracil is the only metabolite differentially abundant in the samples - how did you conclude this whole pathway was impacted by the treatment?

      The experiments involving the exogenous supplementation of uracil revealed that the addition of uracil could restore mycelial growth inhibited by quinofumelin. Consequently, we infer that quinofumelin disrupts the de novo pyrimidine biosynthesis pathway. In addition, as uracil is the end product of the de novo pyrimidine biosynthesis pathway, the disruption of this pathway results in a reduction in uracil levels.

      (16) Figure 3:

      What sequence was used as the root of the tree? Why were the species chosen? Since the BLAST query was Homo sapiens sequence, would it be good to use that as the root?

      The FgDHODHII sequence was used as the root of the tree. The selected fungal species represent significant plant-pathogenic fungi in agricultural production. According to your suggestion, we have removed the BLAST query sequence of Homo sapiens from Figure 3.

      (17) Figure 4:

      How were the concentrations used to test chosen?

      Prior to this experiment, we carried out concentration-dependent exogenous supplementation experiments. The results indicated that 50 μg/mL of uracil can fully restore mycelial growth inhibited by quinofumelin. Consequently, we chose 50 μg/mL as the testing concentration.

      (18) Line 164:

      Why do you hypothesize supplementing dihydroorotate would restore resistance? The metabolite seemed accumulated in the treatment condition, whereas downstream metabolites were comparable or even depleted. The DHODH gene expression was suppressed. Would accumulation of dihydroorotate be associated with growth inhibition by quinofumelin? Please include the hypothesis and rationale for the experimental setup.

      DHODH catalyzes the conversion of dihydroorotate to orotate in the de novo pyrimidine biosynthesis pathway. The inhibition of DHODH by quinofumelin results in the accumulation of dihydroorotate and the depletion of the downstream metabolites, including UMP, uridine and uracil. Consequently, downstream metabolites were considered as positive controls, while the upstream metabolite dihydroorotate served as a negative control. This design further demonstrates that DHODH is the target of quinofumelin in F. graminearum. In addition, the accumulation of dihydroorotate is not associated with growth inhibition by quinofumelin; rather, the depletion of downstream metabolites in the de novo pyrimidine biosynthesis pathway is closely associated with growth inhibition by quinofumelin.

      (19) Line 168:

      I'm not sure if this conclusion is valid from your results in Figure 4 showing which metabolites restore growth.

      To minimize the potential influence of strain-specific effects, five strains were tested in the experiments shown in Figure 4. For each strain, the first row (first column) corresponds to the control condition, while the second row (first column) represents treatment with 1 μg/mL of quinofumelin, which completely inhibits mycelial growth. The second row (second column) for each strain shows that supplementation with 50 μg/mL of dihydroorotate fails to restore mycelial growth inhibited by quinofumelin. In contrast, the second row (third, fourth and fifth columns) for each strain demonstrates that supplementation with 50 μg/mL of UMP, uridine and uracil, respectively, can effectively restore mycelial growth inhibited by quinofumelin.

      (20) Figure 5a:

      The fact you saw growth of the deletion mutant means it's not lethal. However, the growth was severely inhibited.

      Our experimental results indicate that the deletion is lethal. The mycelial growth observed originates from mycelial plugs that were not exposed to quinofumelin, rather than from the plates amended with quinofumelin.

      (21) Figure 5b:

      Would you expect different restoration of growth in the presence of quinofumelin vs. no treatment? The wild type control is missing here. Any conclusions about the relationship between quinofumelin, FgDHODHII, and other metabolites in the pathway?

      Without quinofumelin treatment, mycelial growth remains normal and does not require restoration. In the presence of quinofumelin, supplementation with downstream metabolites of the de novo pyrimidine biosynthesis pathway can restore the mycelial growth that is inhibited by quinofumelin. The wild-type control group is illustrated in Figure 4. Figure 5b depicts the phenotypes of the deletion mutants. With respect to the relationship among quinofumelin, FgDHODHII, and other metabolites, quinofumelin specifically targets the key enzyme FgDHODHII in the de novo pyrimidine biosynthesis pathway, disrupting the conversion of dihydroorotate to orotate, which consequently inhibits the synthesis of downstream metabolites including uracil.

      (22) Figure 6b:

      Lacking positive and negative controls (known binder and non-binder). What does the Kd (in comparison to other interactions) indicate in terms of binding strength?

      We tested the antifungal activities of publicly reported DHODH inhibitors (such as leflunomide and teriflunomide) against F. graminearum. These inhibitors exhibited no significant inhibitory effects against strain PH-1. Therefore, we lacked an effective chemical for use as a positive control in subsequent experiments. Biacore experiments offer detailed insights into the molecular interactions between quinofumelin and DHODHII. As shown in Figure 6b, the left panel illustrates the time-dependent kinetic curve of quinofumelin binding to DHODHII. Within the first 60 s after quinofumelin was introduced onto the DHODHII surface, it bound to the DHODHII immobilized on the chip surface, with the response value increasing proportionally to the quinofumelin concentration. Following cessation of the injection at 60 s, quinofumelin spontaneously dissociated from the DHODHII surface, leading to a corresponding decrease in the response value. The fitted curve presented in the right panel indicates that the affinity constant KD of quinofumelin for DHODHII is 6.606×10<sup>-6</sup> M, which falls within the typical range of KD values (10<sup>-3</sup> to 10<sup>-6</sup> M) for protein-small molecule interactions. A lower KD value indicates a stronger affinity; thus, quinofumelin exhibits strong binding affinity towards DHODHII.

      Reviewer #3 (Recommendations for the authors):

      The authors should add information about the other molecule within this class, ipflufenoquin, and what is known about it. There are already published data on its mode of action on DHODH and the role of pyrimidine biosynthesis.

      We have added information on the mechanism of action of ipflufenoquin against filamentous fungi to the Discussion section.

      The work of Higashimura et al is not appreciated well enough as they already showed the role of quinofumelin upon DHODH II.

      We sincerely appreciate the reviewer's insightful comment regarding the work of Higashimura et al. We agree that their investigation into the role of quinofumelin in DHODH II inhibition provides critical foundational insights for this field. In the revised manuscript, we have incorporated the reference in the introduction section and expanded the discussion of their work in the discussion section to more effectively contextualize their contributions.

      It is unclear how the protein model was established and this should be included. What species is the molecule from and how was it obtained? How are they different from Fusarium?

      The three-dimensional structural model of F. graminearum DHODHII protein, as predicted by AlphaFold, was obtained from the UniProt database. Additionally, a detailed description along with appropriate citations has been incorporated in the ‘Manuscript’ file.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      We thank the reviewer for the positive feedback on the work. The reviewer has raised two weaknesses and in the following we discuss how those can be addressed.  

      Weaknesses:

      The impact of the article is limited by using a network with discrete time steps, and only a small number of time steps from stimulus to reward. They assume that each time step is on the order of hundreds of ms. They justify this by pointing to some slow intrinsic mechanisms, but they do not implement these slow mechanisms in a network with short time steps; instead, they assume without demonstration that these could work as suggested. This is a reasonable first approximation, but its validity should be explicitly tested.

      Our goal here was to give a proof of concept that online random feedback is sufficient to train an RNN to estimate value. Indeed, it is important to show that the idea works in a model where the slow mechanisms are explicitly implemented. However, this is a non-trivial task that should be addressed in future work.

      As the delay between cue and reward increases, the performance decreases. This is not surprising given the proposed mechanism, but is still a limitation, especially given that we do not really know what a reasonable value for a single time step is.

      In reply to this comment and the other reviewer's related comment, we have conducted two sets of additional simulations, one examining the incorporation of eligibility traces, and the other considering (though not mechanistically implementing) behavioral time-scale synaptic plasticity (BTSP). We have added the results to the revised manuscript as an Appendix. We think that these results address this point to some extent, although how longer cue-reward delays can be learnt by elaborating the model remains an issue for future work.

      Reviewer #2 (Public Review):

      We thank the reviewer for the positive feedback on the work. The reviewer gave comments on our revisions, and here we discuss how those can be addressed.

      Comments on revisions: I would still want to see how well the network learns tasks with longer time delays (on the order of 100 or even 1000 timesteps). Previous work has shown that random feedback struggles to encode longer timescales (see Murray 2019, Figure 2), so I would be interested to see how that translates to the RL context in your model.

      We would like to note that in Murray 2019 the random feedback per se appeared not to be primarily responsible for the difficulty in encoding longer timescales. In Figure 2d (Murray 2019), the author compared his RFLO (random feedback local online) and BPTT with two intermediate algorithms, each of which incorporated one of the two approximations made in RFLO: i) random feedback instead of symmetric feedback, and ii) omission of the non-local effect (i.e., the dependence of the derivative of the loss with respect to a given weight on the other weights). The performance difference between RFLO and BPTT was actually mostly explained by ii), as the author mentioned: "The results show that the local approximation is essentially fully responsible for the performance difference between RFLO and BPTT, while there is no significant loss in performance due to the random feedback alone. (Line 6-8, page 7 of Murray, 2019, eLife)".

      Meanwhile, in our settings, a performance difference between the model with random feedback and the model with symmetric feedback already appeared in the cases with 6 time steps or fewer (the biologically constrained model with random feedback performed worse: Fig. 6J, left).

      In practice, our model, with either random or symmetric feedback, would not be able to learn the cases with very long delays. This is indeed a limitation of our model. However, our model is critically different from the model of Murray 2019 in that we use RL rather than supervised learning, and we use a scalar bootstrapped (TD) reward-prediction error rather than the true output error. We think that these differences may be major reasons for the limited learning ability of our model.
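
      For readers less familiar with the term, the scalar bootstrapped (TD) reward-prediction error mentioned above can be sketched in its standard textbook form. This is only an illustration, not the code used in the manuscript; the function name `td_error` and the discount factor `gamma=0.9` are assumptions made for the example.

```python
def td_error(reward, value_now, value_next, gamma=0.9):
    """Standard one-step TD error: delta = r + gamma * V(next) - V(now).

    `gamma` is a hypothetical discount factor chosen for illustration.
    The error is "bootstrapped" because it relies on the model's own
    estimate V(next) instead of a ground-truth target.
    """
    return reward + gamma * value_next - value_now


# Example: reward arrives at the final step of a trial, after which the
# value of the (terminal) next state is taken to be zero.
delta_at_reward = td_error(reward=1.0, value_now=0.5, value_next=0.0)

# Example: no reward yet, but the next state is predicted to be valuable.
delta_before_reward = td_error(reward=0.0, value_now=0.0, value_next=1.0, gamma=0.5)
```

      Every unit receives this same scalar signal, which carries less information than the per-output error vector available in supervised learning.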

      Regarding the feasibility of the model when tasks involve longer time delays: this is indeed a problem, and the other reviewers have raised the same point. Our model can be extended by incorporating either a kind of eligibility trace (similar to those contained in RFLO and e-prop) or behavioral time-scale synaptic plasticity (BTSP), and we have added the results of simulations incorporating each to the revised manuscript as an Appendix. However, how longer cue-reward delays can be learnt by elaborating the model remains an issue for future work.

      Reviewer #3 (Public Review):

      Comments on revisions: Thank you for addressing all my comments in your reply.

      We are happy to learn that all concerns raised by the reviewer in the previous round were addressed adequately. We agree with the reviewer that there are several ways the work can be improved.

      The various points that the reviewers raised as weaknesses should be taken up in future work.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations for the authors):

      Suggestions:

      Although this study has an impressive dataset, I felt that some parts of the discussion would benefit from further explanation, specifically when discussing the differences in female aggression direction between groups with different sex compositions. In the discussion, it is suggested that males buffer female-on-female aggression and that they 'support' lower-ranking females (see line 212); however, the study only tested the sex composition of the group and does not provide any evidence of this buffering. Thus, I would suggest adding more information on how this buffering or protection from males might manifest (for example, listing male behaviours that might showcase this protection) or referencing other studies that support this claim. Another example of this can be found in lines 223-224, which suggest that females choose lower-ranking individuals when they are presented with a larger pool of competitors; however, in lines 227-228, it's stated that this result contradicts previous work in baboons, which makes the previous claim seem unjustified. I recommend adding other examples from studies that support the results of this paper and adding a line that addresses possible reasons for these differences between gorillas and baboons (for example, different social dynamics or ecological constraints). In addition, I suggest the inclusion of physiological data such as direct measures of energy expenditure, caloric intake, or hormone levels, as it would strengthen the claims made in the second paragraph of the discussion. However, I understand this might not be possible due to data or time constraints, so I suggest adding more robust justification on why lactation and pregnancy were used as a proxy for energetic need. In the methods (lines 127-128), it is unclear which phase of the pregnancy or lactation is more energetically demanding. I would also suggest adding a comment on the limitations of using reproductive state to infer energetic need.
      Lastly, if the data is available, I believe it would be interesting to add body size and age of the females or the size difference between aggressor and target as explanatory variables in the models to test if physiological characteristics influence female-on-female aggression.

      Male support:

      We have now added more references (Watts 1994, 1997) and enriched our arguments regarding male presence buffering aggression. Previous research suggests that male gorillas may support lower-ranking females and may intervene in female-female conflicts (Sicotte 2002). Unfortunately, our dataset did not allow us to test for male protection. We conduct proximity scans every 10 minutes, and these scans are not associated with individual interactions, meaning that we cannot reliably test whether proximity to a male influences the likelihood of receiving aggression.

      Number of competitors and choice of weaker competitors:

      We added a very relevant reference in humans, showing that people choose weaker competitors when they can choose. We removed the baboon example because it used sex ratio and its relevance to our study was not that straightforward.

      Reproductive state as a proxy for energetic needs:

      We now mention clearly that reproductive state is an indirect measure of energetic needs.

      We rephrased our methods to: “Lactation is often considered more energetically demanding than pregnancy as a whole but the latest stages of pregnancy are highly energetically demanding, potentially even more than lactation”

      Unfortunately, we do not have access to physiological and body size data. Regarding female age, for many females the ages are estimates with errors of up to a decade, and thus we chose not to use age as a predictor. Having accurate values for all these variables would indeed be very valuable and would improve the predictive power of our study.

      Recommendations for writing and presentation:

      Overall, the manuscript is well-organised and well-written, but there are certain areas that could improve in clarity. In the introduction, I believe that the term 'aggression heuristic' should be introduced earlier and properly defined in order to accommodate a broader audience. The main question and aims of the study are not stated clearly in the last paragraph of the introduction. In the methods, I think it would improve the clarity to add a table for the classification of each type of agonistic interaction instead of naming them in the text. For example, a table that showcases the three intensity categories (severe, mild and moderate) and then lists each behaviour (e.g. hit, bite, attack, etc.) with a short description. I think this would be helpful since some of the behaviours mentioned can be confusing (what's the difference between attack, hit and fight?). In addition, in line 104, it states that all interactions were assigned equal intensity, which needs to be explained.

      We now define aggression heuristics in both the abstract and the first paragraph of the introduction. We have also explained the aggressive interactions whose nature was not obvious from their names. Hopefully, these explanations make clear the differences among the recorded behaviours.

      We have now specified that the “equal intensity” refers to avoidances and displacements used to infer power relationships: “We assigned to all avoidance/displacement interactions equal intensity, that is, equal influence to the power relationship of the interacting individuals”

      Minor corrections:

      (1) In line 41, there is a 1 after 'similar'. I am unsure if it's a mistake or a reference.

      We corrected the typo.

      (2) In lines 68-69, there is mention of other studies, but no references are provided.

      We added citations as suggested.

      (3) Remove the reference to Figure 1 (line 82) from the introduction; the figure should be referenced in the text just before the image, however, your figure is in a different section.

      We removed the reference as suggested.

      (4) Line 98 and 136, it's written 'ad libtum' but the correct spelling is 'ad libitum'.

      We corrected the typo.

      (5) Figure 3, remove the underscores between the words in the axis titles.

      We removed the underscores.

      Reviewer #2 (Recommendations for the authors):

      Here, I have outlined some specific suggestions that require attention. Addressing these comments will enhance the readability and enhance the quality of the manuscript.

      (1) L69. Add citation here, indicating the studies focusing on aggression rates.

      We added citations as suggested.

      (2) L88. The study periods used in this study and the authors' previous study (Reference 11) are different. So please add one table as Table 1 showing detailed information on the sampling efforts and data included in the analysis of this study. For example, the study period, the numbers of females and males, sampling hours, the number of avoidance/displacement behaviors used to calculate individual Elo-ratings, and the number of mild/moderate/severe aggressive interactions, etc.

      We have now added another table, as suggested (new Table 1) and we have also made clear that we used the hierarchies presented in detail in (Smit & Robbins 2025).

      (3) L103. If readers do not look over Reference 25 on purpose, they do not know what the authors want to talk about and why they mention the optimized Elo-rating method. Clarify this statement and add more content explaining the differences between the two methods, or just remove it.

      We rephrased the text and, in response to the previous comment, we clearly state that there are more details about our approach in Smit & Robbins 2025. At the end of the relevant sentence, we added the following parenthesis “(see “traditional Elo rating method”; we do not use the “optimized Elo-rating method” as it yields similar results and it is not widely used)” and we removed the sentence referring to the optimized Elo-rating method.

      (4) L110. Here, the authors stated that the individual with the standardized Elo-score 1 was the highest-ranking. L117, the "aggression direction" score of each aggressive interaction was the standardized Elo-score of the aggressor, subtracting that of the recipient. So, when the "aggression direction" score was 1, it should mean that the aggressor was the highest-ranking and the recipient was the lowest-ranking female. This is not as the authors stated in L117-120 (where the description was incorrectly reversed). Please clarify.

      The highest-ranking individual indeed has an Elo_score equal to 1, and we calculated the interaction score (or "aggression direction score") of each aggressive interaction by subtracting the standardized Elo-score of the aggressor from that of the recipient (Elo_recipient – Elo_aggressor). So, when the aggressor is the lowest-ranking female (Elo_score=0) and the recipient is the highest-ranking female (Elo_score=1), the "aggression direction score" is 1-0 = 1.
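
      The sign convention described in this response can be made concrete with a minimal sketch. The function name and the range check are illustrative assumptions and are not taken from the authors' analysis code; only the formula (recipient score minus aggressor score, both standardized to the range 0 to 1) comes from the response.

```python
def aggression_direction_score(elo_aggressor: float, elo_recipient: float) -> float:
    """Interaction score = Elo_recipient - Elo_aggressor.

    Both inputs are standardized Elo-scores in [0, 1], where 1 is the
    highest-ranking and 0 the lowest-ranking female. Positive scores mean
    aggression directed up the hierarchy; negative scores mean aggression
    directed down the hierarchy.
    """
    for name, value in (("elo_aggressor", elo_aggressor),
                        ("elo_recipient", elo_recipient)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a standardized score in [0, 1]")
    return elo_recipient - elo_aggressor


# Lowest-ranking female attacks the highest-ranking female.
up_hierarchy = aggression_direction_score(elo_aggressor=0.0, elo_recipient=1.0)

# Highest-ranking female attacks the lowest-ranking female.
down_hierarchy = aggression_direction_score(elo_aggressor=1.0, elo_recipient=0.0)
```

      Under this convention the score ranges from -1 to 1, and a score of 1 is only possible when the lowest-ranking female targets the highest-ranking one.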

      (5) Regarding point 3 of the Public Review, please also revise/expand the paragraph L193-208 in the Discussion section accordingly.

      Please see our response to the public review. We have enriched the results section, added pairwise comparisons in a new table (Table 2) and modified the discussion accordingly.

      (6) Table 1. It's not clear why authors added the column 'Aggression Rate' but did not provide any explanation in the Methods/Results section. How did they calculate the correlation between each tested variable and the "overall adult female aggression rates"? Correlating the number of females in the first trimester of female pregnancy with the female aggression rates in each study group? What did the correlation coefficients mean? L202-204 may provide some hints as to why the authors introduced the Aggression Rate. But it should be made clear in the previous text.

      We now added more details in the legend of the table to make our point clear: “To highlight that aggression rates can increase due to increase in interactions of different score, we also include the effect of some of the tested variables on overall adult female aggression rates, based on results of linear mixed effects models from (Smit & Robbins 2024).”  We did not include detailed methods to calculate those results because they are detailed in (Smit & Robbins 2024). We find it valuable to show the results of both aggression rates and aggression directionality according to the same predictor variables as a means to clarify that aggression rates and aggression directionality are not always coordinated to one another (they do not always change in a consistent manner relative to one another).

      (7) L166.This is not rigorous. Please rephrase. There is only one western gorilla group containing only one resident male included in the analysis.

      We have toned down our text: “Our results did not show any significant difference between female-female aggression patterns within the one western and four mountain gorilla groups”

      (8) L167. I don't think the interaction scores in the third trimester of female pregnancy were significantly higher than those in the first trimester. The same concern applies in L194-195.

      We have now added a new table with post hoc pairwise comparisons among the different reproductive states that clarifies that.

      (9) L202. There is no column 'Aggression rates' in Table 1 of Reference 11.

      We have rephrased to make clear that we refer to Table 1 of the present study.

      (10) L204-205. Reference 49. Maybe not a proper citation here. This claim requires stronger evidence or further justification. Additionally, please rephrase and clarify the arguments in L204-208 for better readability and precision.

      We have added three more references and rephrased to clarify our argument.

      Reviewer #3 (Recommendations for the authors):

      (1) Line 41: The word "similar" is misspelled.

      We corrected the typo.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      Chao et al. produced an updated version of the SpliceAI package using modern deep learning frameworks. This includes data preprocessing, model training, direct prediction, and variant effect prediction scripts. They also added functionality for model fine-tuning and model calibration. They convincingly evaluate their newly trained models against those from the original SpliceAI package and investigate how to extend SpliceAI to make predictions in new species. While their comparisons to the original SpliceAI models are convincing on the grounds of model performance, their evaluation of how well the new models match the original's understanding of non-local mutation effects is incomplete. Further, their evaluation of the new calibration functionality would benefit from a more nuanced discussion of what set of splice sites their calibration is expected to hold for, and tests in a context for which calibration is needed.

      Strengths:

      (1) They provide convincing evidence that their new implementation of SpliceAI matches the performance of the original model on a similar dataset while benefiting from improved computational efficiencies. This will enable faster prediction and retraining of splicing models for new species as well as easier integration with other modern deep learning tools.

      (2) They produce models with strong performance on non-human model species and a simple, well-documented pipeline for producing models tuned for any species of interest. This will be a boon for researchers working on splicing in these species and make it easy for researchers working on new species to generate their own models.

      (3) Their documentation is clear and abundant. This will greatly aid the ability of others to work with their code base.

      We thank the reviewer for these positive comments.  

      Weaknesses:

      (1) The authors' assessment of how much their model retains SpliceAI's understanding of "nonlocal effects of genomic mutations on splice site location and strength" (Figure 6) is not sufficiently supported. Demonstrating this would require showing that for a large number of (non-local) mutations, their model shows the same change in predictions as SpliceAI or that attribution maps for their model and SpliceAI are concordant even at distances from the splice site. Figure 6A comes close to demonstrating this, but only provides anecdotal evidence as it is limited to 2 loci. This could be overcome by summarizing the concordance between ISM maps for the two models and then comparing across many loci. Figure 6B also comes close, but falls short because instead of comparing splicing prediction differences between the models as a function of variants, it compares the average prediction difference as a function of the distance from the splice site. This limits it to only detecting differences in the model's understanding of the local splice site motif sequences. This could be overcome by looking at comparisons between differences in predictions with mutants directly and considering non-local mutants that cause differences in splicing predictions.

      We agree that two loci are insufficient to demonstrate preservation of non-local effects. To address this, we have extended our analysis to a larger set of sites: we randomly sampled 100 donor and 100 acceptor sites, applied our ISM procedure over a 5,001 nt window centered at each site for both models, and computed the ISM map as before. We then calculated the Pearson correlation between the collection of OSAI<sub>MANE</sub> and SpliceAI ISM importance scores. We also created 10 additional ISM maps similar to those in Figure 6A, which are now provided in Figure S23.

      The following is the revised paragraph in the manuscript’s Results section:

      First, we recreated the experiment from Jaganathan et al. in which they mutated every base in a window around exon 9 of the U2SURP gene and calculated its impact on the predicted probability of the acceptor site. We repeated this experiment on exon 2 of the DST gene, again using both SpliceAI and OSAI<sub>MANE</sub> . In both cases, we found a strong similarity between the resultant patterns between SpliceAI and OSAI<sub>MANE</sub> , as shown in Figure 6A. To evaluate concordance more broadly, we randomly selected 100 donor and 100 acceptor sites and performed the same ISM experiment on each site. The Pearson correlation between SpliceAI and OSAI<sub>MANE</sub> yielded an overall median correlation of 0.857 (see Methods; additional DNA logos in Figure S23). 

      To characterize the local sequence features that both models focus on, we computed the average decrease in predicted splice-site probability resulting from each of the three possible single-nucleotide substitutions at every position within 80 bp for 100 donor and 100 acceptor sites randomly sampled from the test set (Chromosomes 1, 3, 5, 7, and 9). Figure 6B shows the average decrease in splice site strength for each mutation in the format of a DNA logo, for both tools.

      We added the following text to the Methods section:

      Concordance evaluation of ISM importance scores between OSAI<sub>MANE</sub> and SpliceAI

      To assess agreement between OSAI<sub>MANE</sub> and SpliceAI across a broad set of splice sites, we applied our ISM procedure to 100 randomly chosen donor sites and 100 randomly chosen acceptor sites. For each site, we extracted a 5,001 nt window centered on the annotated splice junction and, at every coordinate within that window, substituted the reference base with each of the three alternative nucleotides. We recorded the change in predicted splice-site probability for each mutation and then averaged these Δ-scores at each position to produce a 5,001-score ISM importance profile per site.

      Next, for each splice site we computed the Pearson correlation coefficient between the paired importance profiles from ensembled OSAI<sub>MANE</sub> and ensembled SpliceAI. The median correlation was 0.857 for all splice sites. Ten additional zoom-in representative splice site DNA logo comparisons are provided in Supplementary Figure S23.
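
      The concordance procedure described above can be sketched as follows. This is a toy illustration under stated assumptions, not the OpenSpliceAI or SpliceAI code: `toy_model_a` and `toy_model_b` are hypothetical stand-ins for the two predictors, the sequence is 15 nt instead of a 5,001 nt window, and the Δ-score is taken here as mutant probability minus reference probability.

```python
import numpy as np

BASES = "ACGT"


def ism_profile(seq, site, predict):
    """Per-position ISM importance: mean change in the predicted probability
    at `site` over the three alternative bases at each position."""
    ref_prob = predict(seq, site)
    profile = np.zeros(len(seq))
    for pos, ref_base in enumerate(seq):
        deltas = [
            predict(seq[:pos] + alt + seq[pos + 1:], site) - ref_prob
            for alt in BASES
            if alt != ref_base
        ]
        profile[pos] = np.mean(deltas)
    return profile


def profile_correlation(profile_a, profile_b):
    """Pearson correlation between two ISM importance profiles."""
    return float(np.corrcoef(profile_a, profile_b)[0, 1])


def toy_model_a(seq, site):
    # Hypothetical predictor: fraction of G within 3 nt of the site.
    window = seq[max(0, site - 3): site + 4]
    return window.count("G") / len(window)


def toy_model_b(seq, site):
    # Hypothetical predictor: mostly model A plus a weak global T-content term.
    return 0.8 * toy_model_a(seq, site) + 0.2 * seq.count("T") / len(seq)


seq = "ACGTGGTACGTTAGC"
site = 7
profile_a = ism_profile(seq, site, toy_model_a)
profile_b = ism_profile(seq, site, toy_model_b)
r = profile_correlation(profile_a, profile_b)
```

      Repeating this for every sampled site and taking `np.median` over the resulting correlations would give a summary statistic of the kind reported in the response.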

      (2) The utility of the calibration method described is unclear. When thinking about a calibrated model for splicing, the expectation would be that the models' predicted splicing probabilities would match the true probabilities that positions with that level of prediction confidence are splice sites. However, the actual calibration that they perform only considers positions as splice sites if they are splice sites in the longest isoform of the gene included in the MANE annotation. In other words, they calibrate the model such that the model's predicted splicing probabilities match the probability that a position with that level of confidence is a splice site in one particular isoform for each gene, not the probability that it is a splice site more broadly. Their level of calibration on this set of splice sites may very well not hold to broader sets of splice sites, such as sites from all annotated isoforms, sites that are commonly used in cryptic splicing, or poised sites that can be activated by a variant. This is a particularly important point as much of the utility of SpliceAI comes from its ability to issue variant effect predictions, and they have not demonstrated that this calibration holds in the context of variants. This section could be improved by expanding and clarifying the discussion of what set of splice sites they have demonstrated calibration on, what it means to calibrate against this set of splice sites, and how this calibration is expected to hold or not for other interesting sets of splice sites. Alternatively, or in addition, they could demonstrate how well their calibration holds on different sets of splice sites or show the effect of calibrating their models against different potentially interesting sets of splice sites and discuss how the results do or do not differ.

      We thank the reviewer for highlighting the need to clarify our calibration procedure. Both SpliceAI and OpenSpliceAI are trained on a single “canonical” transcript per gene: SpliceAI on the hg19 Ensembl/Gencode canonical set and OpenSpliceAI on the MANE transcript set. To calibrate each model, we applied post-hoc temperature scaling, i.e. a single learnable parameter that rescales the logits before the softmax. This adjustment does not alter the model’s ranking or discrimination (AUC/precision–recall) but simply aligns the predicted probabilities for donor, acceptor, and non-splice classes with their observed frequencies. As shown in our reliability diagrams (Fig. S16-S22), temperature scaling yields negligible changes in performance, confirming that both SpliceAI and OpenSpliceAI were already well-calibrated. However, we acknowledge that we did not measure how calibration might affect predictions on non-canonical splice sites or on cryptic splicing. It is possible that calibration might have a detrimental effect on those, but because this is not a key claim of our paper, we decided not to do further experiments. We have updated the manuscript to acknowledge this potential shortcoming; please see the revised paragraph in our next response.
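
      As a rough illustration of post-hoc temperature scaling, the sketch below fits a single temperature on synthetic three-class logits using NumPy. It is not the OpenSpliceAI calibration code: the synthetic data, the grid search (used here in place of gradient-based fitting), and all function names are assumptions made for the example.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a single scalar temperature."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    probs = softmax(logits, temperature)
    true_probs = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(true_probs + 1e-12)))


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid that minimizes held-out NLL."""
    if grid is None:
        grid = np.linspace(0.25, 4.0, 151)
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])


# Synthetic three-class data (stand-ins for donor / acceptor / non-splice).
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=500)
base = rng.normal(size=(500, 3))
base[np.arange(500), labels] += 2.0  # true class gets a higher score
logits = 3.0 * base                  # deliberately overconfident logits

T = fit_temperature(logits, labels)
before = softmax(logits)
after = softmax(logits, T)
```

      Because every logit is divided by the same positive scalar, the predicted class for each position is unchanged; only the confidence of the probabilities moves, which is why ranking-based metrics are unaffected.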

      (3) It is difficult to assess how well their calibration method works in general because their original models are already well calibrated, so their calibration method finds temperatures very close to 1 and only produces very small and hard to assess changes in calibration metrics. This makes it very hard to distinguish if the calibration method works, as it doesn't really produce any changes. It would be helpful to demonstrate the calibration method on a model that requires calibration or on a dataset for which the current model is not well calibrated, so that the impact of the calibration method could be observed.

      It’s true that the models we calibrated didn’t need many changes. It is possible that the calibration methods we used (which were not ours, but which were described in earlier publications) can’t improve the models much. We toned down our comments about this procedure, as follows.

      Original:

      “Collectively, these results demonstrate that OSAIs were already well-calibrated, and this consistency across species underscores the robustness of OpenSpliceAI’s training approach in diverse genomic contexts.”

      Revised:

      “We observed very small changes after calibration across phylogenetically diverse species, suggesting that OpenSpliceAI’s training regimen yielded well‐calibrated models, although it is possible that a different calibration algorithm might produce further improvements in performance.”

      Reviewer #2 (Public review):

      Summary:

      The paper by Chao et al offers a reimplementation of the SpliceAI algorithm in PyTorch so that the model can more easily/efficiently be retrained. They apply their new implementation of the SpliceAI algorithm, which they call OpenSpliceAI, to several species and compare it against the original model, showing that the results are very similar and that in some small species, pretraining on other species helps improve performance.

      Strengths:

      On the upside, the code runs fine, and it is well documented.

      Weaknesses:

      The paper itself does not offer much beyond reimplementing SpliceAI. There is no new algorithm, new analysis, new data, or new insights into RNA splicing. There is no comparison to many of the alternative methods that have since been published to surpass SpliceAI. Given that some of the authors are well-known with a long history of important contributions, our expectations were admittedly different. Still, we hope some readers will find the new implementation useful.

      We thank the reviewer for the feedback. We have clarified that OpenSpliceAI is an open-source PyTorch reimplementation optimized for efficient retraining and transfer learning, designed to analyze cross-species performance gains, and supported by a thorough benchmark and the release of several pretrained models to clearly position our contribution.

      Reviewer #3 (Public review):

      Summary:

      The authors present OpenSpliceAI, a PyTorch-based reimplementation of the well-known SpliceAI deep learning model for splicing prediction. The core architecture remains unchanged, but the reimplementation demonstrates convincing improvements in usability, runtime performance, and potential for cross-species application.

      Strengths:

      The improvements are well-supported by comparative benchmarks, and the work is valuable given its strong potential to broaden the adoption of splicing prediction tools across computational and experimental biology communities.

      Major comments:

      Can fine-tuning also be used to improve prediction for human splicing? Specifically, are models trained on other species and then fine-tuned with human data able to perform better on human splicing prediction? This would enhance the model's utility for more users, and ideally, such fine-tuned models should be made available.

      We evaluated transfer learning by fine-tuning models pretrained on mouse (OSAI<sub>Mouse</sub>), honeybee (OSAI<sub>Honeybee</sub>), Arabidopsis (OSAI<sub>Arabidopsis</sub>), and zebrafish (OSAI<sub>Zebrafish</sub>) on human data. While transfer learning accelerated convergence compared to training from scratch, the final human splicing prediction accuracy was comparable between fine-tuned and scratch-trained models, suggesting that performance on our current human dataset is nearing saturation under this architecture.

      We added the following paragraph to the Discussion section:

      We also evaluated pretraining on mouse (OSAI<sub>Mouse</sub>), honeybee (OSAI<sub>Honeybee</sub>), zebrafish (OSAI<sub>Zebrafish</sub>), and Arabidopsis (OSAI<sub>Arabidopsis</sub>) followed by fine-tuning on the human MANE dataset. While cross-species pretraining substantially accelerated convergence during fine-tuning, the final human splicing-prediction accuracy was comparable to that of a model trained from scratch on human data. This result indicates that our architecture seems to capture all relevant splicing features from human training data alone, and thus gains little or no benefit from cross-species transfer learning in this context (see Figure S24).

      Reviewer #1 (Recommendations for the authors):

      We thank the editor for summarizing the points raised by each reviewer. Below is our point-by-point response to each comment:

      (1) In Figure 3 (and generally in the other figures) OpenSpliceAI should be replaced with OSAI_{Training dataset} because otherwise it is hard to tell which precise model is being compared. And in Figure 3 it is especially important to emphasize that you are comparing a SpliceAI model trained on Human data to an OSAI model trained and evaluated on a different species.

      We have updated the labels in Figure 3, replacing “OpenSpliceAI” with “OSAI_{training dataset}” to more clearly specify which model is being compared.

      (2) Are genes paralogous to training set genes removed from the validation set as well as the test set? If you are worried about data leakage in the test set, it makes sense to also consider validation set leakage.

      Thank you for this helpful suggestion. We fully agree, and to avoid any data leakage we implemented the identical filtering pipeline for both validation and test sets: we excluded all sequences paralogous or homologous to sequences in the training set, and further removed any sequence sharing > 80 % length overlap and > 80 % sequence identity with training sequences. The effect of this filtering on the validation set is summarized in Supplementary Figure S7C.

      Figure S7. (C) Scatter plots of DNA sequence alignments between validation and training sets for Human-MANE, mouse, honeybee, zebrafish, and Arabidopsis. Each dot represents an alignment, with the x-axis showing alignment identity and the y-axis showing alignment coverage. Alignments exceeding 80% for both identity and coverage are highlighted in the red-shaded region and were excluded from the test sets.
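
      The threshold step of the filter described above can be sketched as follows, assuming each alignment between a held-out sequence and a training sequence has already been summarised as an identity and a coverage fraction. The `Alignment` record, the function names, and the choice of aligner are hypothetical; the separate paralogy/homology exclusion is not shown.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alignment:
    query_id: str    # held-out (validation or test) sequence
    target_id: str   # training sequence it aligns to
    identity: float  # fraction of identical aligned positions, 0..1
    coverage: float  # fraction of the query covered by the alignment, 0..1

def leaky_queries(alignments, min_identity=0.80, min_coverage=0.80):
    """Return IDs of held-out sequences with an alignment above both thresholds."""
    return {
        a.query_id
        for a in alignments
        if a.identity > min_identity and a.coverage > min_coverage
    }

def filter_held_out(held_out_ids, alignments, **thresholds):
    """Drop flagged held-out sequences while preserving the input order."""
    flagged = leaky_queries(alignments, **thresholds)
    return [seq_id for seq_id in held_out_ids if seq_id not in flagged]
```

      Applying the same function to both the validation and the test IDs is one way to guarantee that the two sets are filtered identically.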

      Reviewer #3 (Recommendations for the authors):

      (1) The legend in Figure 3 is somewhat confusing. The labels like "SpliceAI-Keras (species name)" may imply that the model was retrained using data from that species, but that's not the case, correct?

      Yes, “SpliceAI-Keras (species name)” was not retrained; it refers to the released SpliceAI model evaluated on the specified species dataset. We have revised the Figure 3 legends, changing “SpliceAI-Keras (species name)” to “SpliceAI-Keras” to clarify this.

      (2) Please address the minor issues with the code, including ensuring the conda install works across various systems.

      We have addressed the issues you mentioned. OpenSpliceAI is now available on Conda and can be installed with:  conda install openspliceai. 

      The conda package homepage is at https://anaconda.org/khchao/openspliceai. We’ve also corrected all broken links in the documentation.

      (3) Utility:

      I followed all the steps in the Quick Start Guide, and aside from the issues mentioned below, everything worked as expected.

      I attempted installation using conda as described in the instructions, but it was unsuccessful. I assume this method is not yet supported.

      In Quick Start Guide: predict, the link labeled "GitHub (models/spliceai-mane/10000nt/)" appears to be incorrect. The correct path is likely "GitHub (models/openspliceaimane/10000nt/)".

      In Quick Start Guide: variant (https://ccb.jhu.edu/openspliceai/content/quick_start_guide/quickstart_variant.html#quick-startvariant), some of the download links for input files were broken. While I was able to find some files in the GitHub repository, I think the -A option should point to data/grch37.txt, not examples/data/input.vcf, and the -I option should be examples/data/input.vcf, not data/vcf/input.vcf.

      Thank you for catching these issues. We’ve now addressed all issues concerning Conda installation and file links. We thank the editor for thoroughly testing our code and reviewing the documentation.

    1. Author Response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      Summary:

      This work uses a novel, ethologically relevant behavioral task to explore decision-making paradigms in C. elegans foraging behavior. By rigorously quantifying multiple features of animal behavior as they navigate in a patch food environment, the authors provide strong evidence that worms exhibit one of three qualitatively distinct behavioral responses upon encountering a patch: (1) "search", in which the encountered patch is below the detection threshold; (2) "sample", in which animals detect a patch encounter and reduce their motor speed, but do not stay to exploit the resource and are therefore considered to have "rejected" it; and (3) "exploit", in which animals "accept" the patch and exploit the resource for tens of minutes. Interestingly, the probability of these outcomes varies with the density of the patch as well as the prior experience of the animal. Together, these experiments provide an interesting new framework for understanding the ability of the C. elegans nervous system to use sensory information and internal state to implement behavioral state decisions.

      Strengths:

      The work uses a novel, neuroethologically-inspired approach to studying foraging behavior

      The studies are carried out with an exceptional level of quantitative rigor and attention to detail

      Powerful quantitative modeling approaches including GLMs are used to study the behavioral states that worms enter upon encountering food, and the parameters that govern the decision about which state to enter

      The work provides strong evidence that C. elegans can make 'accept-reject' decisions upon encountering a food resource

      Accept-reject decisions depend on the quality of the food resource encountered as well as on internally represented features that provide measurements of multiple dimensions of internal state, including feeding status and time

      Reviewer #2 (Public review):

      This study provides an experimental and computational framework to examine and understand how C. elegans make decisions while foraging in environments with patches of food. The authors show that C. elegans reject or accept food patches depending on a number of internal and external factors.

      The key novelty of this paper is the explicit demonstration of behavior analysis and quantitative modeling to elucidate decision-making processes. In particular, the description of the exploring vs. exploiting phases, and sensing vs. non-sensing categories of foraging behavior based on the clustering of behavioral states defined in a multi-dimensional behavior-metrics space, and the implementation of a generalized linear model (GLM) whose parameters can provide quantitative biological interpretations.

      The work builds on the literature of C. elegans foraging by adding the reject/accept framework.

      Reviewer #3 (Public review):

      Summary:

      In this study by Haley et al, the authors investigated explore-exploit foraging using C. elegans as a model system. Through an elegant set of patchy environment assays, the authors built a GLM based on past experience that predicts whether an animal will decide to stay on a patch to feed and exploit that resource, instead of choosing to leave and explore other patches.

      Strengths:

      I really enjoyed reading this paper. The experiments are simple and elegant, and address fundamental questions of foraging theory in a well-defined system. The experimental design is thoroughly vetted, and the authors provide a considerable volume of data to prove their points. My only criticisms have to do with the data interpretation, which I think are easily addressable.

      Weaknesses:

      History-dependence of the GLM

      The logistic GLM seems like a logical way to model a binary choice, and I think the parameters you chose are certainly important. However, the framing of them seems odd to me. I do not doubt the animals are assessing the current state of the patch with an assessment of past experience; that makes perfect logical sense. However, it seems odd to reduce past experience to the categories of recently exploited patch, recently encountered patch, and time since last exploitation. This implies the animals have some way of discriminating these past patch experiences and committing them to memory. Also, it seems logical that the time on these patches, not just their density, should also matter, just as the time without food matters. Time is inherent to memory. This model also imposes a prior categorization in trying to distinguish between sensed vs. not-sensed patches, which I criticized earlier. Only "sensed" patches are used in the model, but it is questionable whether worms genuinely do not "sense" these patches.

      It seems more likely that the worm simply has some memory of chemosensation and relative satiety, both of which increase on patches and decrease while off of patches. The magnitudes are likely a function of patch density. That being said, I leave it up to the reader to decide how best to interpret the data.

      Model design: We agree with the reviewer that past experience is not likely to be discretized into the exact parameters of our model. We have added to our manuscript to further clarify this point (lines 645-647). Investigating the mechanisms behind this behavior is beyond the scope of this project but is certainly an exciting trajectory for future C. elegans research.
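
      To make the structure under discussion concrete, here is a minimal sketch of a logistic GLM for the accept-reject decision, assuming the four covariates named in the review (density of the current patch, of the most recently exploited patch, of the most recently encountered patch, and time since last exploitation). The function name, covariate names, scaling, and any coefficient values are hypothetical and are not the fitted model from the manuscript.

```python
import math

def p_exploit(current_density, exploited_density, encountered_density,
              time_since_exploit, coef):
    """Probability of accepting (exploiting) a patch under a logistic GLM.

    `coef` maps covariate names to weights; all values are illustrative.
    """
    # Linear predictor: weighted sum of current and past-experience covariates.
    eta = (
        coef["intercept"]
        + coef["current_density"] * current_density
        + coef["exploited_density"] * exploited_density
        + coef["encountered_density"] * encountered_density
        + coef["time_since_exploit"] * time_since_exploit
    )
    # Logistic link converts the linear predictor into a probability.
    return 1.0 / (1.0 + math.exp(-eta))
```

      In this form the reviewer's alternative (continuous satiety and chemosensory memory variables) would amount to replacing the three history covariates with time-integrated quantities while keeping the same link function.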

      osm-6

      The argument is that osm-6 animals can't sense food very well, so when they sense it, they enter the exploitation state by default. That is what they appear to do, but why? Clearly they are sensing the food in some other way, correct? Are ciliated neurons the only way worms can sense food? Don't they also actively pump on food, and can therefore sense the food entering their pharynx? I think you could provide further insight by commenting on this. Perhaps your decision model is dependent on comparing environmental sensing with pharyngeal sensing? Food intake certainly influences their decision, no? Perhaps food intake triggers exploitation behavior, which can be over-run by chemo/mechanosensory information?

      osm-6 behavior: We thank the reviewer for pointing out the need to further elaborate on a mechanistic hypothesis to explain the behavior of osm-6 sensory mutants. We agree with the reviewer’s speculation that post-ingestive and other non-ciliary sensory cues likely drive detection of food. We have added additional commentary to our manuscript to state this (lines 529-538).

      Impact

      I think this work will have a solid impact on the field, as it provides tangible variables to test how animals assess their environment and decide to exploit resources. I think this research could be strengthened by a reassessment of their model that would both simplify it and provide testable timescales of satiety/starvation memory.

      Reviewer #2 (Recommendations for the authors):

      The authors have addressed most of my concerns.

      Reviewer #3 (Recommendations for the authors):

      The authors provide a considerable amount of processed data (great, thank you!), but it would be even better if they provided the raw data of the worm coordinates, and when and where these coordinates overlapped with patches. This is the raw data that was ultimately used for all the quantifications in the paper, and would be incredibly useful to readers who are interested in modeling the data themselves.

      This should not be prohibitive.

      Data Availability: We thank the reviewer for pointing out this need. We are uploading all processed data (e.g. worm coordinates relative to the arena and patches) to a curated data storage server. We have updated our data availability statement to state this (lines 684-688).

      Search vs. sample & sensing vs. non-sensing.

      The different definitions of behaviors in Figures 2H-K are a bit confusing. I think the confusion stems in part from the changing terms and color associations in Figures 2 H-K. Essentially the explore density in Figure 2 H is split into two densities based on the two densities (sensing vs. non-responding) observed in Figure 2I. In turn, the sensing density in Figure 2I is split into two densities (explore vs exploit) based on the two densities observed in Figure 2 H. But the way the figures are colored, yellow means search (Figure 2H) and non-responding (Figure 2I), green means exploit (Figure 2H) which includes sensing and non-responding, but also exclusively sensing (Figure 2I), and blue consistently means exploit in both figures. It might help to use two different color codes for Figures 2H and 2I, and then in 2J you define search as explore AND non-responding, sample as explore AND sensing, and exploit as exploit.

      Color schema: While we understand the confusion, we believe that introducing additional colors may also present some misunderstandings. We have decided to leave the figure as it is.

    1. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      Two important factors in visual performance are the resolving power of the lens and the signal-to-noise ratio of the photoreceptors. These both compete for space: a larger lens has improved resolving power over a smaller one, and longer photoreceptors capture more photons and hence generate responses with lower noise. The current paper explores the tradeoff of these two factors, asking how space should be allocated to maximize eye performance (measured as encoded information).

      Your summary is clear, concise and elegant. The competition is not just for space, it is for space, materials and energy. We now emphasise that we are considering these three costs in our rewrites of the Abstract and the first paragraph of the Discussion.

      Strengths:

      The topic of the paper is interesting and not well studied. The approach is clearly described and seems appropriate (with a few exceptions - see weaknesses below). In most cases, the parameter space of the models is well explored and tradeoffs are clear.

      Weaknesses:

      Light level

      The calculations in the paper assume high light levels (which reduces the number of parameters that need to be considered). The impact of this assumption is not clear. A concern is that the optimization may be quite different at lower light levels. Such a dependence on light level could explain why the model predictions and experiment are not in particularly good agreement. The paper would benefit from exploring this issue.

      Thank you for raising this point. We briefly explained in our original Discussion, under “Understanding the adaptive radiation of eyes” (Version 1, lines 756 – 762), how our method can be modified to investigate eyes adapted for lower light levels. We have some thoughts on how eyes might be adapted. In general, transduction rates are increased by increasing D, reducing f, increasing d<sub>rh</sub> and increasing L. In addition, d<sub>rh</sub> is increased to allow for a larger D within the constraint of eye radius/corneal surface area, and to avoid wasteful oversampling (the changes in D, f and d<sub>rh</sub> increase acceptance angle ∆ρ). We suspect that in eyes optimised for the efficient use of space, materials and energy the increases in L will be relatively small. First, increasing D, reducing f and increasing d<sub>rh</sub> are much more effective at increasing transduction rate than increasing L. Second, increasing sensitivity by reducing f decreases the cost V<sub>o</sub> whereas increasing sensitivity by increasing L increases the cost V<sub>ph</sub>. This disadvantage, together with exponential absorption, might explain why L is only 10% - 20% longer in the apposition eyes of nocturnal bees (Somanathan et al, J. comp. Physiol. A195, 571–583, 2009). Because this line of argument is speculative and enters new territory, we have not included it in our revised version. We already present a lot of new material for readers to digest, and we agree with referee 2 that “It is possible to extend the theory to other types of eyes, although it would likely require more variables and assumptions/constraints to the theory. It is thus good to introduce the conceptual ideas without overdoing the applications of the theory”. Nonetheless, we take your point that some of the eyes in our data set might be adapted for lower light levels, and we have rewritten the Discussion section “How efficiently do insects allocate resources within their apposition eyes” accordingly.
      On lines 827 – 843 we address the assumption that eyes are adapted for full daylight, and also take the opportunity to mention two more reasons for increasing the eye parameter p, namely increasing image velocity (Snyder, 1979), and constructing bright zones that increase the detectability of small targets (van Hateren et al., 1989; Straw et al., 2006).

      Discontinuities

      The discontinuities and non-monotonicity of the optimal parameters plotted in Figure 4 are concerning. Are these a numerical artifact? Some discussion of their origin would be quite helpful.

      Good points. We now address the discontinuities in the Results, where they are first observed (lines 311 - 319).

      Discrepancies between predictions and experiment

      As the authors clearly describe, experimental measurements of eye parameters differ systematically from those predicted. This makes it difficult to know what to take away from the paper. The qualitative arguments about how resources should be allocated are pretty general, and the full model seems a complex way to arrive at those arguments. Could this reflect a failure of one of the assumptions that the model rests on - e.g. high light levels, or that the cost of space for photoreceptors and optics is similar? Given these discrepancies between model and experiment, it is also hard to evaluate conclusions about the competition between optics and photoreceptors (e.g. at the end of the abstract) and about the importance for evolution (end of introduction).

      Your misgivings boil down to two issues: what use is a model that fails to fit the data, and do we need a complicated model to show something that seems to be intuitively obvious? Our study is useful because it introduces new approaches, methods, factors and explanations which advance our analysis and understanding of eye design and evolution. Your comments make it clear that we failed to get this message across and we have revised the manuscript accordingly. We have rewritten the Abstract and the first paragraph of the Discussion to emphasise the value of our new measure of cost, specific volume, by including more of its practical advantages. In particular, our use of specific volume 1) opens the door to the morphospace of all eyes of a given type and cost; 2) allows one to construct performance surfaces across morphospace that not only identify optima but, by evaluating the sub-optimal, cast light on efficiency and adaptability; 3) shows that photoreceptor energy costs have a major impact on design and efficiency; and 4) allows us to calculate and compare the capacities and efficiencies of compound eyes and simple eyes using a superior measure of cost. It is also possible that your dissatisfaction was deepened by disappointment. The first sentence of our original Abstract said that the goal of design is to maximize performance, so you might have expected to see that eyes are optimised. Given that optimization provides cast iron proof that a system is designed to be efficient, and previous studies of coding by fly LMCs (Laughlin, 1981; Srinivasan et al., 1982; van Hateren, 1992) validated Barlow’s Efficient Coding Hypothesis by showing that coding is optimised, your expectation is reasonable. However, our investigation of how the allocation of resources to optics and photoreceptors affects an eye’s performance, efficiency and design does not depend a priori on finding optima; therefore we have removed the “maximized”.
      Our revised Abstract now says, “to improve performance”.

      In short, our study illustrates an old adage in statistics: “All models fail to fit, but some are useful”. As is often the case, the way in which our model fails is useful. In the original version of the Results and Discussion, we argued that the allocation of resources is efficient, and identified factors that can, in principle, explain the scattering of data points. Indeed, our modelling identifies two of these deficiencies: a lack of data on species-specific energy usage, and the need for models that quantify the relationship between the quality of the captured image and the behavioural tasks for which an eye might be specialised. Thus, by examining the model’s failings we identify critical factors and pose new questions for future research. We have rewritten the Discussion section “How efficiently do insects allocate resources…” to make these points. We hope that these revisions will convince you that we have established a starting point for definitive studies, invented a vehicle that has travelled far enough to discover new territory, and shown that it can be modified to cope with difficult terrain.

      Turning to the need for a complicated model, because the costs and benefits depend on elementary optics and geometry, we too thought that there ought to be a simple model. However, when we tried to formulate a simple set of equations that approximate the definitive findings of our more complicated model, we discovered that this is not as straightforward as we thought. Many of the parameters in our model interact to determine costs and benefits, and many of these interactions are non-linear (e.g. the volumes of shells in spheres involve quadratic and cubic terms, and information depends on the log of a square root). So, rather than hold back publication of our complicated model, we decided to explain how it works as clearly as we can and demonstrate its value.
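
      The two non-linearities named above can be illustrated with generic expressions. These are a sketch in assumed notation (a shell of thickness L lying just inside a sphere of radius R, and a Shannon-type information term with signal-to-noise ratio SNR), not the manuscript's own equations:

```latex
% Volume of a spherical shell of thickness L inside radius R:
% quadratic and cubic terms in R and L appear on expansion.
V_{\mathrm{shell}}
  = \tfrac{4}{3}\pi\left[R^{3} - (R-L)^{3}\right]
  = 4\pi\left(R^{2}L - RL^{2} + \tfrac{1}{3}L^{3}\right)

% Information as the log of a square root of (1 + SNR):
H \propto \log_{2}\sqrt{1 + \mathrm{SNR}}
  = \tfrac{1}{2}\log_{2}\left(1 + \mathrm{SNR}\right)
```

      Because cost grows polynomially with the geometric parameters while the benefit grows only logarithmically with signal quality, a simple closed-form optimum is not to be expected.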

      In response to your final comment, “it is hard to evaluate conclusions about the competition between optics and photoreceptors (e.g. at the end of the abstract) and about the importance for evolution (end of introduction)”, we stand by our original argument. There must be competition in an eye of fixed cost, and because competition favours a heavy investment in photoreceptors, both in theory and in practice, it is a significant factor in eye design. A match between investments in optics and photoreceptors is predicted by theory and observed in fly NS eyes, therefore this is a design principle. As for evolution, no one would deny that it is important to view the adaptive radiation of eyes through a cost-benefit lens. Our lens is the first to view the whole eye, optics and photoreceptor array, and the first to treat the costs of space, materials and energy. Although the view through our lens is a bit fuzzy, it reveals that costs, benefits and trade-offs are important. Thus we have established a promising starting point for a new and more comprehensive cost-benefit approach to understanding eye design and evolution. As for the involvement of genes, when there are heritable changes in phenotype, genes must be involved, and if, as we suggest, efficient resource allocation is beneficial, the developmental mechanisms responsible for allocating resources to optics and photoreceptor array will be playing a formative role in eye evolution.

      Reviewer #2 (Public Review):

      Summary:

      In short, the paper presents a theoretical framework that predicts how resources should be optimally distributed between receptors and optics in eyes.

      Strengths:

      The authors build on the principle of resource allocation within an organism and develop a formal theory for optimal distribution of resources within an eye between the receptor array and the optics. Because the two parts of eyes, receptor arrays and optics, share the same role of providing visual information to the animal it is possible to isolate these from resource allocation in the rest of the animal. This allows for a novel and powerful way of exploring the principles that govern eye design. By clever and thoughtful assumptions/constraints, the authors have built a formal theory of resource allocation between the receptor array and the optics for two major types of compound eye as well as for camera-type eyes. The theory is formalized with variables that are well characterized in a number of different animal eyes, resulting in testable predictions.

      The authors use the theory to explain a number of design features that depend on different optimal distribution of resources between the receptor array and the optics in different types of eyes. As an example, they successfully explain why eye regions with different spatial resolution should be built in different ways. They also explain differences between different types of eyes, such as long photoreceptors in apposition compound eyes and much shorter receptors in camera type eyes. The predictive power in the theory is impressive.

      To keep the number of parameters at a minimum, the theory was developed for two types of compound eye (neural superposition, and apposition) and for camera-type eyes. It is possible to extend the theory to other types of eyes, although it would likely require more variables and assumptions/constraints to the theory. It is thus good to introduce the conceptual ideas without overdoing the applications of the theory.

      The paper extends a previous theory, developed by the senior author, that develops performance surfaces for optimal cost/benefit design of eyes. By combining this with resource allocation between receptors and optics, the theoretical understanding of eye design takes a major leap and provides entirely new sets of predictions and explanations for why eyes are built the way they are.

      The paper is well written and even though the theory development in the Results may be difficult to take in for many biologists, the Discussion very nicely lists all the major predictions under separate headings, and here the text is more tuned for readers that are not entirely comfortable with the formalism of the Results section. I must point out though that the Results section is kept exemplarily concise. The figures are excellent and help explain concepts that otherwise may go above the head of many biologists.

      We are heartened by your appreciation of our manuscript - it persuaded us not to undertake extensive revisions – thank you.

      Reviewer #3 (Public Review):

      Summary:

      This is a proposal for a new theory for the geometry of insect eyes. The novel cost-benefit function combines the cost of the optical portion with the photoreceptor portion of the eye. These quantities are put on the same footing using a specific (normalized) volume measure, plus an energy factor for the photoreceptor compartment. An optimal information transmission rate then specifies each parameter and resource allocation ratio for a variable total cost. The elegant treatment allows for comparison across a wide range of species and eye types. Simple eyes are found to be several times more efficient across a range of eye parameters than neural superposition eyes. Some trends in eye parameters can be explained by optimal allocation of resources between the optics and photoreceptors compartments of the eye.

      Strengths:

      Data from a variety of species roughly align with rough trends in the cost analysis, e.g. as a function of expanding the length of the photoreceptor compartment.

      New data could be added to the framework once collected, and many species can be compared.

      Eyes of different shapes are compared.

      Weaknesses:

      Detailed quantitative conclusions are not possible given the approximations and simplifying assumptions in the models and poor accounting for trends in the data across eye types.

      Reviewer #1 (Recommendations For The Authors):

      Figure 1: Panel E defines the parameters described in panel d. Consider swapping the order of those panels (or defining D and Delta Phi in the figure legend for d). Order follows narrative, eye types then match 

      We think that you are referring to Figure 1. We modified the legend.

      Lines 143-145: How does a different relative cost impact your results?

      Thank you for raising this question. Because our assumption that relative costs are the same is our starting point, and for optics it is not an obvious mistake, we do not raise your question here. We address your question where you next raise it because, for photoreceptors, the assumption is obviously wrong. We now emphasise that our method for accounting for photoreceptor energy costs can be applied to other costs.

      Lines 187-190: Same as above - how do your results change if this assumption is not accurate?

      We have revised our manuscript to emphasise that we are dealing with the situation in which our initial assumption (costs per unit volume are equal) breaks down. On lines 203 - 208 we write: “However, this assumption breaks down when we consider specific metabolic rates. To enable and power phototransduction, photoreceptors have an exceptionally high specific metabolic rate (energy consumed per gram, and hence unit volume, per second) (Laughlin et al., 1998; Niven et al., 2007; Pangršič et al., 2005). We account for this extra cost by applying an energy surcharge, S<sub>E</sub>. To equate…”

      We also revised part of the Discussion section, “Specific volume is a useful measure of cost”, to make it clear that we are able to account for situations in which the costs per unit volume are not equal, and we give our treatment of photoreceptor energy costs as an example of how this is done. On lines 626-640 we say:

      “Cost estimates can be adjusted for situations in which costs per unit volume are not equal, as illustrated by our treatment of photoreceptor energy consumption. To support transduction the photoreceptor array has an exceptionally high metabolic rate (Laughlin et al., 1998; Niven et al., 2007; Pangršič et al., 2005). We account for this higher energy cost by using the animal’s specific metabolic rate (power per unit mass and hence power per unit volume) to convert an array’s power consumption into an equivalent volume (Methods). Photoreceptor ion pumps are the major consumers of energy and the smaller contribution of pigmented glia (Coles, 1989) is included in our calculation of the energy tariff K<sub>E</sub> (Methods). The higher costs of materials and their turnover in the photoreceptor array can be added to the energy tariff K<sub>E</sub> but given the magnitude of the light-gated current (Laughlin et al., 1998) the relative increase will be very small. Thus for our intents and purposes the effects of these additional costs are covered by our models. For want of sufficient data…”.

      Reviewer #2 (Recommendations For The Authors):

      A few comments for consideration by the authors:

      (1) In the abstract, maybe give another example explaining why other eyes should be different to those of fast diurnal insects.

      This worthwhile extrapolation is best kept to the Discussion.

      (2) Would it be worthwhile mentioning that the photopigment density is low in rhabdoms compared to vertebrate outer segments? This will have major effects on the relative size of retina and optics.

      Thank you, we now make this good point in the Discussion (lines 698-702).

      (3) It took me a while to understand what you mean by an energy tariff. For the less initiated reader many other variables may be difficult to comprehend. A possible remedy would be to make a table with all variables explained first very briefly in a formal way and then explained again with a few more words for readers less fluent in the formalism.

      A very useful suggestion. We have taken your advice (p.4).

      (4) The "easy explanation" on lines 356-357 need a few more words to be understandable.

      We have expanded this argument and corrected a mistake: the width of the head front to back is not 250 μm, it is 600 μm (lines 402-407).

      (5) Maybe devote a short paragraph in the Discussion to other types of eye, such as optical superposition eyes and pinhole eyes. This could be done very shortly and without formalism. I'm sure the authors already have a good idea of the optimal ratio of receptor arrays and optics in these eye types.

      We do not discuss this because we have not found a full account of the trade-offs and their  effects on costs and benefits. We hope that our analysis of apposition and simple eyes will encourage people to analyse the relationships between costs and benefits in other eye types. To this end we pointed out in the Discussion that recent advances in imaging and modelling could be helpful.

      (6)  Could the sentence on lines 668-671 be made a little clearer?

      “Efficiency is also depressed by increasing the photoreceptor energy tariff K<sub>E</sub>, and in line with the greater impact of photoreceptor energy costs in simple eyes, the reduction in efficiency is much greater in simple eyes (Figure 8b).”

      We replaced this sentence with “In both simple and apposition eyes efficiency is reduced by increasing the photoreceptor energy tariff K<sub>E</sub>. This effect is much greater in simple eyes, thus as found for reductions in photoreceptor length (Figure 7b), K<sub>E</sub> has more impact on the design of simple eyes” (lines 749 – 752).

      (7)  I have some reservations about the text on lines 789-796. The problem is that optics can do very little to improve the performance of a directional photoreceptor where delrho should optimally be very wide. Here, membrane folding is the only efficient way to improve performance (SNR). The option to reduce delrho for better performance comes later when simultaneous spatial resolution (multiple pixels) is introduced.

      Yes, we have been careless. We have rewritten this paragraph to say (lines 920-931)

      “Two key steps in the evolution of eyes were the stacking of photoreceptive membranes to absorb more photons, and the formation of optics to intercept more photons and concentrate them according to angle of incidence to form an image (Nilsson, 2013, 2021). Our modelling of well-developed image forming eyes shows that to improve performance stacked membranes (rhabdomeres) compete with optics for the resources invested in an eye, and this competition profoundly influences both form and function. It is likely that competition between optics and photoreceptors was shaping eyes as lenses evolved to support low resolution spatial vision. Thus the developmental mechanisms that allocate resources within modern high resolution eyes (Casares & MacGregor, 2021), by controlling cell size and shape, and as our study emphasises, gradients in size and shape across an eye, will have analogues or homologues in more ancient eyes. Their discovery….” (lines 920-931).

      Reviewer #3 (Recommendations For The Authors):

      Suggestions for major revisions:

      While the approach is novel and elegant, the results from the analysis of insect morphology do not broadly support the optimization argument and hardly constrain parameters, like the energy tariff value, at all. The most striking result of the paper is the flat plateau in information across a broad range of shape parameters and the length, and resolution trend in Figure 5.

      At no point in the Results and Discussion do we argue that resource allocation is optimized. Indeed, we frequently observe that it is not. Our mistake was to start the Abstract by observing that animals evolve to minimise costs. We have rewritten the Abstract accordingly.

      The information peaks are quite shallow. This might actually be a very important and interesting result in the paper - the fact that the information plateaus could give the insect eye quite a wide range of parameters to slide between while achieving relatively efficient sensing of the environment. Instead of attempting to use a rather ad hoc and poorly supported measure of energetics in PR cost, perhaps the pitch could focus on this flexibility. K<sub>E</sub> does not seem to constrain eye parameters and does not add much to the paper.

      We agree, being able to construct performance surfaces across morphospace is an important advance in the field of eye design and evolution, and the performance surface’s flat top has interesting implications for the evolution of adaptations. Encouraged by your remarks, we have rewritten the Abstract and the introductory paragraph of the Discussion to draw attention to these points. 

      We are disappointed that we failed to convince you that our energy tariff, K<sub>E</sub>, is more than a poorly supported ad hoc parameter that does not add much to the paper. In our opinion a resource allocation model that ignores photoreceptor energy consumption is obviously inadequate because the high energy cost of phototransduction is both well known and considered to be a formative factor in eye evolution (Niven and Laughlin, 2008). One of the advantages of modelling is that one can assess the impact of factors that are known to be present, are thought to be important, but have not been quantified. We followed standard modelling practice by introducing a cost that has the same units as the other costs and, for good physiological reasons, increases linearly with the number of microvilli, according to K<sub>E</sub>. We then vary this unknown cost parameter to discover when and why it is significant. We were pleased to discover that we could combine data on photoreceptor energy demands and whole animal metabolic rates to establish the likely range of K<sub>E</sub>. This procedure enabled us to unify the cost-benefit analyses of optics and photoreceptors, and to discover that realistic values of K<sub>E</sub> have a profound impact on the structure and performance of an efficient eye. We hope that this advance will encourage people to collect the data needed to evaluate K<sub>E</sub>. To emphasise the importance of K<sub>E</sub> and dispel doubts associated with the failure of the model to fit the data, we have revised two sections: “Flies invest efficiently in costly photoreceptor arrays” in the Results, and “How efficiently do insects allocate resources within their apposition eyes?” in the Discussion. These rewrites also explain why it is impossible for us to infer K<sub>E</sub> by adjusting its value so that the model’s predictions fit the data.

      The graphics after Figure 3 are quite dense and hard to follow. None of the plateau extent shown in Fig 3 is carried through to the subsequent plots, which makes the conclusions drawn from these figures very hard to parse. If the peak information occurs on a flat plateau, it would be more helpful to see those ranges of parameters displayed in the figures.

      Ideally one should do as you suggest and plot the extent of the plateau, but in our situation this is not very helpful. In the best data set, flies, optimised models predict D well, get close to ∆φ in larger eyes, and demonstrate that these optimum values are not very sensitive to K<sub>E</sub>. L is a different matter: it is very sensitive to K<sub>E</sub> which, as we show (and frequently remind), is poorly constrained by experimental data. The best we can do is estimate the envelope of L vs C<sub>tot</sub> curves, as defined by a plausible range of K<sub>E</sub>. Because most of the plateau boundaries you ask for will fall within this envelope, plotting them does little to clear the fog of uncertainty. We note that all three referees agree that our model can account for two robust trends, i) in apposition eyes L increases with optical resolving power and acuity, both within individual eyes and among eyes of different sizes, and ii) L is much longer in apposition eyes than in simple eyes. Nonetheless, the scatter of data points and their failure to fit creates a bad impression. We gave a number of reasons why the model does not fit the data points, but these were scattered throughout the Results and Discussion and, as referees 1 and 3 point out, this makes it difficult to draw convincing conclusions. To rectify this failing, we have rewritten two sections, in the Results “Flies invest efficiently in costly photoreceptor arrays” and in the Discussion “How efficiently do insects allocate resources within their apposition eyes?”, to discuss these reasons en bloc, draw conclusions and suggest how better data and refinements to modelling could resolve these issues.

      Throughout the figures, the discontinuities in the optimal cuts through parameter space are not sufficiently explained.

      We added a couple of sentences that address the “jumps” (lines 313 – 318).

      None of the data seems to hug any of the optimal lines and only weakly follow the trends shown in the plots. This makes interpretation difficult for the reader and should be better explained. The text can be a little telegraphic in the Results after roughly page 10, and requires several readings to glean insight into the manuscript's conclusions.

      We revised the Results section in which we compare the best data set, flies’ NS eyes, with theoretical predictions (“Flies invest efficiently in costly photoreceptor arrays”) to expand our interpretation of the data and clarify our arguments. The remaining sections have not been expanded. In the next section, which is on fused rhabdom apposition eyes, our interpretation of the scattering of data points follows the same line of argument. The remaining Results sections are entirely theoretical.

      Overall, the rough conclusions outlined in the Results seem moderately supported by the matches of the data to the optimal information transmission cuts through parameter space, but only weakly.

      We agree, more data is required to test and refine our theoretical predictions.

      The Discussion is long and well-argued, and contains the most cogent writing in the manuscript.

      Thank you: this is most pleasing. We submitted our study to eLife because it allows longer Discussions, but we worried that ours was too long. However, we felt that our extensive Discussion was necessary for two reasons. First, we are introducing a new approach to understanding of eye design and evolution. Second, because the data on eye morphology and costs are limited, we had to make a number of assumptions and by discussing these, warts and all, we hoped to encourage experimentalists to gather more data and focus their efforts on the most revealing material.  

      Minor comments:

      We have acted upon most of your minor comments and we confine our remarks to our disagreements. We are grateful for your attention to details that we should have picked up on.

      It's a more standard convention to say "cost-benefit" rather than with a colon. 

      "equation" should be abbreviated "eq" or "eqn", never with a "t"

      when referring to the work of van Hateren, quote the paper and the database using "van Hateren" not just "Hateren"

      small latex note: use "\textit{SNR}" to get the proper formatting for those letters when in the math environment

      Line 100-110: "f" is introduced, but only f' is referenced in the figure. This should be explained in order. d_rh is not included in the figure. Also in this section, d_rh/f is also referenced before \Delta \rho_rf, which is the same quantity, without explanation.  

      Figure 1 shows eye structure and geometry. f’ is a lineal dimension of the eye but f is not, so f is not shown in Fig 1e. We eliminated the confusion surrounding ∆ρ<sub>rh</sub>  by deleting “and changing the acceptance angle of the photoreceptive waveguide ∆ρ<sub>rh</sub> (Snyder, 1979)”.  

      Fig 1 caption: this says "From dorsal to ventral," then describes trends that run ventral to dorsal, which is a confusing typo.

      Fig 3 - adding some data points to these plots might help the reader understand how (or if) K_E is constrained by the data.

      It is not possible to add data points because the total cost, C<sub>tot</sub>, is unknown.

      Fig 4c (and in other subplots): the jumps in L with C_tot could be explained better in the text - it wasn't clear to this reviewer why there are these discontinuities.

      Dealt with in the revised text (lines 310-318).

      Fig 4d: The caption for this subplot could be more clearly written.

      We have rewritten the subscript for subplot 4d.

      Fig 5 and other plots with data: please indicate which symbols are samples from the same species. This info is hard to reconstruct from the tables.

      We have revised Figure 5 accordingly. Species were already indicated in Figure 6.

      Line 328: missing equation number

    1. Reviewer #3 (Public review):

      Summary:

      The manuscript by Qiu and co-workers describes the single-particle cryo-electron microscopy structures of various oligomeric states of the orphan GPCR, GPR3. It describes the monomeric and dimeric structure of a mutant of GPR3 with a modified G-protein complex (miniGs) and then builds on this work to attempt an inactive 'apo' dimer and an allosteric modulator (AF) bound dimer structure, by using an ICL3 insertion and stabilizing FAB fragments.

      In general, I'm supportive of the work done in this study, and it does indeed provide valuable insight into GPR3 function. It may be that dimerization of certain class A GPCRs may be a means of signalling regulation or perhaps even amplification. However, some of the interpretation of the single particle data needs some extra attention to strengthen the hypothesis presented in the manuscript.

      Firstly, I want to thank the authors for providing the unfiltered half-maps and PDB models for careful assessment. During this review, I did my own post-processing of the half-maps and used the resultant maps for careful analysis of models.

      So to begin, I understand that the authors didn't model any lipid in the orthosteric binding site in any of the maps, but it may be worthwhile to model something in there, as many readers only download coordinates and not the maps.

      A more general point about all the maps. In no case were any focussed refinements carried out. As the point of this paper is some of the finer details between active and intermediate states and the effect of an allosteric modulator, masking out hypervariable portions of the structure and doing local Euler searches would most certainly provide richer insights into the details in GPR3 (especially as the BRIL:Fab structures are not of interest). And also, generally, no 3D-variability studies were performed to see if minor differences in, say, TM4/5/6 positions were due to large variation in the single particles or were a stable consensus position.

      As for the PFK dimeric structure. It appears to be refined with C2 point group symmetry (which is not mentioned anywhere except in a tiny bit of text in a supplemental figure). Was this also calculated in C1 to assess if there is any difference in either GPR3 protomer? Also, how certain are the authors of the cholesterol positions at the bottom of TM4/5? At lower map thresholds in the PFK dimer structure, one of them appears to be continuous with the orthosteric lipid. It also appears that there are many unmodelled lipids in this structure, and only two were assigned as cholesterol. It appears that many of the unmodelled lipids are forming bridging connections between the GPR3 protomers. Also, it may be worthwhile to provide a table of the key interactions between the protomers (although I note that there was a figure highlighting them).

      With the PFK monomer structure, there was weak density for the same cholesterol, which was not modelled in this one; perhaps some commentary on the authors' approach for deciding how to assign density would be helpful. It also appears that the refinement mask was probably a bit tight in this one (something that cryoSPARC is notorious for), and rerefining with a much looser mask around the TM domain may be helpful in resolving the inner lipid leaflet positions.

      The Apo structure, I think, I have the most issues with. Firstly, it is not 'apo'. There is definitely unaccounted-for density in the orthosteric site. Also, the structure definitely needs a bit more attention. Firstly, masking out the BRIL and FABs would be a good start in helping better resolve the TMD regions, and then even focussing on a single monomer to increase the map interpretability. My major problem here is that, if this is being called 'apo' and inactive, the map doesn't reflect this; also, the TM5/6 does not look to be in a fully inactive position. The map density (at least around one of the protomers) in this region looks to be poorly resolved, most likely due to averaging due to internal motion. I think some 3DVA is certainly warranted here to strengthen the hypothesis that they have solved an 'apo' inactive state.

      The AF (allosteric modulator) bound structure is of significantly better quality. But again, only AF is modelled, and no lipids are. How are the authors sure? Perhaps some focussed refinements (and changing the Euler Origin to centre it on the AF molecule could be a good start). To this reviewer, at least in one of the protomers, adjacent to the AF position, there is a density that looks very much like the allosteric modulator, so it could even be forming a bridging dimer. Also, some potential assignments of the lipids may enlighten some of the structure-activity relationship of this modulator, as it seems to make as many contacts with surrounding lipids as it does with TM4/5. Also, it may be worthwhile exploring carefully the 3DVA of this data. In our studies (Russel et al.), we noted that the orthosteric lipid appears to ratchet back-and-forth in concert with TM4/5 twisting. Perhaps in the AF bound structure, as it binds at the 'exit' site of the lipid, it is locking in a specific conformation.

    1. Reviewer #3 (Public review):

      Summary:

      In their study McDermott et al. investigate the neurocomputational mechanism underlying sensory prediction errors. They contrast two accounts: representational sharpening and dampening. Representational sharpening suggests that predictions increase the fidelity of the neural representations of expected inputs, while representational dampening suggests the opposite (decreased fidelity for expected stimuli). The authors performed decoding analyses on EEG data, showing that first expected stimuli could be better decoded (sharpening), followed by a reversal during later response windows where unexpected inputs could be better decoded (dampening). These results are interpreted in the context of opposing process theory (OPT), which suggests that such a reversal would support perception to be both veridical (i.e., initial sharpening to increase the accuracy of perception) and informative (i.e., later dampening to highlight surprising, but informative inputs).

      Strengths:

      The topic of the present study is of significant relevance for the field of predictive processing. The experimental paradigm used by McDermott et al. is well designed, allowing the authors to avoid several common confounds in investigating predictions, such as stimulus familiarity and adaptation. The introduction of the manuscript provides a well written summary of the main arguments for the two accounts of interest (sharpening and dampening), as well as OPT. Overall, the manuscript serves as a good overview of the current state of the field.

      Weaknesses:

      In my opinion some details of the methods, results and manuscript raise some doubts about the reliability of the reported findings. Key concerns are:

      (1) In the previous round of comments, I noted that: "I am not fully convinced that Figures 3A/B and the associated results support the idea that early learning stages result in dampening and later stages in sharpening. The inference made requires, in my opinion, not only a significant effect in one time bin and the absence of an effect in other bins. Instead to reliably make this inference one would need a contrast showing a difference in decoding accuracy between bins, or ideally an analysis not contingent on seemingly arbitrary binning of data, but a decrease (or increase) in the slope of the decoding accuracy across trials. Moreover, the decoding analyses seem to be at the edge of SNR, hence making any interpretation that depends on the absence of an effect in some bins yet more problematic and implausible". The authors responded: "we fitted a logarithmic model to quantify the change of the decoding benefit over trials, then found the trial index for which the change of the logarithmic fit was < 0.1%. Given the results of this analysis and to ensure a sufficient number of trials, we focused our further analyses on bins 1-2". However, I do not see how this new analysis addresses the concern that the conclusion highlights differences in decoding performance between bins 1 and 2, yet no contrast between these bins is performed. While I appreciate the addition of the new model, in my current understanding it does not solve the problem I raised. I still believe that if the authors wish to conclude that an effect differs between two bins they must contrast these directly and/or use a different appropriate analysis approach.

      Relatedly, the logarithmic model fitting and how it justifies the focus on analysis bin 1-2 needs to be explained better, especially the rationale of the analysis, the choice of parameters (e.g., why logarithmic, why change of logarithmic fit < 0.1% as criterion, etc), and why certain inferences follow from this analysis. Also, the reporting of the associated results seems rather sparse in the current iteration of the manuscript.
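
      To illustrate the kind of detail that would need to be spelled out, the following is a minimal sketch of one possible reading of the criterion quoted above. It is not the authors' analysis: the model form (value = a + b·ln(trial)), the definition of "change" as the relative change of the fitted curve between consecutive trials, the function names, and the synthetic noise-free input are all assumptions made for illustration.

```python
import math

def fit_log_model(trials, values):
    """Ordinary least-squares fit of values ~ a + b * ln(trial)."""
    xs = [math.log(t) for t in trials]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def relative_change(a, b, t):
    """Relative change of the fitted curve from trial t to trial t + 1 (assumed definition)."""
    current = a + b * math.log(t)
    following = a + b * math.log(t + 1)
    return abs(following - current) / abs(current)

def first_trial_below(a, b, threshold=0.001, max_trial=10_000):
    """First trial index at which the relative change drops below the threshold (0.1% by default)."""
    for t in range(1, max_trial):
        if relative_change(a, b, t) < threshold:
            return t
    return None

# Synthetic, noise-free values for illustration only (not data from the study).
trials = list(range(1, 201))
values = [0.5 + 0.02 * math.log(t) for t in trials]

a, b = fit_log_model(trials, values)
plateau_trial = first_trial_below(a, b)
print(a, b, plateau_trial)
```

      Under this reading, the resulting trial index depends on both the offset and the slope of the fit as well as on the chosen threshold, which is one reason the rationale for each choice should be reported.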

      (2) A critical point the authors raise is that they investigate the buildup of expectations during training. They go on to show that the dampening effect disappears quickly, concluding: "the decoding benefit of invalid predictions [...] disappeared after approximately 15 minutes (or 50 trials per condition)". Maybe the authors can correct me, but my best understanding is as follows: Each bin has 50 trials per condition. The 2:1 condition has 4 leading images, this would mean ~12 trials per leading stimulus, 25% of which are unexpected, so ~9 expected trials per pair. Bin 1 represents the first time the participants see the associations. Therefore, the conclusion is that participants learn the associations so rapidly that ~9 expected trials per pair suffice to not only learn the expectations (in a probabilistic context) but learn them sufficiently well such that they result in a significant decoding difference in that same bin. If so, this would seem surprisingly fast, given that participants learn by means of incidental statistical learning (i.e. they were not informed about the statistical regularities). I acknowledge that we do not know how quickly the dampening/sharpening effects develop, however surprising results should be accompanied with a critical evaluation and exceptionally strong evidence (see point 1). Consider for example the following alternative account to explain these results. Category pairs were fixed across and within participants, i.e. the same leading image categories always predicted the same trailing image categories for all participants. Some category pairings will necessarily result in a larger representational overlap (i.e., visual similarity, etc.) and hence differences in decoding accuracy due to adaptation and related effects. 
For example, house  barn will result in a different decoding performance compared to coffee cup  barn, simply due to the larger visual and semantic similarity between house and barn compared to coffee cup and barn. These effects should occur upon first stimulus presentation, independent of statistical learning, and may attenuate over time e.g., due to increasing familiarity with the categories (i.e., an overall attenuation leading to smaller between condition differences) or pairs.

      (3) In response to my previous comment, why the authors think their study may have found different results compared to multiple previous studies (e.g. Han et al., 2019; Kumar et al., 2017; Meyer and Olson, 2011), particularly the sharpening to dampening switch, the authors emphasize the use of non-repeated stimuli (no repetition suppression and no familiarity confound) in their design. However, I fail to see how familiarity or RS could account for the absence of sharpening/dampening inversion in previous studies.

      First, if the authors' argument is about stimulus novelty and familiarity as described by Feuerriegel et al., 2021, I believe this point does not apply to the cited studies. Feuerriegel et al., 2021 note: "Relative stimulus novelty can be an important confound in situations where expected stimulus identities are presented often within an experiment, but neutral or surprising stimuli are presented only rarely", which indeed is a critical confound. However, none of the studies (Han et al., 2019; Richter et al., 2018; Kumar et al., 2017; Meyer and Olson, 2011) contained this confound, because all stimuli served as expected and unexpected stimuli, with the expectation status solely determined by the preceding cue. Thus, participants were equally familiar with the images across expectation conditions.

      Second, for a similar reason the authors' argument for RS accounting for the different results does not hold either in my opinion. Again, as Feuerriegel et al. 2021 correctly point out: "Adaptation-related effects can mimic ES when the expected stimuli are a repetition of the last-seen stimulus or have been encountered more recently than stimuli in neutral expectation conditions." However, it is critical to consider the precise design of previous studies. Take again the example of Han et al., 2019; Kumar et al., 2017; Meyer and Olson, 2011. To my knowledge none of these studies contained manipulations that would result in a more frequent or recent repetition of any specific stimulus in the expected compared to unexpected condition. The crucial manipulation in all these previous studies is not that a single stimulus or stimulus feature (which could be subject to familiarity or RS) determines the expectation status, but rather the transitional probability (i.e. cue-stimulus pairing) of a particular stimulus given the cue. Therefore, unless I am missing something critical, simple RS seems unlikely to differ between expectation conditions in the previous studies and hence seems implausible to account for differences in results compared to the current study.

      Moreover, studies cited by the authors (e.g. Todorovic & de Lange, 2012) showed that RS and ES are separable in time, again making me wonder how avoiding stimulus repetition should account for the difference in the present study compared to previous ones. I am happy to be corrected in my understanding, but with the currently provided arguments by the authors I do not see how RS and familiarity can account for the discrepancy in results.

      I agree with the authors that stimulus familiarity is a clear difference compared to previous designs, but without a valid explanation why this should affect results I find this account rather unsatisfying. I see the key difference in that the authors manipulated category predictability, instead of exemplar prediction - i.e. searching for a car instead of your car. However, if results in support of OPT would indeed depend on using novel images (i.e. without stimulus repetition), would this not severely limit the scope of the account and hence also its relevance? Certainly, the account provided by the authors casts the net wider and tries to explain visual prediction. Relatedly, if OPT only applies during training, as the authors seem to argue, would this again not significantly narrow the scope of the theory? Combined, these two caveats would seem to demote the account from a general account of prediction and perception to one about perception during very specific circumstances. In my understanding the appeal of OPT is that it accounts for multiple challenges faced by the perceptual system, elegantly integrating them into a cohesive framework. Most of this would be lost by claiming that OPT's primary prediction would only apply to specific circumstances - novel stimuli during learning of predictions. Moreover, in the original formulation of the account, as outlined by Press et al., I do not see any particular reason why it should be limited to these specific circumstances. This does of course not mean that the present results are incorrect, however it does require an adequate discussion and acknowledgement in the manuscript.

      Impact:

      McDermott et al. present an interesting study with potentially impactful results. However, given my concerns raised in this and the previous round of comments, I am not entirely convinced of the reliability of the results. Moreover, the difficulty of reconciling some of the present results with previous studies highlights the need for more convincing explanations of these discrepancies and a stronger discussion of the present results in the context of the literature.

    2. Author response:

      The following is the authors’ response to the original reviews.

      Public reviews:

      Reviewer 1 (Public Review):

      Many thanks for the positive and constructive feedback on the manuscript.

      This study reveals a great deal about how certain neural representations are altered by expectation and learning on shorter and longer timescales, so I am loath to describe certain limitations as 'weaknesses'. But one limitation inherent in this experimental design is that, by focusing on implicit, task-irrelevant predictions, there is not much opportunity to connect the predictive influences seen at the neural level to the perceptual performance itself (e.g., how participants make perceptual decisions about expected or unexpected events, or how these events are detected or appear).

      Thank you for the interesting comment. We now discuss the limitation of task-irrelevant prediction. In brief, some studies which showed sharpening found that task demands were relevant, while some studies which showed dampening were based on task-irrelevant predictions, but it is unlikely that task relevance - which was not manipulated in the current study - would explain the switch between sharpening and dampening that we observe within and across trials.

      The behavioural data that is displayed (from a post-recording behavioural session) shows that these predictions do influence perceptual choice - leading to faster reaction times when expectations are valid. In broad strokes, we may think that such a result is broadly consistent with a 'sharpening' view of perceptual prediction, and the fact that sharpening effects are found in the study to be larger at the end of the task than at the beginning. But it strikes me that the strongest test of the relevance of these (very interesting) EEG findings would be some evidence that the neural effects relate to behavioural influences (e.g., are participants actually more behaviourally sensitive to invalid signals in earlier phases of the experiment, given that this is where the neural effects show the most 'dampening' a.k.a., prediction error advantage?)

      Thank you for the suggestion. We calculated Pearson’s correlation coefficients for behavioural responses (difference in mean reaction times), neural responses during the sharpening effect (difference in decoding accuracy), and neural responses during the dampening effect for each participant, which resulted in null findings.
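
      The brain-behaviour correlation described in this response can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `pearson_r` and the per-participant numbers are hypothetical, and the sign conventions of the difference scores (invalid minus valid reaction time, valid minus invalid decoding accuracy) are assumptions.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / denom

# Hypothetical per-participant effects (illustrative numbers, not study data).
rt_benefit = [12.0, 30.5, -4.0, 18.2, 7.1, 22.4]          # invalid minus valid RT, in ms (assumed)
sharpening = [0.021, 0.008, 0.015, -0.003, 0.012, 0.019]  # valid minus invalid accuracy (assumed)

r = pearson_r(rt_benefit, sharpening)
print(f"r = {r:.3f}")
```

      One value per participant enters each vector, so the test asks whether participants with a larger behavioural validity effect also show a larger neural effect. A null result at the sample sizes mentioned in the reviews would not by itself rule out a modest relationship.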

      Reviewer 2 (Public Review):

      Thank you for your helpful and constructive comments on the manuscript.

      The strength in controlling for repetition effects by introducing a neutral (50% expectation) condition also adds a weakness to the current version of the manuscript, as this neutral condition is not integrated into the behavioral (reaction times) and EEG (ERP and decoding) analyses. This procedure remained unclear to me. The reported results would be strengthened by showing differences between the neutral and expected (valid) conditions on the behavioral and neural levels. This would also provide a more rigorous check that participants had implicitly learned the associations between the picture category pairings.

      Following the reviewer's suggestion, we have included the neutral condition in the behavioural analysis and performed a repeated measures ANOVA on all three conditions.

      It is not entirely clear to me what is actually decoded in the prediction condition and why the authors did not perform decoding over trial bins in prediction decoding as potential differences across time could be hidden by averaging the data. The manuscript would generally benefit from a more detailed description of the analysis rationale and methods.

      In the original version of the manuscript, prediction decoding aimed at testing if the upcoming stimulus category can be decoded from the response to the preceding (leading) stimulus. However, in response to the other Reviewers’ comments, we have decided to remove the prediction decoding analysis from the revised manuscript, as it is now apparent that prediction decoding cannot be separated from category decoding based on pixel information.

      Finally, the scope of this study should be limited to expectation suppression in visual perception, as the generalization of these results to other sensory modalities or to the action domain remains open for future research.

      We have clarified the scope of the study in the revised manuscript.

      Reviewer 3 (Public Review):

      Thank you for the thought-provoking and interesting comments and suggestions.

      (1) The results in Figure 2C seem to show that the leading image itself can only be decoded with ~33% accuracy (25% chance; i.e. ~8% above chance decoding). In contrast, Figure 2E suggests the prediction (surprisingly, valid or invalid) during the leading image presentation can be decoded with ~62% accuracy (50% chance; i.e. ~12% above chance decoding). Unless I am misinterpreting the analyses, it seems implausible to me that a prediction, but not actually shown image, can be better decoded using EEG than an image that is presented on-screen.

      Following this and the remaining comments by the Reviewer (see below), we have decided to remove the prediction analysis from the manuscript. Specifically, we have focused on the Reviewer’s concern that it is implausible that image prediction would be better decoded than an image that is presented on-screen. This led us to perform a control analysis, in which we tried to decode the leading image category based on pixel values alone (rather than on EEG responses). Since this decoding was above chance, we could not rule out the possibility that EEG responses to leading images reflect physical differences between image categories. This issue does not extend to trailing images, as the results of the decoding analysis based on trailing images are based on accuracy comparisons between valid and invalid trials, and thus image features are counterbalanced. We would like to thank the Reviewer for raising this issue.
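
      The logic of the pixel-based control can be illustrated with a toy example. The sketch below is not the authors' pipeline: a leave-one-out nearest-centroid classifier stands in for whatever classifier they used, the "images" are synthetic pixel vectors, and all names (`nearest_centroid_loo_accuracy`, `make_images`) are hypothetical. It shows only that categories which differ in low-level image statistics are decodable from pixels alone.

```python
import random

def nearest_centroid_loo_accuracy(features, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for held_out in range(len(features)):
        # Build class centroids from every image except the held-out one.
        sums, counts = {}, {}
        for i, (vec, lab) in enumerate(zip(features, labels)):
            if i == held_out:
                continue
            acc = sums.setdefault(lab, [0.0] * len(vec))
            for j, v in enumerate(vec):
                acc[j] += v
            counts[lab] = counts.get(lab, 0) + 1
        centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}
        # Assign the held-out image to the closest centroid (squared Euclidean distance).
        test_vec = features[held_out]
        predicted = min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(test_vec, centroids[lab])),
        )
        correct += predicted == labels[held_out]
    return correct / len(features)

def make_images(n_per_class, n_pixels, class_offsets, rng):
    """Toy 'images': Gaussian pixel noise plus a per-category brightness offset."""
    features, labels = [], []
    for lab, offset in enumerate(class_offsets):
        for _ in range(n_per_class):
            features.append([rng.gauss(offset, 1.0) for _ in range(n_pixels)])
            labels.append(lab)
    return features, labels

rng = random.Random(0)
# Two hypothetical categories that differ physically (here, in mean brightness).
X_diff, y_diff = make_images(20, 16, [0.0, 2.0], rng)
acc_confounded = nearest_centroid_loo_accuracy(X_diff, y_diff)
print(f"pixel-only decoding accuracy: {acc_confounded:.2f} (chance = 0.50)")
```

      If pixel-only decoding is above chance, above-chance EEG decoding of the same labels cannot be attributed to prediction, because the EEG signal may simply track the physical differences between categories.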

      (2) The "prediction decoding" analysis is described by the authors as "decoding the predictable trailing images based on the leading images". How this was done is however unclear to me. For each leading image decoding the predictable trailing images should be equivalent to decoding validity (as there were only 2 possible trailing image categories: 1 valid, 1 invalid). How is it then possible that the analysis is performed separately for valid and invalid trials? If the authors simply decode which leading image category was shown, but combine L1+L2 and L4+L5 into one class respectively, the resulting decoder would in my opinion not decode prediction, but instead dissociate the representation of L1+L2 from L4+L5, which may also explain why the time-course of the prediction peaks during the leading image stimulus-response, which is rather different compared to previous studies decoding predictions (e.g. Kok et al. 2017). Instead for the prediction analysis to be informative about the prediction, the decoder ought to decode the representation of the trailing image during the leading image and inter-stimulus interval. Therefore I am at present not convinced that the utilized analysis approach is informative about predictions.

      In this analysis, we attempted to decode (from the response to leading images) which trailing categories ought to be presented. The analysis was split between trials where the expected category was indeed presented (valid) vs. those in which it was not (invalid). The separation of valid vs invalid trials in the prediction decoding analysis served as a sanity check as no information about trial validity was yet available to participants. However, as mentioned above, we have decided to remove the “prediction decoding” analysis based on leading images as we cannot disentangle prediction decoding from category decoding.

      (3) I may be misunderstanding the reported statistics or analyses, but it seems unlikely that >10 of the reported contrasts have the exact same statistic of Tmax = 2.76. Similarly, it seems implausible, based on visual inspection of Figure 2, that the Tmax for the invalid condition decoding (reported as Tmax = 14.903) is substantially larger than for the valid condition decoding (reported as Tmax = 2.76), even though the valid condition appears to have superior peak decoding performance. Combined these details may raise concerns about the reliability of the reported statistics.

      Thank you for bringing this to our attention. This copy error has now been rectified.

      (4) The reported analyses and results do not seem to support the conclusion of early learning resulting in dampening and later stages in sharpening. Specifically, the authors appear to base this conclusion on the absence of a decoding effect in some time-bins, while in my opinion a contrast between time-bins, showing a difference in decoding accuracy, is required. Or better yet, a non-zero slope of decoding accuracy over time should be shown (not contingent on post-hoc and seemingly arbitrary binning).

      Thank you for the helpful suggestion. We have performed an additional analysis to address this issue. We calculated the trial-by-trial time-series of the decoding accuracy benefit for valid vs. invalid trials for each participant and averaged this benefit across time points for each of the two significant time windows. Based on this, we fitted a logarithmic model to quantify the change of this benefit over trials, then found the trial index for which the change of the logarithmic fit was < 0.1% (i.e., accuracy had stabilized). Given the results of this analysis and to ensure a sufficient number of trials, we focussed our further analyses on bins 1-2 to directly assess the effects of learning. This is explained in more detail in the revised manuscript.
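
      The learning-curve step described above can be sketched as follows. This is one possible reading rather than the authors' implementation: the model form `a + b * ln(t)`, the definition of "change" as the trial-to-trial step relative to the current fitted value, and the names `fit_log_curve` and `stabilisation_trial` are all assumptions, and the example series is synthetic.

```python
import math

def fit_log_curve(values):
    """Least-squares fit of values[t-1] ~ a + b * ln(t) for trials t = 1..n."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two trials to fit a curve")
    xs = [math.log(t) for t in range(1, n + 1)]
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def stabilisation_trial(a, b, n_trials, tol=0.001):
    """First trial t whose fitted value changes by less than tol (relative) at t + 1."""
    for t in range(1, n_trials):
        current = a + b * math.log(t)
        step = b * math.log((t + 1) / t)
        if current != 0 and abs(step) < tol * abs(current):
            return t
    return None  # the fit never flattens within the session

# Synthetic, noise-free decoding benefit over 400 trials (illustrative only).
benefit = [0.02 + 0.01 * math.log(t) for t in range(1, 401)]
a_hat, b_hat = fit_log_curve(benefit)
t_stable = stabilisation_trial(a_hat, b_hat, len(benefit))
print(f"a = {a_hat:.4f}, b = {b_hat:.4f}, stabilises at trial {t_stable}")
```

      Under this reading the stabilisation point depends on the ratio of intercept to slope, so it is worth reporting exactly how the 0.1% criterion was defined. Note that the fitted slope `b` is itself a direct estimate of change over trials, which is what Reviewer 3 asks for in comment (4).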

      (5) The present results both within and across trials are difficult to reconcile with previous studies using MEG (Kok et al., 2017; Han et al., 2019), single-unit and multi-unit recordings (Kumar et al., 2017; Meyer & Olson 2011), as well as fMRI (Richter et al., 2018), which investigated similar questions but yielded different results; i.e., no reversal within or across trials, as well as dampening effects after more training. The authors do not provide a convincing explanation as to why their results should differ from previous studies, arguably further compounding doubts about the present results raised by the methods and results concerns noted above.

      The discussion of these findings has been expanded in the revised manuscript. In short, the experimental design of the above studies did not allow for an assessment of these effects prior to learning. Several of them also used repeated stimuli (albeit some studies changed the pairings of stimuli between trials), potentially allowing for RS to confound their results.

      Recommendations for the Authors:

      Reviewer 1 (Recommendations for the authors):

      (1) On a first read, I was initially very confused by the statement on p.7 that each stimulus was only presented once - as I couldn't then work out how expectations were supposed to be learned! It became clear after reading the Methods that expectations are formed at the level of stimulus category (so categories are repeated multiple times even if exemplars are not). I suspect other readers could have a similar confusion, so it would be helpful if the description of the task in the 'Results' section (e.g., around p.7) was more explicit about the way that expectations were generated, and the (very large) stimulus set that examples are being drawn from.

      Following your suggestion, we have clarified the paradigm by adding details about the categories and the manner in which expectations are formed.

      (2) p.23: the authors write that their 1D decoding images were "subjected to statistical inference amounting to a paired t-test between valid and invalid categories". What is meant by 'amounting to' here? Was it a paired t-test or something statistically equivalent? If so, I would just say 'subjected to a paired t-test' to avoid any confusion, or explain explicitly what the statistical inference was performed over.

      We have rephrased this as “subjected to (1) a one-sample t-test against chance-level, equivalent to a fixed-effects analysis, and (2) a paired t-test”.

      Relatedly, this description of an analysis amounting to a 'paired t-test' only seems relevant for the sensory decoding and memory decoding analyses (where there are validity effects) rather than the prediction decoding analysis. As far as I can tell the important thing is that the expected image category can be decoded, not that it can be decoded better or worse on valid or invalid trials.

      In the previous version of the manuscript, the comparison of prediction decoding between valid and invalid trials was meant as a sanity check. However, in response to the other Reviewers’ comments we have decided to remove the prediction decoding analysis from the revised manuscript due to confounds.

      It would be helpful if authors could say a bit more about how the statistical inferences were done for the prediction decoding analyses and the 'condition against baseline' contrasts (e.g., when it is stated that decoding accuracy in valid trials, in general, is above 0 at some cluster-wise corrected value). My guess is that this amounts to something like a one-sample t-test - but it may be worth noting that one-sample t-tests on information measures like decoding accuracy cannot support population-level inference, because these measures cannot meaningfully be below 0 (see Allefeld et al, 2016).

      When testing for decoding accuracy against baseline, we used one-sample t-tests against chance level (rather than against 0) throughout the manuscript. We now clarify in the manuscript that this corresponds to a fixed-effects analysis (Allefeld et al., 2016). In contrast, when testing for differences in decoding accuracy between valid and invalid conditions, we used paired-sample t-tests. As mentioned above, the prediction decoding analysis has been removed from the analysis.
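
      The two tests distinguished in this response can be sketched as below. This is a minimal illustration computing only the t statistics at a single time point; it omits the cluster-level correction across time points that the reported pFWE values imply. The function names, the chance level of 0.25 (taken from the 25% chance level the reviewers mention for four categories), and the accuracy values are assumptions for illustration.

```python
import math

def one_sample_t(values, popmean):
    """t statistic for H0: mean(values) == popmean, with n - 1 degrees of freedom."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # unbiased sample variance
    return (mean - popmean) / math.sqrt(var / n)

def paired_t(first, second):
    """Paired t statistic: a one-sample test of within-participant differences against 0."""
    diffs = [a - b for a, b in zip(first, second)]
    return one_sample_t(diffs, 0.0)

CHANCE = 0.25  # assumed chance level for four categories (illustrative)

# Hypothetical per-participant decoding accuracies at one time point.
valid = [0.31, 0.29, 0.33, 0.27, 0.30]
invalid = [0.28, 0.27, 0.30, 0.26, 0.29]

t_valid_vs_chance = one_sample_t(valid, CHANCE)   # condition against chance
t_valid_vs_invalid = paired_t(valid, invalid)     # valid against invalid
print(f"vs chance: t = {t_valid_vs_chance:.2f}; valid vs invalid: t = {t_valid_vs_invalid:.2f}")
```

      The against-chance test is the one affected by the point raised via Allefeld et al. (2016): true accuracy cannot fall below chance, so a significant group mean supports only a fixed-effects conclusion. The paired comparison of valid and invalid trials is a difference score that can take either sign.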

      (3) By design, the researchers focus on implicit predictive learning which means the expectations being formed are (by definition) task-irrelevant. I thought it could be interesting if the authors might speculate in the discussion on how they think their results may or may not differ when predictions are deployed in task-relevant scenarios - particularly given that some studies have found sharpening effects do not seem to depend on task demands (e.g., Kok et al, 2012; Yon et al, 2018) while other studies have found that some dampening effects do seem to depend on what the observer is attending to (e.g., Richter et al, 2018). Do these results hint at a possible explanation for why this might be? Even if the authors think they don't, it might be helpful to say so!

      Thank you for the interesting comment. We have expanded on this in the revised manuscript.

      Reviewer 2 (Recommendations for the authors):

      Methods/results

      (1) The goal of this study is the assessment of expectation effects during statistical learning while controlling for repetition effects, one of the common confounds in prediction suppression studies (see, Feuerriegel et al., 2021). I agree that this is an important aspect and I assume that this was the reason why the authors introduced the P=0.5 neutral condition (Figure 1B, L3). However, I completely missed the analyses of this condition in the manuscript. In the figure caption of Figure 1C, it is stated that the reaction times of the valid, invalid, and neutral conditions are shown, but only data from the valid and invalid conditions are depicted. To ensure that participants had built up expectations and had learned the pairing, one would not only expect a difference between the valid and invalid conditions but also between the valid and neutral conditions. Moreover, it would also be important to integrate the neutral condition in the multivariate EEG analysis to actually control for repetition effects. Instead, the authors constructed another control condition based on the arbitrary pairings. But why was the neutral condition not compared to the valid and invalid prediction decoding results? Besides this, I also suggest calculating the ERP for the neutral condition and adding it to Figure 2A to provide a more complete picture.

      As mentioned above, we have included the neutral condition in the behavioural analysis, as outlined in the revised manuscript. We have also included a repeated measures ANOVA on all 3 conditions. The purpose of the neutral condition was not to avoid RS, but rather to provide a control condition. We avoided repetition by using individual, categorised stimuli. Figure 1C has been amended to include the neutral condition. In response to the remaining comments, we have decided to remove the prediction decoding analysis from the manuscript.

      (2) One of the main results that is taken as evidence for the OPT is that there is higher decoding accuracy for valid trials (indicate sharpening) early in the trial and higher decoding accuracy for invalid trials (indicate dampening) later in the trial. I would have expected this result for prediction decoding that surprisingly showed none of the two effects. Instead, the result pattern occurred in sensory decoding only, and partly (early sharpening) in memory decoding. How do the authors explain these results? Additionally, I would have expected similar results in the ERP; however, only the early effect was observed. I missed a more thorough discussion of this rather complex result pattern. The lack of the opposing effect in prediction decoding limits the overall conclusion that needs to be revised accordingly.

      Since sharpening vs. dampening rests on the comparison between valid and invalid trials, evidence for sharpening vs. dampening could only be obtained from decoding based on responses to trailing images. In prediction decoding (removed from the current version), information about the validity of the trial is not yet available. Thus, our original plan was to compare this analysis with the effects of validity on the decoding of trailing images (i.e. we expected valid trials to be decoded more accurately after the trailing image than before). The results of the memory decoding did mirror the sensory decoding of the trailing image in that we found significantly higher decoding accuracy of the valid trials from 123-180 ms. As with the sensory decoding, there was a tendency towards a later flip (280-296 ms) where decoding accuracy of invalid trials became nominally higher, but this effect did not reach statistical significance in the memory decoding.

      (3) To increase the comprehensibility of the result pattern, it would be helpful for the reader to clearly state the hypotheses for the ERP and multivariate EEG analyses. What did you expect for the separate decoding analyses? How should the results of different decoding analyses differ and why? Which result pattern would (partly, or not) support the OPT?

      Our hypotheses are now stated in the revised manuscript.

      (4) I was wondering why the authors did not test for changes during learning for prediction decoding. Despite the fact that there were no significant differences between valid and invalid conditions within-trial, differences could still emerge when the data set is separated into bins. Please test and report the results.

      As mentioned above, we have decided to remove the prediction decoding analysis from the current version of the manuscript.

      (5) To assess the effect of learning the authors write: 'Given the apparent consistency of bins 2-4, we focused our analyses on bins 1-2.' Please explain what you mean by 'apparent consistency'. Did you test for consistency or is it based on descriptive results? Why do the authors not provide the complete picture and perform the analyses for all bins? This would allow for a better assessment of changes over time between valid and invalid conditions. In Figure 3, were valid and invalid trials different in any of the QT3 or QT4 bins in sensory or memory encoding?

      We have performed an additional analysis to address this issue. The reasoning behind the decision to focus on bins 1-2 is now explained in the revised manuscript. In short, fitting a learning curve to trial-by-trial decoding estimates indicates that decoding stabilizes within <50% of the trials. To quantify changes in decoding occurring within these <50% of the trials while ensuring a sufficient number of trials for statistical comparisons, we decided to focus on bins 1-2 only.

      (6) Please provide the effect size for all statistical tests.

      Effect sizes have now been provided.

      (7) Please provide exact p-values for non-significant results and significant results larger than 0.001.

      Exact p-values have now been provided.

      (8) Decoding analyses: I suppose there is a copy/paste error in the T-values as nearly all T-values on pages 11 and 12 are identical (2.76) leading to highly significant p-values (0.001) as well as non-significant effects (>0.05). Please check.

      Thank you for bringing this to our attention. This error has now been corrected.

      (9) Page 12: There were some misleading phrases in the result section. To give one example: 'control analyses was slightly above change' - this sounds like a close to non-significant effect, but it was indeed a highly significant effect of p<0.001. Please revise.

      This phrase was part of the prediction decoding analysis and has therefore been removed.

      (10) Sample size: How was the sample size of the study determined (N=31)? Why did only a subgroup of participants perform the behavioral categorization task after the EEG recording? With a larger sample, it would have been interesting to test if participants who showed better learning (larger difference in reaction times between valid and invalid conditions) also showed higher decoding accuracies.

      This has been clarified in the revised manuscript. In short, the larger sample size of N=31 was based on previous research; ten participants were initially tested as part of a pilot which was then expanded to include the categorisation task.

      (11) I assume catch trials were removed before data analyses?

      We have clarified that catch trials were indeed removed prior to analyses.

      (12) Page 23, 1st line: 'In each, the decoder...' Something is missing here.

      Thank you for bringing this to our attention. This sentence has now been rephrased as “In both valid and invalid analyses” in the revised manuscript.

      Discussion

      (1) The analysis over multiple trials showed dampening within the first 15 min followed by sharpening. I found the discussion of this finding very lengthy and speculative (page 17). I recommend shortening this part and providing only the main arguments that could stimulate future research.

      Thank you for the suggestion. Since Reviewer 3 has requested additional details in this part of the discussion, we have opted to keep this paragraph in the manuscript. However, we have also made it clearer that this section is relatively speculative and the arguments provided for the across trials dynamics are meant to stimulate further research.

      (2) As this task is purely perceptual, the results support the OPT for the area of visual perception. For action, different results have been reported. Suppression within-trial has been shown to be larger for expected than unexpected features of action targets and suppression even starts before the start of the movement without showing any evidence for sharpening (e.g., Fuehrer et al., 2022, PNAS). For suppression across trials, it has been found that suppression decreases over the course of learning to associate a sensory consequence to a specific action (e.g., Kilteni et al., 2019, eLife). Therefore, expectation suppression might function differently in perception and action (an area that still requires further research). Please clarify the scope of your study and results on perceptual expectations in the introduction, discussion, and abstract.

      We have clarified the scope of the study in the revised manuscript.

      Figures

      (1) Figure 1A: Add 't' to the arrow to indicate time.

      This has been rectified.

      (2) Figure 3: In the figure caption, sensory and memory decoding seem to be mixed up. Please correct. Please add what the dashed horizontal line indicates.

      Thank you for bringing this to our attention. This has been rectified.

      Reviewer 3 (Recommendations for the authors):

      I applaud the authors for a well-written introduction and an excellent summary of a complicated topic, giving fair treatment to the different accounts proposed in the literature. However, I believe a few additional studies should be cited in the Introduction, particularly time-resolved studies such as Han et al., 2019; Kumar et al., 2017; Meyer and Olson, 2011. This would provide the reader with a broader picture of the current state of the literature, as well as point the reader to critical time-resolved studies that did not find evidence in support of OPT, which are important to consider in the interpretation of the present results.

      The introduction has been expanded to include the aforementioned studies in the revised manuscript.

      Given previous neuroimaging studies investigating the present phenomenon, including with time-resolved measures (e.g. Kok et al., 2017; Han et al., 2019; Kumar et al., 2017; Meyer & Olson 2011), why do the authors think that their data, design, or analysis allowed them to find support for OPT but not previous studies? I do not see obvious modifications to the paradigm, data quantity or quality, or the analyses that would suggest a superior ability to test OPT predictions compared to previous studies. Given concerns regarding the data analyses (see points below), I think it is essential to convincingly answer this question to convince the reader to trust the present results.

      The most obvious alteration to the paradigm is the use of non-repeated stimuli. Each of the above time-resolved studies utilised repeated stimuli (either repeated, identical stimuli, or paired stimuli where pairings are changed but the pool of stimuli remains the same), allowing for RS to act as a confound as exemplars are still presented multiple times. By removing this confound, it is entirely plausible that we may find different time-resolved results given that it has been shown that RS and ES are separable in time (Todorovic & de Lange, 2012). We also test during learning rather than training participants on the task beforehand. By foregoing a training session, we are better equipped to assess OPT predictions as they emerge. In our across-trial results, learning appears to take place after approximately 15 minutes or 432 trials, at which point dampening reverses to sharpening. Had we trained the participants prior to testing, this effect would have been lost.

      What is actually decoded in the "prediction decoding" analysis? The authors state that it is "decoding the predictable trailing images based on the leading images" (p.11). The associated chance level (Figure 2E) is indicated as 50%. This suggests that the classes separated by the SVM are T6 vs T7. How this was done is however unclear. For each leading image decoding the predictable trailing images should be equivalent to decoding validity (as there are only 2 possible trailing images, where one is the valid and the other the invalid image). How is it then possible that the analysis is performed separately for valid and invalid trials? Are the authors simply decoding which leading image was shown, but combine L1+L2 and L4+L5 into one class respectively? If so, this needs to be better explained in the manuscript. Moreover, the resulting decoder would in my opinion not decode the predicted image, but instead learn to dissociate the representation of L1+L2 from L4+L5, which may also explain why the time course of the prediction peaks during the leading image stimulus-response, which is rather different compared to previous studies decoding (prestimulus) predictions (e.g. Kok et al. 2017). If this is indeed the case, I find it doubtful that this analysis relates to prediction. Instead for the prediction analysis to be informative about the predicted image the authors should, in my opinion, train the decoder on the representation of trailing images and test it during the prestimulus interval.

      As mentioned above, the prediction decoding analysis has been removed from the manuscript. The prediction decoding analysis was intended as a sanity check, as validity information was not yet available to participants.

      Related to the point above, were the leading/trailing image categories and their mapping to L1, L2, etc. in Figure 1B fixed across subjects? I.e. "'beach' and 'barn' as 'Leading' categories would result in 'church' as a 'Trailing' category with 75% validity" (p.20) for all participants? If so, this poses additional problems for the interpretation of the analysis discussed in the point above, as it may invalidate the control analyses depicted in Figure 2E, as systematic differences and similarities in the leading image categories could account for the observed results.

      Image categories and their mapping were indeed fixed across participants. While this may result in physical differences and similarities between images influencing results, counterbalancing categories across participants would not have addressed this issue. For example, had we swapped “beach” with “barn” in another participant, physical differences between images may still be reflected in the prediction decoding. On the other hand, counterbalancing categories across trials was not possible given our aim of examining the initial stages of learning over trials. Had we changed the mappings of categories throughout the experiment for each participant, we would have introduced reversal learning and nullified our ability to examine the initial stages of learning under flat priors. In any case, the prediction decoding analysis has been removed from the manuscript, as outlined above.

      Why was the neutral condition L3 not used for prediction decoding? After all, if during prediction decoding both the valid and invalid image can be decoded, as suggested by the authors, we would also expect significant decoding of T8/T9 during the L3 presentation.

      In the neutral condition, L3 was followed by T8 vs. T9 with 50% probability, precluding prediction decoding. While this could have served as an additional control analysis for EEG-based decoding, we have opted for removing prediction decoding from the analysis. However, in response to the other Reviewers’ comments, the neutral condition has now been included in the behavioral analysis.

      The following concern may arise due to a misunderstanding of the analyses, but I found the results in Figures 2C and 2E concerning. If my interpretation is correct, then these results suggest that the leading image itself can only be decoded with ~33% accuracy (25% chance; i.e. ~8% above chance decoding). In contrast, the predicted (valid or invalid) image during the leading image presentation can be decoded with ~62% accuracy (50% chance; i.e. ~12% above chance decoding). Does this seem reasonable? Unless I am misinterpreting the analyses, it seems implausible to me that a prediction but not actually shown image can be better decoded than an on-screen image. Moreover, to my knowledge studies reporting decoding of predictions can (1) decode expectations just above chance level (e.g. Kok et al., 2017; which is expected given the nature of what is decoded) and (2) report these prestimulus effects shortly before the anticipated stimulus onset, and not coinciding with the leading image onset ~800ms before the predicted stimulus onset. For the above reasons, the key results reported in the present manuscript seem implausible to me and may suggest the possibility of problems in the training or interpretation of the decoding analysis. If I misunderstood the analyses, the analysis text needs to be refined. If I understood the analyses correctly, at the very least the authors would need to provide strong support and arguments to convince the reader that the effects are reliable (ruling out bias and explaining why predictions can be decoded better than on-screen stimuli) and sensible (in the context of previous studies showing different time-courses and results).

      As explained above, we have addressed this concern by performing an additional analysis, implementing decoding based on image pixel values. Indeed we could not rule out the possibility that “prediction” decoding reflected stimulus differences between leading images.

      Relatedly, the authors use the prestimulus interval (-200 ms to 0 ms before predicted stimulus onset) as the baseline period. Given that this period coincides with prestimulus expectation effects (Kok et al., 2017), would this not result in a bias during trailing image decoding? In other words, the baseline period would contain an anticipatory representation of the expected stimulus (Kok et al., 2017), which is then subtracted from the subsequent EEG signal, thereby allowing the decoder to pick up on this "negative representation" of the expected image. It seems to me that a cleaner contrast would be to use the 200 ms before leading image onset as the baseline.

      The analysis of trailing images aimed at testing specific hypotheses related to differences between decoding accuracy in valid vs. invalid trials. Since the baseline was by definition the same for both kinds of trials (since information about validity only appears at the onset of the trailing image), changing the baseline would not affect the results of the analysis. Valid and invalid trials would have the same prestimulus effect induced by the leading image.

      Again, maybe I misunderstood the analyses, but what exactly are the statistics reported on p. 11 onward? Why is the reported Tmax identical for multiple conditions, including the difference between conditions? Without further information this seems highly unlikely, further casting doubts on the rigor of the applied methods/analyses. For example: "In the sensory decoding analysis based on leading images, decoding accuracy was above chance for both valid (Tmax= 2.76, pFWE < 0.001) and invalid trials (Tmax= 2.76, pFWE < 0.001) from 100 ms, with no significant difference between them (Tmax= 2.76, pFWE > 0.05) (Fig. 2C)" (p.11).

      Thank you for bringing this to our attention. As previously mentioned, this copy error has been rectified in the revised manuscript.

      Relatedly, the statistics reported below in the same paragraph also seem unusual. Specifically, the Tmax difference between valid and invalid conditions seems unexpectedly large given visual inspection of the associated figure: "The decoding accuracy of both valid (Tmax = 2.76, pFWE < 0.001) and invalid trials (Tmax = 14.903, pFWE < 0.001)" (p.12). In fact, visual inspection suggests that the largest difference should probably be observed for the valid not invalid trials (i.e. larger Tmax).

      This copy error has also been rectified in the revised manuscript.

      Moreover, multiple subsequent sections of the Results continue to report the exact same Tmax value. I will not list all appearances of "Tmax = 2.76" here but would recommend the authors carefully check the reported statistics and analysis code, as it seems highly unlikely that >10 contrasts have exactly the same Tmax. Alternatively, if I misunderstand the applied methods, it would be essential to better explain the utilized method to avoid similar confusion in prospective readers.

      This error has also now been rectified. As mentioned above the prediction decoding analysis has been removed.

      I am not fully convinced that Figures 3A/B and the associated results support the idea that early learning stages result in dampening and later stages in sharpening. The inference made requires, in my opinion, more than a significant effect in one time bin and the absence of an effect in other bins. Instead, to reliably make this inference one would need a contrast showing a difference in decoding accuracy between bins, or ideally an analysis that is not contingent on seemingly arbitrary binning of data but instead tests for a decrease (or increase) in the slope of the decoding accuracy across trials. Moreover, the decoding analyses seem to be at the edge of SNR, hence making any interpretation that depends on the absence of an effect in some bins yet more problematic and implausible.

      Thank you for the helpful suggestion. As previously mentioned, we fitted a logarithmic model to quantify the change of the decoding benefit over trials, then found the trial index for which the change of the logarithmic fit was < 0.1%. Given the results of this analysis and to ensure a sufficient number of trials, we focussed our further analyses on bins 1-2. This is explained in more detail in the revised manuscript.
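The plateau criterion described in this response can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes the model has the form a + b * ln(trial), and it assumes that "change" means the relative change of the fitted curve from one trial to the next. The function names `fit_log` and `plateau_trial` are hypothetical.

```python
import math

def fit_log(trials, values):
    """Least-squares fit of values ~ a + b * ln(trial) (closed form)."""
    xs = [math.log(t) for t in trials]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def relative_change(a, b, t):
    """Relative change of the fitted curve between trial t and trial t + 1."""
    f_t = a + b * math.log(t)
    f_next = a + b * math.log(t + 1)
    return abs(f_next - f_t) / abs(f_t)

def plateau_trial(a, b, max_trial, tol=0.001):
    """First trial index at which the relative change drops below tol (0.1%)."""
    for t in range(1, max_trial):
        if a + b * math.log(t) == 0:
            continue  # relative change is undefined where the fit crosses zero
        if relative_change(a, b, t) < tol:
            return t
    return None  # the fit never flattens within max_trial
```

Whether the 0.1% criterion was applied to a relative or an absolute change is not stated in the response; swapping the definition in `relative_change` is enough to explore the alternative.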

      Relatedly, based on the literature there is no reason to assume that the dampening effect disappears with more training, thereby placing more burden of proof on the present results. Indeed, key studies supporting the dampening account (including human fMRI and MEG studies, as well as electrophysiology in non-human primates) usually seem to entail more learning than has occurred in bin 2 of the present study. How do the authors reconcile the observation that more training in previous studies results in significant dampening, while here the dampening effect is claimed to disappear with less training?

      The discussion of these findings has been expanded in the revised manuscript. As previously outlined, many of the studies supporting dampening did not explicitly test the effects of learning as they emerge, nor did they control for RS to the same extent.

      The Methods section is quite bare bones. This makes an exact replication difficult or even impossible. For example, the sections elaborating on the GLM and cluster-based FWE correction do not specify enough detail to replicate the procedure. Similarly, how exactly the time points for significant decoding effects were determined is unclear (e.g., p. 11). Relatedly, the explanation of the decoding analysis, e.g. the choice to perform PCA before decoding, is not well explained in the present iteration of the manuscript. Additionally, it is not mentioned how many PCs the applied threshold on average resulted in.

      Thank you for this suggestion, we have described our methods in more detail.

      To me, it is unclear whether the PCA step, which to my knowledge is not the default procedure for most decoding analyses using EEG, is essential to obtain the present results. While PCA is certainly not unusual, to my knowledge decoding of EEG data is frequently performed on the sensor level as SVMs are usually capable of dealing with the (relatively low) dimensionality of EEG data. In isolation this decision may not be too concerning; however, in combination with other doubts concerning the methods and results, I would suggest the authors replicate their analyses using a conventional decoding approach on the sensor level as well.

      Thank you for this suggestion, we have explained our decision to use PCA in the revised manuscript.

      Several choices, like the binning and the focus on bins 1-2, seem rather post hoc. Consequently, frequentist statistics may, strictly speaking, not be appropriate. This further compounds the above-mentioned concerns regarding the reliability of the results.

      The reasoning behind our decision to focus on bins 1-2 is now explained in more detail in the revised manuscript.

      A notable difference in the present study, compared to most studies cited in the introduction motivating the present experiment, is that categories instead of exemplars were predicted.

      This seems like an important distinction to me, which surprisingly goes unaddressed in the Discussion section. This difference might be important, given that exemplar expectations allow for predictions across various feature levels (i.e., even at the pixel level), while category predictions only allow for rough (categorical) predictions.

      The decision to use categorical predictions over exemplars lies in the issue of RS, as it is impossible to control for RS while repeating stimuli over many trials. This has been discussed in more detail in the revised manuscript.

      While individually minor problems, I noticed multiple issues across several figures or associated figure texts. For example: Figure 1C only shows valid and invalid trials, but the figure text mentions the neutral condition. Why is the neutral condition not depicted but mentioned here? Additionally, the figure text lacks critical information, e.g. what the asterisk represents. The error shading in Figure 2 would benefit from transparency settings to not completely obscure the other time-courses. Increasing the figure content and font size within the figure (e.g. axis labels) would also help with legibility (e.g. consider compressing the time-course but therefore increasing the overall size of the figure). I would also recommend using more common methods to indicate statistical significance, such as a bar at the bottom of the time-course figure typically used for cluster permutation results instead of a box. Why is there no error shading in Figure 2A but all other panels? Fig 2C-F has the y-axis label "Decoding accuracy (%)" but certainly the y-axis, ranging roughly from 0.2 to 0.7, is not in %. The Figure 3 figure text gives no indication of what the error bars represent, making it impossible to interpret the depicted data. In general, I would recommend that the authors carefully revisit the figures and figure text to improve the quality and complete the information.

      Thank you for the suggestions. Figure 1C now includes the neutral condition. Asterisks denote significant results. The font size in Figure 2C-E has been increased. The y-axis on Figure 2C-E has been amended to accurately reflect decoding accuracy in percentage. Figure 2A has error shading, however, the error is sufficiently small that the error shading is difficult to see. The error bars in Figure 3 have been clarified.

      Given the choice of journal (eLife), which aims to support open science, I was surprised to find no indication of (planned) data or code sharing in the manuscript.

      Plans for sharing code/data are now outlined in the revised manuscript.

      While it is explained in sufficient detail later in the Methods section, it was not entirely clear to me, based on the method summary at the beginning of the Results section, whether categories or individual exemplars were predicted. The manuscript may benefit from clarifying this at the start of the Results section.

      Thank you for this suggestion. Following this and suggestions from other reviewers, the experimental paradigm and the mappings between categories have been further explained in the revised manuscript, to make it clearer that predictions are made at the categorical level.

      "Unexpected trials resulted in a significantly increased neural response 150 ms after image onset" (p.9). I assume the authors mean the more pronounced negative deflection here. Interpreting this, especially within the Results section as "increased neural response" without additional justification may stretch the inferences we can make from ERP data; i.e. to my knowledge more pronounced ERPs could also reflect increased synchrony. That said, I do agree with the authors that it is likely to reflect increased sensory responses, it would just be useful to be more cautious in the inference.

      Thank you for the interesting comment, this has been rephrased as a “more pronounced negative deflection” in the revised manuscript.

      Why was the ERP analysis focused exclusively on Oz? Why not a cluster around Oz? For object images, we may expect a rather wide dipole.

      Feuerriegel et al. (2021) have outlined issues questioning the robustness of univariate analyses for ES; as such, we opted for a targeted ROI approach on the channel showing peak amplitude of the visually evoked response (Fig. 2B). More details on this are in the revised manuscript.

      How exactly did the authors perform FWE? The description in the Method section does not appear to provide sufficient detail to replicate the procedure.

      FWE as implemented in SPM is a cluster-based method of correcting for multiple comparisons using random field theory. We have explained our thresholding methods in more detail in the revised manuscript.

      If I misunderstand the authors and they did indeed perform standard cluster permutation analyses, then I believe the results of the timing of significant clusters cannot be so readily interpreted as done here (e.g. p.11-12); see: Maris & Oostenveld 2007; Sassenhagen & Draschkow 2019.

      All statistics were based on FWE under random field theory assumptions (as implemented in SPM) rather than on cluster permutation tests (as implemented in e.g. Fieldtrip).

      Why did the authors choose not to perform spatiotemporal cluster permutation for the ERP results?

      As mentioned above, we opted to target our ERP analyses on Oz due to controversies in the literature regarding univariate effects of ES (Feuerriegel et al., 2021).

      Some results, e.g. on p.12 are reported as T29 instead of Tmax. Why?

      As mentioned above, prediction decoding analyses have been removed from the manuscript.

    1. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      Cheong et al. use a synapse-resolution wiring map of the fruit fly nerve cord to comprehensively investigate circuitry between descending neurons (DNs) from the brain and motor neurons (MNs) that enact different behaviours. These neurons were painstakingly identified, categorised, and linked to existing genetic driver lines; this allows the investigation of circuitry to be informed by the extensive literature on how flies walk, fly, and escape from looming stimuli. New motifs and hypotheses of circuit function were presented. This work will be a lasting resource for those studying nerve cord function.

      Strengths:

      The authors present an impressive amount of work in reconstructing and categorising the neurons in the DN to MN pathways. There is always a strong link between the circuitry identified and what is known in the literature, making this an excellent resource for those interested in connectomics analysis or experimental circuits neuroscience. Because of this, there are many testable hypotheses presented with clear predictions, which I expect will result in many follow-up publications. Most MNs were mapped to the individual muscles that they innervate by linking this connectome to pre-existing light microscopy datasets. When combined with past fly brain connectome datasets (Hemibrain, FAFB) or future ones, there is now a tantalising possibility of following neural pathways from sensory inputs to motor neurons and muscle.

      Weaknesses:

      As with all connectome datasets, the sample size is low, limiting statistical analyses. Readers should keep this in mind, but note that this is the current state-of-the-art. Some figures are weakened by relying too much on depictions of wiring diagrams as evidence of circuit function, similarity between neuropils, etc. without additional quantitative justification.

      We thank the reviewer for their helpful comments. We are excited about the release of this densely reconstructed connectome and its potential to facilitate circuit exploration in the VNC. We note that while statistical methods for analyzing complicated networks such as the connectome are still being developed, the wiring diagrams presented are themselves visualizations of quantitative data. We address specific concerns below.

      Reviewer #2 (Public Review):

      Summary:

      In Cheong et al., the authors analyze a new motor system (ventral nerve cord) connectome of Drosophila. Through proofreading, cross-referencing with another female VNC connectome, they define key features of VNC circuits with a focus on descending neurons (DNs), motor neurons (MNs), and local interneuron circuits. They define DN tracts, MNs for limb and wing control, and their nerves (although their sample suffers for a subset of MNs). They establish connectivity between DNs and MNs (minimal). They perform topological analysis of all VNC neurons including interneurons. They focus specifically on identifying core features of flight circuits (control of wings and halteres), leg control circuits with a focus on walking rather than other limbed behaviors (grooming, reaching, etc.), and intermediate circuits like those for escape (GF). They put these features in the context of what is known or has been posited about these various circuits.

      Strengths:

      Some strengths of the manuscript include the matching of new DN and MN types to light microscopy, including the serial homology of leg motor neurons. This is a valuable contribution that will certainly open up future lines of experimental work.

      Also, the analysis of conserved connectivity patterns within each leg neuromere and interconnecting connectivity patterns between neuromeres will be incredibly valuable. The standard leg connectome is very nice.

      Finally, the finding of different connectivity statistics (degrees of feedback) in different neuropils is quite interesting and will stimulate future work aimed at determining its functional significance.

      We thank the reviewer for their constructive feedback, and are optimistic about the utility of the MANC connectome to the Drosophila neurobiology community in dissecting VNC circuit function.

      Weaknesses:

      First, it seems like quite a limitation that the neurotransmitter predictions were based on training data from a fairly small set of cells, none of which were DNs. It's wonderful that the authors did the experimental work to map DN neurotransmitter identity using FISH, and great that the predictions were overall decently accurate for both ACh and Glu, but unfortunate that they were not accurate for GABA. I hope there are plans to retrain the neurotransmitter predictions using all of this additional ground truth experimental data that the authors collected for DNs, in order to provide more accurate neurotransmitter type predictions across more cell types.

      The reviewer makes an excellent suggestion, and collecting further ground truth data and retraining the neurotransmitter classifier is an ongoing research project. 

      Second, the degradation of many motor neurons is unfortunate. Figure 5 Supplement 1 shows that roughly 50% of the leg motor neurons have significantly compromised connectivity data, whereas, for non-leg motor neurons, few seem to be compromised. If that is the correct interpretation of this figure, perhaps a sentence like this that includes some percentages (~50% of leg MNs, ~5% of other MNs) could be added to the main text so that readers can get a sense of the impact more easily.

      Thank you for this suggestion. We have added a line describing the percentage of leg and other MNs affected (L416-417).

      As well, Figure 5 Supplement 1 caption says "Note that MN groups where all members of the group have reconstruction issues may not be flagged" - could the authors comment on how common they think this is based on manual inspection? If it changes the estimate of the percentage of affected leg motor neurons from 50% to 75% for example, this caveat in the current analysis would need to be addressed more directly. Comparing with FANC motor neurons could perhaps be an alternative/additional approach for estimating the number of motor neurons that are compromised.

      We agree that a direct comparison to another dataset, such as FANC, would aid in identifying reconstruction issues. However, a full analysis is not currently possible as only a minority of FANC neurons have been proofread or annotated. We were able to gain some insights into reconstruction quality by looking at T1 motor neurons, where FANC MN reconstruction is more complete. As reported in the submitted manuscript, we were able to confidently match T1 MNs between FANC and MANC for all but one MN (we are missing one ltm MN on the right side of MANC). While some of the MANC neurons had smaller/less dense arbors than FANC, none of them would have been flagged as having reconstruction issues. However, for FANC, we observe that neurons on the right have less dense arbors and fewer reconstructed synapses than neurons on the left.  We have prepared a reviewer figure analyzing the consistency of synapse counts for the T1 (front leg) MNs:

      Author response image 1.

      In these results (MANC on the left, FANC on the right) we compare the number of input synapses on matched motor neurons on the left (LHS) and right hand side (RHS) of each dataset. We see that the MANC distribution is much more symmetric, indicating left and right hand side synapse counts for matched MNs are more similar in MANC. This is likely largely due to the left-right difference in reconstruction completeness in the FANC T1 leg neuropils. The number of synapses per cell type is also more variable in FANC. Overall, we recommend that end users should inspect the morphology and total synapse counts of individual MNs of interest in either dataset as part of any detailed analysis.

      This analysis might benefit from some sort of control for true biological variability in the number of MN synapses between left and right or across segments. I assume the authors chose the threshold of 0.7 because it seemed to do a good job of separating degraded neurons from differences in counts that could just be due to biological variability or reconstruction imperfections, but perhaps there's some way to show this more explicitly. For example, perhaps show how much variability there is in synapse counts across all homologs for one or two specific MN types that are not degraded and are reconstructed extremely well, so any variability in input counts for those neurons is likely to be biologically real. Especially because the identification of serial homologs among motor neurons is a key new contribution of this paper, a more in-depth analysis of similarities and differences in homologous leg MNs across segments could be interesting to the field if the degradation doesn't preclude it.

      We agree that there can be ambiguity in whether variability in synapse counts between left-right homologs of a MN type represents biological variability or technical issues. We have added a comparison of synapse counts of T1 leg MNs in MANC (Left) vs FANC (Right) as noted in the previous point. As the number of connectomes available to us increases, we will have a better idea of how synapse counts of MNs vary within and between animals.
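A left-right consistency check of the kind discussed above could be sketched like this. The 0.7 threshold is the value the reviewer mentions, but how the ratio is actually computed is not stated in this exchange, so the min/max definition, the function names, and the example counts below are all assumptions made for illustration.

```python
def symmetry_ratio(left_count, right_count):
    """Ratio of the smaller to the larger synapse count (1.0 = perfectly symmetric)."""
    larger = max(left_count, right_count)
    if larger == 0:
        return 1.0  # no synapses on either side: treat as symmetric
    return min(left_count, right_count) / larger

def flag_asymmetric(counts, threshold=0.7):
    """Names of MN types whose left/right ratio falls below the threshold.

    counts maps an MN type name to a (left, right) pair of input synapse counts.
    """
    return sorted(
        name for name, (left, right) in counts.items()
        if symmetry_ratio(left, right) < threshold
    )

# Hypothetical counts, not taken from MANC or FANC.
example = {"mn_a": (1000, 950), "mn_b": (1000, 500), "mn_c": (0, 0)}
print(flag_asymmetric(example))
```

A ratio like this cannot separate biological variability from reconstruction problems on its own, which is the reviewer's point; it only identifies candidates for manual inspection.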

      Fourth, the infomap communities don't seem to be so well controlled/justified. Community detection can be run on any graph - why should I believe that the VNC graph is actually composed of discrete communities? Perhaps this comes from a lack of familiarity with the infomap algorithm, but I imagine most readers will be similarly unfamiliar with it, so more work should be done to demonstrate the degree to which these communities are really communities that connect more within than across communities.

      A priori we expect that there is some degree of functional division between circuits controlling different limbs or motor systems, given current evidence that VNC neuropils and neural hemilineages are relatively specialized in controlling motor output. We have added this explanation to section 2.4.2 (L633-635).

      The Infomap algorithm was chosen out of several directed and undirected community detection methods that we tried, as it defined communities that each had connectivity with narrow and specific motor neuron subclasses. For example, it labeled populations in each of the six leg neuropils as belonging to distinct communities. We think this provides an interesting partitioning of the VNC network that could have biological relevance (which future functional studies should investigate). To the reviewer’s final sentence, we do show intra- vs inter-community connectivity in Fig. 9–supplement 1B. Notably, most communities except several small ones have far more intra-community connectivity than inter-community connectivity. We have added text highlighting this observation (L656-658).
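The intra- versus inter-community comparison mentioned here can be illustrated with a short sketch. It assumes a weighted, directed edge list and a precomputed community label for every neuron; it is not the analysis behind Fig. 9–supplement 1B, and the function name and toy data are hypothetical.

```python
from collections import defaultdict

def community_connectivity(edges, community_of):
    """Sum synapse weights within and across communities.

    edges: iterable of (pre, post, weight) tuples.
    community_of: dict mapping each neuron to its community label.
    Returns {community: (intra_weight, inter_weight)}; an edge that crosses
    communities counts toward the inter weight of both communities it touches.
    """
    intra = defaultdict(int)
    inter = defaultdict(int)
    for pre, post, weight in edges:
        c_pre, c_post = community_of[pre], community_of[post]
        if c_pre == c_post:
            intra[c_pre] += weight
        else:
            inter[c_pre] += weight
            inter[c_post] += weight
    return {c: (intra[c], inter[c]) for c in set(community_of.values())}

# Toy network with two communities.
toy_edges = [("a", "b", 10), ("b", "a", 5), ("a", "c", 2), ("c", "d", 7)]
toy_labels = {"a": 1, "b": 1, "c": 2, "d": 2}
print(community_connectivity(toy_edges, toy_labels))
```

A community whose intra weight greatly exceeds its inter weight matches the pattern the authors describe for most communities.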

      We do, however, agree with the general point of the reviewer that it is not yet known which community detection methods are ‘optimal’ for use with connectomics data, so we have added further text (L679-683) explaining that community detection in MANC will require further investigation and validation in the future.

      I think the length of this manuscript reduces its potential for impact, as I suspect the reality is that many people won't read through all 140 pages and 21 main figures of (overall excellent) work and analysis.

      We intend this paper to serve not only as a first look into the organization of descending-to-motor circuits, but also as a resource for future investigations in MANC. The provided detail is intended to serve these purposes.

      Reviewer #1 (Recommendations For The Authors):

      General comments:

      I find that there are too many main figures with too much content in them, as well as too much corresponding text. Much of the initial anatomical identification and description could be summarised in fewer main figures, with more supplementary figures if the authors desired. I think there is a lot of great insight in this paper, particularly in the second half, but I am concerned that the extensive detail in the initial sections may challenge reader engagement through to the later sections of the paper. It would also be useful to have a higher level and shorter discussion.

      Reiterating our response from above, we intend this paper to serve not only as a first look into the organization of descending-to-motor circuits, but also as a resource for future investigations in MANC. The provided detail is intended to serve these purposes.

      There is sometimes an over-reliance on wiring diagrams or complex plots as evidence without further quantification. I will mention several examples below, as well as additional suggestions.

      Specific comments:

      In Figure 2E, how are DNs divided into pair vs population type? This was a very interesting idea, particularly in light of "command-like" neurons vs ensembles of DNs controlling behaviour. However, it is not clear how this distinction is made. This concept is referenced throughout the manuscript, so I think a clear quantitative way of identifying "pair" vs "population" identity for each DN would be very useful. And at the very least, a thorough explanation of how it is done in the current manuscript.

      We have added additional text in the Figure 2 legend to point towards Materials and Methods where the DN grouping (pair vs. population) is explained. These groups were formed based on morphology and further split into types based on connectivity, if needed. However, as the connectome represents a static snapshot of connectivity with no functional data, it remains possible that some DNs that were grouped as populations may act functionally as multiple pairs. Future work should continue to update these annotations.

      In Figure 4, there are some inconsistencies between neurotransmitter predictions and experimental FISH data. Have the authors taken into consideration Lacin et al. 2019 (https://elifesciences.org/articles/43701)? Specifically in that paper, it is stated: "We did not find any cases of neurons using more than one neurotransmitter, but found that the acetylcholine specific gene ChAT is transcribed in many glutamatergic and GABAergic neurons, but these transcripts typically do not leave the nucleus and are not translated." I wonder if this might explain some of the inconsistencies between FISH (mRNA detection) and the neurotransmitter predictions (presumably based on indirect protein structures detected via EM imagery), or the presence of so much co-transmission.

      We agree and have added this possible explanation for apparent co-transmission in the text (L394-397).

      In Figure 8B, the authors state: "We found that individual DN and MN subclasses have direct downstream and upstream partners, respectively, that are relatively hemilineage-restricted (Figure 8B)." While the connectivity patterns highlighted are intriguing, further quantitative analysis could help strengthen this point. The connectivity matrices in Figure 8B are linked to activation phenotypes and hemilineages below. But I don't really know how to interpret "relatively hemilineage-restricted" in light of this plot. How does this connectivity pattern for example compare statistically to a randomly selected set of DNs (maintaining the same group size for example)? Would random DN sets be less hemilineage restricted? Similar quantification would be helpful to support this statement "...with high correspondence between the hemilineages connected to individual DN and MN subclasses that are expected to be functionally related."

      "both upper tectulum DNs (DNut) and wing MNs (MNwm) have significant connectivity with hemilineages 6A, 7B, 2A, 19B, 12A and 3B". What is significant connectivity? Looking at the plot in Figure 8B, why is DNut -> 16B not considered significant? Is there a threshold and if so, what is the justification?

      These plots aim to be descriptive rather than drawing hard quantitative thresholds between ‘significant’ and ‘non-significant’ connectivity. We have revised the text to remove the terms ‘restricted’ and ‘significant’ and to clarify our interpretation (L555-559).

      In Figure 9G-H, this is a very interesting finding, but how do we know that the difference is real? Why not do a statistical test to compare the brain and VNC? Or create a null model network with edge swaps, etc. to compare against.

      Statistical comparison between the brain and VNC may be problematic given differences in generating these connectomes, as well as missing connectivity (only half the brain is imaged) in the hemibrain connectome. Comparison to a null model is possible and for purposes of understanding motif frequency in general has already been done (see for example, Lin et al., 2024, Nature). However, a null or shuffled model is not required for comparing motif frequencies between brain or VNC neuropils as is the point of this particular graph. At present, we simply highlight a qualitative observation that will require future work to investigate.
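For readers unfamiliar with the reviewer's suggestion, a degree-preserving edge-swap null model can be sketched as below. This is a generic illustration of the technique for an unweighted directed graph without duplicate edges, not an analysis performed in the paper; the function names are hypothetical.

```python
import random
from collections import Counter

def edge_swap_null(edges, n_swaps, seed=0):
    """Shuffle a directed graph while keeping every node's in- and out-degree.

    Repeatedly picks two edges (a->b, c->d) and rewires them to (a->d, c->b),
    rejecting swaps that would create a self-loop or a duplicate edge.
    """
    rng = random.Random(seed)
    edge_list = list(edges)
    if len(edge_list) < 2:
        return edge_list
    edge_set = set(edge_list)
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        new1, new2 = (a, d), (c, b)
        if a == d or c == b:
            continue  # would create a self-loop
        if new1 in edge_set or new2 in edge_set:
            continue  # would create a duplicate edge
        edge_set -= {(a, b), (c, d)}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_list

def degrees(edges):
    """Return (out_degree, in_degree) Counters for a directed edge list."""
    return Counter(u for u, _ in edges), Counter(v for _, v in edges)

def reciprocal_pairs(edges):
    """Number of node pairs connected in both directions (a simple motif count)."""
    edge_set = set(edges)
    return sum(1 for u, v in edge_set if u < v and (v, u) in edge_set)
```

Comparing `reciprocal_pairs` on the observed graph with its distribution over many shuffled graphs would indicate whether feedback motifs are over-represented relative to what the degree sequence alone predicts.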

      Referring to Figure 12 in the main text, "we observe that the power MN upstream network is largely shared among all power MNs and is highly bilateral." Quantifying the fraction of shared upstream neurons from power MNs would make this statement much stronger. Particularly if compared to other non-power MNs. Or potentially using some other network comparison metric.

      This is a good point. We have added cosine similarity to Figure 6 for wing/haltere MNs to show the similarity between inputs across these MNs, and added text in sections 2.3 (L461-465) and 2.5.3 (L987-988) discussing the cosine similarity.
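Cosine similarity between the input profiles of two motor neurons can be computed as in the sketch below. Representing each MN as a mapping from upstream neuron to synapse count is an assumption made here for illustration, as are the function name and the example values.

```python
import math

def cosine_similarity(inputs_a, inputs_b):
    """Cosine similarity between two sparse input vectors.

    Each argument maps an upstream neuron ID to a synapse count; partners
    missing from a dict are treated as zero.
    """
    shared = set(inputs_a) & set(inputs_b)
    dot = sum(inputs_a[k] * inputs_b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in inputs_a.values()))
    norm_b = math.sqrt(sum(v * v for v in inputs_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no inputs on one side: similarity is undefined, report 0
    return dot / (norm_a * norm_b)

# Hypothetical input profiles for two MNs that share most upstream partners.
mn_1 = {"in_1": 30, "in_2": 40}
mn_2 = {"in_1": 60, "in_2": 80, "in_3": 0}
print(cosine_similarity(mn_1, mn_2))
```

Because the measure ignores overall magnitude, two MNs with the same relative input weights score close to 1 even if one receives twice as many synapses.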

      In Figure 13B, "Nearly 50% of these restricted neurons (totalling about 1200 per leg neuropil) have been serially matched across the six neuropils (Figure 13B)". There seems like a disconnect here. In the IR, CR, and BR columns, I see ~2750, ~500, and ~1250 neurons not in a serial set (~4500 total); I see ~1500, ~750, and ~1000 in a serial set (~3250 total). This would mean that ~58% of neurons are not in serial sets, ~42% are in serial sets. Shouldn't the conclusion be the opposite then? That surprisingly most intrinsic neurons are not repeated across leg neuropils. I find this fascinating if true. Perhaps there is some confusion on my part, however.

      We now find that about half of the leg-restricted neurons are serially repeated across the six leg neuropils with similar morphology and connectivity, especially to the downstream leg motor neurons. Since first submission of this paper, we have identified some additional serial homologues while completing the systematic cell typing, described in the accompanying paper Marin et al. 2024. Figure 13B has now been updated to reflect this. In total, 3998 of 7684 restricted neurons (IR, CR, BR) have been assigned to a serial set or serial type. The sentence in the text has been adjusted to report that 52% of these restricted neurons are in serial sets (L1125).
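The two percentages in this exchange can be checked with a few lines of arithmetic. The reviewer's numbers are approximate read-offs from the original figure, as quoted in the comment; the updated counts are those given in the response.

```python
def percent(part, whole):
    """Share of part in whole, as a percentage."""
    return 100.0 * part / whole

# Reviewer's approximate read-off from the original Figure 13B (IR, CR, BR).
not_in_serial_set = 2750 + 500 + 1250   # about 4500
in_serial_set = 1500 + 750 + 1000       # about 3250
reviewer_share = percent(in_serial_set, in_serial_set + not_in_serial_set)

# Counts reported in the authors' response for the updated figure.
updated_share = percent(3998, 7684)

print(f"{reviewer_share:.0f}% vs {updated_share:.0f}%")  # prints "42% vs 52%"
```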

      In Figure 13D-E, "the Tect INs are not a homogenous population." Providing additional evidence could strengthen this statement. A connectivity matrix is shown in (D), followed by examples of morphologies in (E). What makes a population homogenous or heterogenous? For example, compared to all possible INs, the Tect IN morphology actually looks quite similar. Are those connectivity matrices in (D) really so different? What would a random selection of neurons look like?

      Our sister paper, Marin et al. (2024), has looked into variation of connectivity across neurons of the entire VNC in much more detail, including clustering methods that include connectivity and other criteria for cell typing. Thus, we have now amended the text to direct the reader to that paper for more detail on variability of connectivity in the Tect INs, which were divided into 5 cell types in Marin et al. (2024) (L1027-1031). In addition, we have replaced our clustering by connectivity in Figure 13 with the cell type clusters from Marin et al. (2024).

      In reference to Figure 13 - Supplement 1, "This standard leg connectome was very similar across legs, but there were small deviations between T1, T2, and T3 legs, as shown in Figure 13-Supplement 1." - what makes a deviation small? T1 seems to generally have many more synapses, T2 many fewer, and T3 a mixture depending on the connection. Also, are there lost connections or new connections? A quantification of these issues would be helpful instead of simply depicting the wiring diagrams.

      The connections that differ are likely due to the reconstruction state of leg MNs. We have now stated this in the main text for clarification (L1143-1145). In the leg neuropils, T2 and T3 left hand side MNs have sparser dendritic arbors than the right hand side. Therefore the differences in Figure 13–Supplement 1, which are almost exclusively the connections between the leg restricted neurons onto leg MNs, seem stronger in T1. Future work, bolstered by additional datasets, will undoubtedly reveal further insight into the comparison of circuits for the different legs.

      In Figure 15 - Supplement 2, "We used effective connectivity to identify leg DNs with similar MN connectivity patterns (Figure 15-Supplement 2). Of previously identified DNs, we found that DNg13 showed a highly similar effective connectivity fingerprint."

      How was this similarity calculated? How do we know these particular DNs have similar effective connectivity? The connectivity matrix depicted is quite complex, with both layer and connectivity scores quantified at each location. A principled way of determining similarity would make this statement much stronger.

      The similarity was calculated simply as the Euclidean distance between the effective connectivity matrices of each DN onto the set of MNs. While this is a straightforward comparison mathematically, effective connectivity calculations (first introduced in this context in Li et al., 2020, by our collaborators Larry Abbott and Ashok Litwin-Kumar) have not yet been subject to functional validation. We therefore agree with the reviewer that this should not be overinterpreted at this point. Future functional work should explore the hypotheses suggested here and more quantitatively compare the similarity of different DN-MN pathways.
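
      A minimal sketch of this kind of comparison follows, assuming each DN is summarized as a vector of effective connectivity scores onto a fixed, ordered set of MNs. The DN labels, the scores and the function name are placeholders invented for illustration and are not taken from the paper; the real fingerprints also carry layer information that this toy example omits.

```python
import numpy as np


def rank_by_euclidean_distance(profiles: dict, query: str) -> list:
    """Rank all other DNs by Euclidean distance to the `query` DN.

    `profiles` maps a DN label to a 1D array of effective connectivity
    scores onto the same ordered set of MNs (an assumed representation).
    Smaller distance means a more similar fingerprint.
    """
    reference = profiles[query]
    distances = {
        name: float(np.linalg.norm(vector - reference))
        for name, vector in profiles.items()
        if name != query
    }
    return sorted(distances.items(), key=lambda item: item[1])


# Placeholder fingerprints for three hypothetical DNs onto three MNs.
toy_profiles = {
    "DN_a": np.array([0.5, 0.1, 0.0]),
    "DN_b": np.array([0.4, 0.2, 0.0]),
    "DN_c": np.array([0.0, 0.0, 0.9]),
}
ranking = rank_by_euclidean_distance(toy_profiles, "DN_a")
```

      Here `DN_b` ranks closest to `DN_a`. Note that Euclidean distance is sensitive to overall magnitude, so a DN with the same target pattern but weaker connectivity would appear less similar than it would under a normalized metric.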

      Minor notes:

      In Figure 4E, the circles, squares, and triangles in the figure legend are too small. This is also true to some extent in the plot itself.

      We have increased the size of the symbols in the legend and plot.

      In Figure 8E right, the figure legend and x/y axes are not clear to me. Unfortunately, I'm not sure what the plot is showing because of this.

      The right plot in Figure 8E is the number of DN groups each MN group receives input from, at a threshold of 1% input. As this plot is redundant with the left plot, we have decided to remove it.

      In Figure 8I, it would be interesting to see which neurons are directly downstream of DNs. One can't see layers 2/3/4 with the fan-out expansion of neurons and the y-axis scale.

      We have revised the plot to better show cell composition of individual layers.

      In Figure 19E, it would be helpful to also have a standard y-axis.

      The panel has been revised accordingly.

      Reviewer #2 (Recommendations For The Authors):

      General:

      In the Title, you do not mention DNs or MNs but these are a major focus of this study. The title could be more descriptive of the work.

      Per the reviewer’s comments, we have revised the title to “Transforming descending input into motor output: An analysis of the Drosophila Male Adult Nerve Cord connectome”.

      A glossary would be helpful, where all the paper's abbreviations and their definitions are provided in one place. Perhaps a hierarchical structure would help (for at least part of the glossary), so that terms like NTct, WTct, and HTct could be nested underneath UTct, for example.

      We do include a glossary in the sister paper, Marin et al. (2024) and in this paper have included a short glossary in the first Figure. Please refer to these sources for abbreviation reference.

      Introduction:

      Define 'Premotor'.

      We have defined ‘premotor circuits’ to be ‘circuits that directly or indirectly control motor output’ in lines 45-46.

      It might be worthwhile to start with a broader introduction sentence than the current one that focuses just on the fly, in order to emphasize the impact of MANC as the first complete connectome of a motor circuit in any animal with limbs or wings.

      We have revised the introductory paragraph per the reviewer’s suggestions.

      "Muscles in the leg are not innervated uniformly; indeed, in the T1 legs the number of MNs per muscle varies by as much as an order of magnitude" needs to specify the axis of variability more clearly - the authors probably mean variability across muscles in the leg (not variability across individuals for example) but I think the current sentence is a bit ambiguous in that respect.

      We have reworded this sentence to clarify this point (L132-133).

      Line 182 end of paragraph: It would be useful to point out explicitly what makes the MANC project valuable in the context of a similar FANC project - for example, that the MANC connectome is more complete, is a male (so interesting for anyone interested in sexual dimorphism), and gives the field an n=2 for VNC connectome datasets.

      We agree, and have added a sentence describing the benefits of the MANC connectome on L209-212.

      Line 213: A brief phrase or sentence of context could be provided to help unaware readers understand that 42% of synaptic connectivity being captured is in the same sort of range as previous datasets like the hemibrain and likely leads to the vast majority of important cell-cell connections being identified (perhaps cite Buhmann et al 2021 Nature Methods which does an analysis of this), and therefore is a reason to think highly of this dataset's quality and its potential for impact on the field. The sentence at the end of this paragraph doesn't quite do it for me.

      We have added the comparison of MANC synapse completeness to that of the Hemibrain, and revised the ending sentence in L234-237.

      Line 271: Clarify what happened to the remaining 15% of DNs that weren't able to be assigned to a tract. They travelled outside the tracts, or data quality issues prevented assignment, or something else?

      Indeed, some DNs could not be assigned to a tract as they traveled outside of all axon tracts and did not bundle with other DNs. We have added this explanation to the text (L300-301).

      Figure 1:

      The pie chart "DN postsynaptic partners by neuron class" is a bit hard to interpret without having another pie chart next to it showing "Neurons in MANC by neuron class". I know these numbers are written on the schematic but it would be nice to be able to easily tell which cell classes are overrepresented or underrepresented in the set of postsynaptic partners of DNs. e.g. It's obvious that ANs are overrepresented and DNs are underrepresented in the set of postsynaptic partners of DNs, but it would be nice if readers didn't have to do any mental math to figure out if INs or MNs are under/overrepresented.

      We agree and have added a pie chart of the neuron class composition of the entire VNC to Figure 1.

      "35.9% of leg MNs are matched to FANC" Why is this number so low? Because FANC motor neurons were only identified in T1, so the remaining 2/3rds of leg MNs in MANC weren't matched? How successful was matching for the neurons where it was actually attempted?

      For this work, we only matched the T1 neurons across the two datasets. This was both a way of checking that we found everything in these segments and a way of being more sure of muscle target assignments as our collaborators in the FANC dataset had generated extensive light level data to match motor neurons with their target leg muscles. The T2 and T3 MNs were not fully proofread or identified in FANC, precluding further analysis, and leading to the 35.9% matched number. We hope to be able to compare between these datasets more thoroughly in future, and have matched all the premotor leg restricted intrinsic neurons of our standard connectome to FANC. We report on their stereotypy in our latest preprint, Stürner, Brooks et al. 2024.

      Figure 2:

      Figure 2A: Perhaps darken the color of the MTD-III skeletons. Currently, they're so light it's hard to see, and this is one of the most interesting tracts because the claim is that it's a new tract.

      We take the reviewer’s point, however, the color scheme used for the tracts in Figure 2 is coordinated between multiple figures and figure panels, and thus we would prefer to keep it as is. If readers would like to examine DNs of a particular tract, we encourage them to retrieve said DNs using the tract annotations in NeuPrint.

      Figure 2 supplement 1: It's not clear to me what I should be getting out of seeing the right side DNs as well. If you want readers to be able to visually compare the left and right side morphologies and appreciate the high degree of symmetry, you may want to put the left and right side DN panels side-by-side. Perhaps do that (show both the left and right side DNs) for one or two tracts in the main Fig2, and then leave out the remaining panels - or if you want to include the remaining panels, explain more clearly what readers are supposed to learn from seeing them.

      We agree and have now removed Figure 2 supplement 1.

      Figure 2C caption: Instead of "DN primary neurites" I think the authors probably mean "longest single branch of each DN" or something along those lines. I think "primary neurite" is usually used to refer to the thick non-synaptic branch coming out of a neuron's soma, which can't be how it's being used here.

      We agree and have changed all references to ‘primary neurite’ for DNs to ‘longest neurite’.

      Figure 2D+E: Perhaps add an overall % of neurons of each class to the legend. I ask because I would be very interested to know what % of all DNs exist as single pairs versus as populations, and I imagine that could be a number that is quoted a fair amount by others in the field when talking about DNs.

      We agree and have added the overall percentage of each neuron class to the results (L275-276) and Figure 2 legend.

      Figure 3:

      UTct.IntTct neurons are by far the largest class of DNxn neurons, so would it be worth calling these the DNxt class (DN projecting to some combination of tectulum neuropils), to mirror the DNxl class? I would vote for doing that.

      Thanks for the suggestion. However, the subclass naming scheme for DNs had been coordinated between multiple groups of people working on MANC reconstruction and annotation. As making changes to subclasses will impact many analyses that have already been completed for existing work, we will refrain from doing so.

      Figure 3G feels a bit out of place in this figure and under-explained

      We have clarified in the text our citations to Figure 3G to better explain our interpretation of this data.

      Figure 4

      "DNp20 has few vesicles and may be electrically coupled": If I'm correct that DNp20 is also known as DNOVS1 and is the second largest diameter axon in the neck after the giant fiber, then yes, Suver et al. 2016 J Neurosci show that this DN is gap junction coupled to neck motor neurons (see their Fig 2F). This neuron (along with the giant fiber) is enough of an outlier that it might be more representative to show a different, more canonical DN that has a low prediction probability.

      The reviewer is right that DNp20 is also known as DNOVS1 with known gap junction coupling. We now clarify in the text (L366) how we think that could lead to a lower neurotransmitter prediction score, which is what we were trying to illustrate.

      Figure 4E: It looks like only a single DN has more inputs (~11000) than outputs (~9000), is that right? It could be interesting to dedicate some panels and text to the connectivity profile of that one unique neuron.

      Yes, that is correct: there is just one pair of DNs, DNxn166, that receives more input than it gives output (the two triangles lie on top of each other). We think that the other DN pair in that same box (more variable in total synapse number, and therefore the triangles are further apart) also receives an unusually high amount of input versus output. The morphology of these two types is shown in Figure 4F, and they both have fine processes that look more like dendrites, especially when compared to other DNs such as the ones in 4G. Unfortunately, neither of these two types has been matched to light microscopy images, so we cannot say whether they have the same type of morphology in the brain, or further explore their brain connectivity, at this time.

      Figure 4E: "black rectangle ... gray rectangle" don't look different shades to me. It's obvious which is which based on where they are in the graph but if you want to color code this, pick more separate colors. Or code it with something other than colors.

      We have made the rectangle in Figure 4E a lighter shade of grey and added labels to refer to the panels D, F and G. The figure legend now also describes more clearly that we are plotting every DN as a single shape and exactly how many DN types are included in those rectangles to avoid confusion.

      Figure 5:

      "subclass is their two-letter muscle anatomical category" should be explained better, I'm not sure what "muscle anatomical category" means.

      We have changed the wording in the Figure 5 legend to better clarify that MN subclasses are the broad muscle category that they innervate (e.g. legs, wings).

      Figure 7:

      Leg MN identification and serial homology.

      Why are there no tarsus reductor (tarm1 and tarm2) motor neurons? Do we not know their anatomy from light microscopy well enough, perhaps? Were these MNs identified in FANC? Is it reasonable to guess that the remaining small number of unidentified T1 leg motor neurons in MANC would control these muscles? I think Marta Moita's lab has some ongoing projects on these muscles (see Twitter), so if more LM data is needed perhaps it will come from them.

      We now know that the small number of unidentified T1 leg motor neurons (a T1 pair with a serial T2 pair, serial set 17664) are not in fact MNs. A new and unpublished dataset (Janelia whole male CNS volume, the optic lobe from which has been published as Nern et al., 2025) shows they have axons within the VNC. The MN annotation for these neurons has been removed and they now have the type name INXXX471. Thus, we have no T1 leg MNs without a muscle target annotated. Our muscle target annotation comes from matching to the FANC dataset that has also not annotated tarsus reductor MNs. We suspect that the tarsus reductor MNs are hard to distinguish from the tarsus depressor MNs of which there are 5 per side and segment.

      It seems there are a few more leg motor neurons in MANC vs FANC. Any indication of which muscles they control?

      See above.

      -Figure 7E: A qualitative comparison between the cosine similarity results here and from FANC could be useful. What generally is the same versus different? Any indication of male/female differences?

      We observe no differences in the cosine similarity of T1 leg MNs between MANC and FANC and only very minor differences between T1, T2 and T3, as shown in Figure 7. In our most recent work, now on bioRxiv (Stürner, Brooks et al., 2024), we were able to find all intrinsic leg serial sets that we included in our standard leg premotor circuit here in the FANC dataset. We do not see any differences between them in terms of morphology, and while we have several cases in which we are still missing 1 of the 6 neurons in a serial set in FANC, we see similar connectivity when comparing small circuits. We have also found almost all neurons interconnecting the legs, with some very interesting exceptions, mainly coming from the abdomen, that we believe are male specific. These male-specific neurons can also be found in this preprint (Stürner, Brooks et al., 2024).

      Figure 8

      Figure 8A: Why are ~1/3rd of the wing and leg motor neurons considered populations instead of pairs? I thought essentially all wing and leg motor neurons have unique morphologies.

      Pairs versus populations are assigned based on MN morphology and connectivity. For the wing MNs, many sets of DVMns and DLMns have near-identical morphology and connectivity, are not easily distinguishable in the VNC, and are therefore categorized as a ‘population’. For the leg MNs, there are ‘true’ population MN types that provide multiple innervation of the same muscle.

      The text states "up to a maximum of 20% [traversal probability] (corresponding to a synapse input fraction of 1)" but I interpret the bottom of Figure 8G to have flipped values, where a synapse input fraction of 0.2 yields a traversal probability of 1. Is there a mistake here or have I misunderstood?

      Thank you for pointing this discrepancy out. The text description was indeed flipped, and we have corrected this error.

      Caption for J says "Layers without neurons are omitted". How is it possible to have a layer without neurons?? Something about how the traversal is done doesn't seem to be explained clearly enough. If it's really possible to have a layer without neurons, I think the approach might need to be revisited as this seems quite strange.

      Here, ‘layer’ should be viewed as a nonlinear measure of indirect connectivity combining path length and synaptic weights. Layers without neurons are possible due to the details of the calculation: layer position is assigned probabilistically by the downstream synapse connectivity of the source neurons, and the probability is scaled up to 1 at an input synapse fraction of 0.2. Neuron-to-neuron connectivity with an input synapse fraction of >=0.2 is very rare in the VNC connectome, and thus neurons strictly assigned to layer 2 downstream of each DN type are similarly rare. We have updated the legend for Figure 8 to better explain this.
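
      The scaling described in this response can be sketched as a small function. The response states only that the probability reaches 1 at an input synapse fraction of 0.2; the linear ramp below that point, the function name and the parameter name are assumptions made for illustration, not a description of the authors' implementation.

```python
def traversal_probability(input_fraction: float, saturation: float = 0.2) -> float:
    """Probability of traversing a connection in one step.

    `input_fraction` is the share of the downstream neuron's input synapses
    that comes from already-traversed neurons. The probability is assumed to
    rise linearly and to be capped at 1 once `input_fraction` reaches
    `saturation` (0.2 in the response above).
    """
    if not 0.0 <= input_fraction <= 1.0:
        raise ValueError("input_fraction must lie in [0, 1]")
    return min(1.0, input_fraction / saturation)


# A neuron receiving 5% of its input from the source is reached with
# probability 0.25 per step; only at >= 20% is it reached with certainty.
p_weak = traversal_probability(0.05)
p_saturated = traversal_probability(0.2)
p_dominant = traversal_probability(1.0)
```

      Under this reading, a neuron is guaranteed to be reached in the first step only when a single source supplies at least 20% of its input, which the response notes is very rare. That is consistent with some layers having no strictly assigned neurons.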

      Section 2.6

      "flies have been shown to walk normally without proprioceptive feedback, suggesting that inter- and intra-leg coordination is not strictly dependent on sensory feedback loops from the legs" is quite a drastic overinterpretation of that paper's results. The ablation there was not complete (some subtypes of sensory neurons were not perturbed), and the perturbed flies certainly walked with some defects. This statement certainly should be removed or significantly softened.

      Thank you for pointing this detail out. The term ‘normally’ has been removed from this sentence to soften the statement.

      Figure 13, Standard leg connectome

      Unfortunately, the motor neurons controlling the tarsus could not be included here, I suppose due to the difficulty in identifying the T2 and T3 homologs for these motor neurons. This should be mentioned in the text. This version of the standard leg connectome is without a doubt still an incredibly valuable discovery, but readers should be made aware that this version of the standard leg connectome does in fact lack the motor neurons for one joint.

      The MNs controlling the tarsus could not be matched with high confidence. We have added a sentence pointing this out when the leg circuit is introduced (L1141-1142).

      The focus here is on locomotion in the absence of other behaviors, whereas the legs are also responsible for grooming, reaching, boxing, etc. How should we consider the leg connectome in light of this?

      This is a very good point, and we have indeed found known grooming neurons that target our leg premotor circuit (L1158-1161). We’ve now added this observation to the Discussion (L1949-1951).

      Minor points

      L84 - re: Descending neurons work together - cite Braun et al., bioRxiv 2023; cite Yang HH bioRxiv 2023.

      We agree that these papers are relevant to the function of DNs in combination, and have added them to the introduction (L83-84, 86-87).

      L193 - "intrepid" is overly florid language; similar for L1507 "enigmatic".

      We have replaced these words with suitable synonyms.

      L273 - The acronym "ITD" is not explained. Please check all other acronyms. Related, it would be good to include a Table or Box with all acronyms for the reader.

      We have added the full name of the ITD to the text. A glossary is available in Figure 1, and a full glossary of MANC terms is available in Table 1 of our sister paper, Marin et al. 2024.

      -L514, you state that hemilineages 6A and 6B unexpectedly produce uncoordinated leg movements (flight-related was expected). However, Harris didn't study animals in tethered flight but headless on the ground.

      The experimental setup of Harris et al. was capable of assessing flight-like motor output even if not true flight, as seen in the predominantly wing movement phenotypes of activating hemilineages 7B, 11A/B and 2A. We now also note that hemilineage annotation in Marin et al., 2024, shows that the 6B hemilineage has some projections into the leg neuropils, in support of a leg motor role in addition to an upper tectular role (L570-571).

      L1425 - "the TTM" is repeated twice.

      This sentence addresses both the TTM and its MN (TTMn). We have revised this sentence to improve clarity by expanding the full name of TTM in that paragraph and leaving TTMn abbreviated.

      L1728 - Ascending neuron projections to the brain - cite Chen et al., Nat Neuro 2023.

      We agree that Chen et al. 2023 is relevant to the discussion of AN function, and have added this citation (L1836-1838).

      L1817, It is a good idea to compare with previous predictions for circuit control. But these originate from non-Drosophila work as well. Please cite and consider the original models from Buschges, Cruse, Holmes, and others.

      Thanks for the suggestion. We now cite the non-Drosophila literature as well (L1971).

      L1827, how precisely should these "theories" be updated? Be explicit.

      We summarize in the sentences before what is different in comparison to one of the suggested models. We have now additionally added examples to the sentence (L1942-1945) to suggest that theoretical leg circuits need to account for the posterior-to-anterior as well as anterior-to-posterior connections between leg neuropils, as well as relative lack of connectivity between the left and right mesothoracic leg neuropils.

      L1831, include a discussion about another alternative which is through mechanical coupling and sensory feedback.

      We agree that leg sensory input likely contributes to leg locomotor circuits. We have added a sentence pointing out that annotations of sensory neurons in MANC are available through work in a companion paper (Marin et al. 2024), and that future work is necessary to examine the contribution of sensory input to leg motor circuits (L1954-1956).

      Methods

      https://flyconnectome.github.io/malevnc/ link doesn't work.

      We have updated the link.

    1. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The paper by Lee and Ouellette explores the role of cyclic-di-AMP in chlamydial developmental progression. The manuscript uses a collection of different recombinant plasmids to up- and down-regulate c-di-AMP production, and then uses classical molecular and microbiological approaches to examine the effects of expression induction in each of the transformed strains.

      Strengths: 

      This laboratory is a leader in the use of molecular genetic manipulation in Chlamydia trachomatis, and their efforts to make such approaches mainstream are commendable. Overall, the model described and defended by these investigators is thorough and significant.

      Thank you for these comments.

      Weaknesses: 

      The biggest weakness in the document is their reliance on quantitative data that is statistically not significant, in the interpretation of results. These challenges can be addressed in a revision by the authors. 

      Thank you for these comments. We point out that, while certain RT-qPCR data may not be statistically significant, our RNAseq data indicate late genes are, as a group, statistically significantly increased when increasing c-di-AMP levels and decreased when decreasing c-di-AMP levels. We do not believe running additional experiments to “achieve” statistical significance in the RT-qPCR data is worthwhile. We hope the reviewer agrees with this assessment.

      We have also included new data in this revised manuscript, which we believe further strengthens aspects of the conclusions linked to individual expression of full-length DacA isoforms. We have also quantified inclusion areas and bacterial sizes for critical strains.

      Reviewer #2 (Public review): 

      Summary: 

      This manuscript describes the role of c-di-AMP production in the chlamydial developmental cycle. Chlamydia are obligate intracellular bacterial pathogens that rely on eukaryotic host cells for growth. The chlamydial life cycle depends on a developmental cycle that produces phenotypically distinct cell forms with specific roles during the infectious cycle. The RB cell form replicates, amplifying chlamydia numbers, while the EB cell form mediates entry into new host cells, disseminating the infection to new hosts. Regulation of cell form development is a critical question in chlamydia biology and pathogenesis. Chlamydia must balance amplification (RB numbers) and dissemination (EB numbers) to maximize survival in its infection niche. The main findings in this manuscript show that overexpression of the dacA-ybbR operon results in increased production of c-di-AMP and early expression of the transitionary gene hctA and late gene omcB. The authors also knocked down the expression of the dacA-ybbR operon and reported a reduction in the expression of both hctA and omcB. The authors conclude with a model suggesting that the amount of c-di-AMP determines the fate of the RB: continued replication or EB conversion. Overall, this is a very intriguing study with important implications; however, the data are very preliminary, and the model is very rudimentary and not well supported by the data.

      Thank you for your comments. Chlamydia is not an easy experimental system, but we have done our best to address the reviewer’s concerns in this revised submission.

      Describing the significance of the findings: 

      The findings are important and point to very exciting new avenues to explore the important questions in chlamydial cell form development. The authors present a model that is not quantified and does not match the data well. 

      Describing the strength of evidence: 

      The evidence presented is incomplete. The authors do a nice job of showing that overexpression of the dacA-ybbR operon increases c-di-AMP and that knockdown or overexpression of the catalytically dead DacA protein decreases the c-di-AMP levels. However, the effects on the developmental cycle and how they fit the proposed model are less well supported. 

      dacA-ybbR ectopic expression: 

      For the dacA-ybbR ectopic expression experiments they show that hctA is induced early but there is no significant change in OmcB gene expression. This is problematic as when RBs are treated with Pen (this paper) and (DOI 10.1128/MSYSTEMS.00689-20) hctA is expressed in the aberrant cell forms but these forms do not go on to express the late genes suggesting stress events can result in changes in the developmental expression kinetic profile. The RNA-seq data are a little reassuring as many of the EB/Late genes were shown to be upregulated by dacA-ybbR ectopic expression in this assay.

      As the reviewer notes, we also generated RNAseq data, which validate that late gene transcripts (including sigma28 and sigma54 regulated genes) are statistically significantly increased earlier in the developmental cycle in parallel to increased c-di-AMP levels. The lack of statistical significance in the RT-qPCR data for omcB, which shows a trend of higher transcripts, is less concerning given the statistically significant RNAseq dataset. We have reported the data from three replicates for the RT-qPCR and do not think it would be worthwhile to run more replicates in an attempt to “achieve” statistical significance.

      We recognize that hctA may also increase during stress as noted by the Grieshaber Lab. In re-evaluating these data, we decided to remove the Penicillin-linked studies from the manuscript since they detract from the focus of the story we are trying to tell given the potential caveat the reviewer mentions.

      The authors also demonstrate that this ectopic expression reduces the overall growth rate but produces EBs earlier in the cycle but overall fewer EBs late in the cycle. This observation matches their model well as when RBs convert early there is less amplification of cell numbers. 

      dacA knockdown and dacA(mut) 

      The authors showed that dacA knockdown and ectopic expression of the dacA mutant both reduced the amount of c-di-AMP. The authors show that for both of these conditions, hctA and omcB expression is reduced at 24 hpi. This was also partially supported by the RNA-seq data for the dacA knockdown as many of the late genes were downregulated. However, a shift to an increase in RB-only genes was not readily evident. This is maybe not surprising as the chlamydial inclusion would just have an increase in RB forms and changes in cell form ratios would need more time points.

      Thank you for this comment. We agree that it is not surprising given the shift in cell forms. The reduction in hctA transcripts argues against a stress state as noted above by the reviewer, and the RNAseq data from dacA-KD conditions indicate at least that secondary differentiation has been delayed. We agree that more time points would help address the reviewer’s point, but the time and cost to perform such studies are prohibitive with an obligate intracellular bacterium.

      Interestingly, the overall growth rate appears to differ in these two conditions, growth is unaffected by dacA knockdown but is significantly affected by the expression of the mutant. In both cases, EB production is repressed. The overall model they present does not support this data well as if RBs were blocked from converting into EBs then the growth rate should increase as the RB cell form replicates while the EB cell form does not. This should shift the population to replicating cells. 

      We agree that it seems that perturbing c-di-AMP production by knockdown or overexpressing the mutant DacA(D164N) has different impacts on chlamydial growth. We have generated new data, which we believe addresses this. Overexpressing membrane-localized DacA isoforms is clearly detrimental to chlamydiae as noted in the manuscript. However, when we removed the transmembrane domain and expressed N-terminal truncations of these isoforms, we observed no effects of overexpression on chlamydial morphology or growth. Importantly, for the wild-type full-length or truncated isoforms, overexpressing each resulted in the same level of c-di-AMP production, further supporting that the negative effect of overexpressing the wild-type full-length is linked to its membrane localization and not c-di-AMP levels. These data have been included as new Figure 3. These data indicate that too much DacA in the membrane is disruptive and suggest that the balance of DacA to YbbR is important since overexpression of both did not result in the same phenotype. This is further described in the Discussion.

      As it relates to knockdown of dacA-ybbR, we have essentially removed/reduced the amount of these proteins from the membrane and have blocked the production of c-di-AMP. This is fundamentally different from overexpression.

      Overall this is a very intriguing finding that will require more gene expression data, phenotypic characterization of cell forms, and better quantitative models to fully interpret these findings. 

      Reviewer #1 (Recommendations for the authors): 

      There is a generally consistent set of experiments conducted with each of the mutant strains, allowing a straightforward examination of the effects of each transformant. There are a few general and specific things that need to be addressed for both the benefit of the reader and the accuracy of interpretation. The following is a list of items that need to be addressed in the document, with an overall goal of making it more readable and making the interpretations more quantitatively defended. 

      Specific comments: 

      (1) The manuscript overall is wordy, and there are quite a few examples of text in the results that should be in the discussion (examples include lines 224-225, 248-262, 282-288, 304-308). The manuscript overall could use careful editing for verbosity. 

      Thank you for this comment. We have removed some of the indicated sentences. However, to maintain the flow and logic of the manuscript, some statements may have been preserved to help transition between sections. As far as verbosity, we have tried to be as clear as possible in our descriptions of the results to minimize ambiguity. Others who read our manuscript appreciated the thoroughness of our descriptions.

      (2) There is also a trend in the document to base fact statements on qualitative and quantitative differences that do not approach statistical significance. Examples of this include the following lines: 156-158, 190-192, 198-199, 230-232, 239-242, 292-293. This is something the authors need to be careful about, as these statistically insignificant differences may tend to multiply a degree of uncertainty across the entire manuscript. 

      We have quantified inclusion areas and tried to remove instances of qualitative assessments as noted by the reviewer. In regards to some of the transcripts, we can only report the data as they are. In some cases, there are trends that are not statistically significant, but it would seem to be inaccurate to state that they were unchanged. In other cases, a two-fold or less difference in transcript levels may be statistically significant but biologically insignificant. A reader can and should make their own conclusions.

      (3) Any description of inclusion or RB size being modestly different needs to be defended with microscopic quantification. 

      We have quantified inclusion areas and RB sizes and tried to remove instances of qualitative assessments as noted by the reviewer.

      (4) It would be very helpful to reviewers if there was a figure number added to each figure in the reviewer-delivered text. 

      Added.

      (5) Figure 1A: This should indicate that the genes indicated beneath each developmental form are on high (I think that is what that means). 

      We have reorganized Figure 1 to better improve the flow.

      (6) Figure 1B is exactly the same as the three images in Figure 8B. I would delete this in Figure 1. This relates to comment 9. 

      We presented this intentionally to clearly illustrate to the reader, who may not be knowledgeable in this area, what we propose is happening in the various strains. As such, we respectfully disagree and have left this aspect of the figure unchanged.

      (7) Figure 1D: It is not clear if the period in E.V has any meaning. I think this is just a typo. Also, the color coding needs to be indicated here. What do the gray bars represent? The labeling for the gene schematic for dacA-KDcom should not be directly below the first graph in D. This makes the reader think this is a label for the graph. This can be accomplished if the image in panel B is removed and the first graph in panel D is moved into B. This will make a better figure. 

      We have reorganized Figure 1 to better improve the flow.

      (8) Figure 2 C, G: The utility of these panels is not clear. For them to have any value, they need to be expressed in genome copies. If they are truly just a measure of chlamydia genomic DNA, they have minimal utility to the reader. There are similar panels in several other figures. 

      We have reported genome copies as suggested in lieu of ng gDNA for these measurements. Importantly, it does not alter any interpretations.

      (9) I am not sure about the overall utility of Figure 8. Granted, a summary of their model is useful, but the cartoons in the figure are identical or very nearly identical to model figures shown in two other publications from the same group (PMID: 39576108, 39464112) These are referenced at least tangentially in the current manuscript (Jensen paper- now published- and ref 53). Because the model has been published before, if they are to be included, there needs to be a direct comparison of the results in each of these three papers, as they basically describe the same developmental process. The model images should also be referenced directly to the first of the other papers.

      This was intentional so that readers familiar with our work will see the similarities between these systems. We have added additional comments in the Discussion related to our newly published work. As an aside, Dr. Lee generated the first version of the figure that was adapted by others in the lab. It is perhaps unlucky that those other studies have been published before his work.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This work by Ding et al uses agent-based simulations to explore the role of the structure of molecular motor myosin filaments in force generation in cytoskeletal structures. The focus of the study is on disordered actin bundles which can occur in the cell cytoskeleton and have also been investigated with in vitro purified protein experiments.

      Strengths:

      The key finding is that cooperative effects between multiple myosin filaments can enhance both total force and the efficiency of force generation (force per myosin). These trends were possible to obtain only because the detailed structure of the motor filaments with multiple heads is represented in the model.

      We appreciate your comments about the strength of our study. 

      Weaknesses:

      It is not clearly described what scientific/biological questions about cellular force production the work answers. There should be more discussion of how their simulation results compare with existing experiments or can be tested in future experiments.

      Please see our response to the comment (1) below.

      The model assumptions and scientific context need to be described better.

      We apologize for the insufficient descriptions about the model and the scientific context. We revised the manuscript to better explain model assumptions and scientific context as described in our responses below.

      The network contractility seems to be a mere appendix to the bundle contractility which is presented in much more detail.

      Please see our response to the comment (6) below.

      Reviewer #1 (Recommendations for the authors):

      (1) It is not clearly described what scientific/biological questions about cellular force production the work answers. There should be more discussion of how their simulation results compare with existing experiments, or can be tested in future experiments. The authors do briefly mention Reference 4 where different myosin isoforms were used, but it is not clear that these experiments support the scalings predicted in this work in Figures 3-6. Also, the experiments in Ref. 4 apparently did not involve passive crosslinkers (ACPs) which are key in this study.

      Thank you for the comment. In the 5th paragraph of the discussion section of the original manuscript, we applied our findings to understand how structural differences between ventral stress fibers and actin arcs could affect force generation. In addition, at the end of the discussion section, we mentioned that experiments with artificially-made myosin thick filaments could be used for verifying our results. 

      The experiments in Ref. 4 were the only ones with which we could directly compare our results. In a previous study, actomyosin bundles were experimentally created with ACPs (K.L. Weirich et al., Biophys J, 2021, 120: 1957-1970), but the motions of myosin thick filaments were the only quantities measured in the experiments. In general, measuring forces generated by in vitro actomyosin bundles is very challenging. This is why the predictions from our model are particularly valuable for understanding the force generation of actomyosin structures. 

      (2) The architecture of the bundles seems to be prescribed by hand in these simulations. Several well-known stochastic aspects of the dynamics of actin and actin-binding proteins are not included in the model. For example, there is no remodeling of the actin structures through actin polymerization and depolymerization, or crosslink (ACP) binding and unbinding. Can the authors comment on why these effects could be neglected for the questions they want to address?

      Thank you for the comment. We previously showed that the force generation process in actomyosin networks and bundles is affected by actin dynamics (Q. Yu et al., Biophys J, 2018, 115: 2003-2013) and the unbinding of ACPs (T. Kim, Biomech Model Mechanobiol, 2015, 14(2): 345-355 and W. Jung et al., Comput Part Mech, 2015, 2(4): 317-327). 

      However, we did not include the actin dynamics and the ACP unbinding in the current study to clearly understand the effects of the structural properties of thick filaments on the force generation process. We have learned that the stochastic behaviors of cytoskeletal components lead to noisier results, which requires us to run a much larger number of simulations to obtain statistically convincing data. We added the following paragraph in the discussion section of the revised manuscript:

      “Although this study focused mainly on parameters related to motor structures, we expect that other parameters would affect the force generation process. For example, as we showed before, a decrease in ACP density would reduce forces by deteriorating connectivity between filaments. With very low ACP density, some of neighboring motors may not have ACPs between them, thus adding up their forces as shown in Fig. 2. However, such low ACP density may not maintain the structure of bundles or cross-linked networks well. In addition, the force-dependent unbinding of ACPs could change the spatial distribution of ACPs during force generation. If they behave as a slip bond which unbinds more frequently with higher forces, ACPs may not stay between two motors for long time due to high tension. Then, forces generated by two motors may have a higher chance to add up. By contrast, if they behave as a catch bond which unbinds less frequently with larger forces, more ACPs will be recruited between two motors, reducing a chance to add up forces. The length of actin filaments is unlikely to affect the force generation process significantly unless filaments are very short. Additionally, as we showed before, actin turnover would reduce forces by competing with motor activities, change connectivity between filaments over time, and prevent motors from being stalled for long time, all of which could affect force generation.”

      (3) The present study is confined to the fixed density of motors and ACPs. However, these can be easily varied in in vitro experiments. Works such as Reference 4 show an optimum in contractility vs myosin concentration. Myosins act not only to slide actin filaments but also crosslink them.

      Can the authors vary myosin concentration to demonstrate such effects in their model?

      As the reviewer pointed out, there is a belief that myosin thick filaments can serve as crosslinkers as well. However, unless there is a fraction of dead myosins (which remain bound on filaments without walking) or myosins dwell at the barbed ends of filaments for a very long time, it looks very hard for bundles or networks to generate large forces. A former experiment showed that active myosin increases the viscosity of actin networks, not the elasticity (D. Humphrey et al., Nature, 2002, 416: 413-416). Computer simulations with reasonable assumptions did not show significant force generation without cross-linkers. We have tested systems with a large number of motors and a few cross-linkers in previous studies (T. Kim, Biomech Model Mechanobiol, 2015, 14(2): 345-355 and W. Jung et al., Comput Part Mech, 2015, 2(4): 317-327). We observed that large force/stress was generated momentarily, but it relaxed very fast. We expect similar outcomes if we try such conditions in the current study.

      (4) Why is there a (factor of 1.5-2) discrepancy in the measured (Ftot) and estimated (Fest) force values in Figure 4-6? How can the authors improve their scaling arguments to capture this? What about the estimated efficiency?

      Thank you for the comment. Indeed, there was a discrepancy between the actual and estimated forces. When the estimated force was calculated, we used the z positions of motors without consideration of the actual bundle geometry with multiple filaments. For example, if two motors are located on the opposite sides of the bundle (i.e., if they are located far from each other in x or y direction), forces generated by them may not counterbalance each other. Then, the estimated force can be smaller than the actual force because counterbalance between motors can be overcounted. The original manuscript had the following sentences to clarify this point: “F<sub>est</sub> was generally smaller than F<sub>tot</sub> because this analysis does not account for actual bundle geometry consisting of multiple F-actins; if two motors are located far from each other in x or y direction, they may not counterbalance or add up forces. Nevertheless, we found that F<sub>est</sub> captures the overall dependence of F<sub>tot</sub> on parameters well.”

      (5) Several choices of parameter values used in the simulations are not clear:

      a) Why consider F actin of 140 nm specifically? Actin can come in a range of lengths. How do their results depend upon the length scale of actin?

      It seems that there is a misunderstanding. 140 nm is the equilibrium length of one actin segment in our model. The actual F-actin consists of multiple actin segments. The length of F-actin was 9 μm in bundle simulations and 10 μm (average) in network simulations. We expect that the general tendency of our results would not change with different filament lengths. However, if the filament length becomes too short, the force generation process would be impaired due to a lack of connectivity between filaments. 
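
The distinction between segment length and filament length can be made concrete with a short calculation. The sketch below only divides the filament lengths quoted in the response by the 140 nm segment length; the function name is illustrative and is not taken from the authors' code.

```python
# Rough check of how many 140 nm segments make up one model filament.
SEGMENT_NM = 140.0            # equilibrium length of one actin segment (from the response)
BUNDLE_FILAMENT_NM = 9_000.0  # F-actin length in bundle simulations (from the response)
NETWORK_FILAMENT_NM = 10_000.0  # average F-actin length in network simulations


def segments_per_filament(filament_nm, segment_nm=SEGMENT_NM):
    """Return the (unrounded) number of segments needed to span a filament."""
    return filament_nm / segment_nm


bundle_segments = segments_per_filament(BUNDLE_FILAMENT_NM)
network_segments = segments_per_filament(NETWORK_FILAMENT_NM)
print(round(bundle_segments, 1), round(network_segments, 1))
```

Under these numbers each filament corresponds to several tens of segments, so the 140 nm value sets the discretization of a filament and not its overall length.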

      b) Similarly, very specific values of myosin backbone length (42 nm), number of myosin heads (8), number of arms (24), and Actin Cross-linking Proteins (ACPs). What informs these values and how will the results change if they are different? It is not especially clear how an "Arm" differs from "heads" and what kind of coarse-graining is involved.

      In the “model overview” section of the original manuscript, we mentioned the following to clarify the definitions of motor arms and motor heads: 

      “To mimic the structure of bipolar filaments, each motor has a backbone, consisting of serially linked segments, and two arms on each endpoint of the backbone segments that represent 8 myosin heads (N<sub>h</sub> = 8).”

      We devised this coarse-graining scheme of myosin thick filaments in our previous work (T. Kim, Biomech Model Mechanobiol, 2015, 14(5): 1143-1155). Through extensive tests, we showed that force generation and motor behaviors are largely independent of coarse-graining level. In other words, a motor with the same value of N<sub>h</sub>N<sub>a</sub> leads to similar outcomes regardless of the value of N<sub>a</sub>. However, in a bundle with multiple filaments, each motor has a sufficient number of arms to ensure simultaneous interactions with those filaments. This is why we decided to use N<sub>h</sub> = 8 and N<sub>a</sub> = 24. 

      To match the length of thick filaments and the total number of heads (N<sub>h</sub>N<sub>a</sub>) in the model with real myosin thick filaments, we have used 42 nm for each backbone length. Varying this length is equivalent to a variation in L<sub>sp</sub> that we did for Fig. 6.

      We used high ACP density to ensure connections between all neighboring pairs of actin filaments. We already showed how the presence of ACPs affects the force generation process in Fig. 2 using two actin filaments. It is expected that a variation of ACP density would affect our results to some extent. Since the main focus of the current study is the structural properties of motors, we did not explore the effects of ACP density. I hope that the reviewer would understand our intention. 

      (6) The manuscript focuses on disordered bundles with only one figure on networks. However, actin fibers also ubiquitously exist as disordered networks, and it is important to explore in more detail the contractile forces in such network arrangements.

      We appreciate the comment. Because we plan to delve into the effects of motor structures on the force generation in networks as a follow-up study, we showed the minimal results in the current study to prove the generality of our findings. I hope that the reviewer would understand our intention and plan.

      It is not described very clearly how these networks were generated.

      We apologize for lack of explanation about how the networks were generated. We added the following section in Supplementary Text of the revised manuscript:

      “Network assembly

      Unlike F-actin in bundle simulations, F-actin in network simulations is formed by stochastic processes as in our previous studies. The formation of F-actin is initiated from a nucleation event with a rate constant, k<sub>n,A</sub>, with the appearance of one cylindrical segment in a random position with a random orientation perpendicular to the z direction. The polymerization of F-actin is simulated by adding cylindrical segments at the barbed end of existing filaments with a rate constant, k<sub>p,A</sub>. The ratio of k<sub>n,A</sub> to k<sub>p,A</sub> is adjusted to result in the average filament length of ~10 μm. The rest of the assembly process is identical to that described in the main text.”
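
How the ratio of nucleation to polymerization rates sets the mean filament length can be illustrated with a toy model. The sketch below is not the authors' algorithm: it assumes a fixed pool of segments, treats `k_n` and `k_p` as dimensionless relative propensities, ignores positions and orientations, and uses hypothetical function names throughout.

```python
import random

SEGMENT_NM = 140.0  # segment length quoted elsewhere in the response


def assemble(total_segments, k_n, k_p, seed=0):
    """Distribute a fixed pool of segments over filaments (toy model).

    Each segment either nucleates a new filament (propensity k_n) or
    elongates a randomly chosen existing filament (propensity k_p per filament).
    Returns a list with the number of segments in each filament.
    """
    rng = random.Random(seed)
    filaments = []
    for _ in range(total_segments):
        growth = k_p * len(filaments)
        # With no filaments yet, growth is zero and nucleation is certain.
        if rng.random() * (k_n + growth) < k_n:
            filaments.append(1)
        else:
            filaments[rng.randrange(len(filaments))] += 1
    return filaments


def mean_length_um(filaments):
    """Mean filament length in micrometers."""
    return SEGMENT_NM * sum(filaments) / len(filaments) / 1000.0


short = assemble(5000, k_n=100.0, k_p=1.0)  # nucleation-dominated
long_ = assemble(5000, k_n=1.0, k_p=1.0)    # polymerization-dominated
print(mean_length_um(short), mean_length_um(long_))
```

In this toy setting, lowering the nucleation propensity relative to polymerization yields fewer and longer filaments, which is the qualitative lever the quoted text describes when the ratio is tuned toward a target mean length.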

      Crosslinked biopolymers like actin typically form disordered elastic networks with their coordination number below the rigidity percolation threshold (z=4 in 2D); see, for example, the review by Broedersz and MacKintosh, Rev. Mod. Phys. 2013. Such networks should exist in the bending-dominated regime, where bending forces play a vital role in force propagation. Was that observed in the simulations? Why or why not?

      We appreciate the comment. We are aware of the bending-dominated regime and indeed showed the importance of the bending stiffness of actin filaments at low shear strain level in our previous work (T. Kim et al., PLOS Comput Biol, 2009, 5(7): e1000439). In case of active networks with motors, such a bending-dominated regime has not been observed without external shear strain. Instead, buckling of actin filaments was found to be essential for breaking symmetry between tensile and compressive forces developed by motor activities. We have shown that the free contraction of networks is inhibited if filament bending stiffness is increased substantially (J. Li et al., Soft Matter, 2017, 13: 3213-3220 and T. Bidone et al., PLOS Comput Biol, 2017, 13(1): e1005277). We expect that contractile forces generated by bundles or networks will be reduced significantly if we highly increase bending stiffness. However, considering the focus of the current study is on the structural properties of motors, we did not perform such simulations. 

      (7) It would be interesting to see the simulated predictions of the bundle or network contraction dynamics. This can be done by changing to free boundary conditions so that the bundle can contract.

      Thank you for the suggestion. We have previously investigated the free contraction of actomyosin networks with different motor density and ACP density (J Li et al., Soft Matter, 2017, 13: 3213). We observed that the rate of network contraction was higher with more motors and ACPs. However, we did not test the effects of the structural properties of thick filaments in the previous study. We plan to investigate the effects in future studies because the focus of the current study is the force generation process. Please note that in the discussion section of the original manuscript, we mentioned the following:

      “Although we focused on force generation, the contractile behaviors of actomyosin structures (i.e., a decrease in length) have also been of great interest. Our model can be used to study such contractile behaviors by deactivating the periodic boundary condition and removing connection between one end of bundle/network and a domain boundary as done previously [20]. To achieve higher contractile speed with the same total number of myosin heads, the existence of multiple contractile units would be better as suggested in a previous work [4]. This means that there is a trade-off between force generation and contractile speed. Previous studies also showed that the contractile speed of networks is proportional to motor density [18, 43, 51]. We may be able to use our model to systematically investigate how the contractile speed is regulated by parameters that we tested in this study, including the number, distribution, length, and structure of motors.”

      Minor suggestions for improvement:

      (1) What are the vertical markers in Figures 1E and F? They should be labelled. if they are crosslinkers, it is not clear why the color is different from Figure 1A and B.

      We believe that the reviewer meant Figs. 2E, F. Those vertical lines are indeed ACPs (crosslinkers). We changed the color of ACPs in Fig. 1A and Fig. 2B-D to purple to be consistent. In addition, we changed the colors of two filaments in Figs. 2B-D slightly to be consistent with Fig. 2E.

      (2) To help understanding, please include a figure showing how forces are measured.

      We added Fig. S1 in the revised manuscript to explain how the bundle force is calculated.

      (3) It should be possible to extend the scaling arguments to predict what is the crossover myosin density (N_M) in Figure 4a at which the efficiency changes from going as 1/N_M to saturating. 

      As the reviewer might have observed, the slope of the efficiency in Fig. 4A gradually changes, rather than showing a sharp transition. Thus, it is hard to define one crossover myosin density. 

      Similarly, what are the slopes in Figure 6a-b?

      We drew the reference lines in those two plots. Unfortunately, we do not have explanations about the origin of these slopes.

      (4) Some more explanation for the observed values should be added. Figure 4: Why does efficiency plateau at a value close to 0.8 in (A)? 

      We assume that the reviewer meant the plateau of η close to 0.08, not 0.8. Our speculation for the origin of this plateau value is related to L<sub>M</sub> (= 462 nm under the reference condition). Ideally, ~43 motors are required to cover the entire length of the bundle (= 20 μm). Under this condition, η is ~0.023. Although this is not 0.08, we believe that these two values are related to each other. For example, if we increase L<sub>M</sub>, this plateau level would increase. We added the following sentences in the result section of the revised manuscript:

      “The plateau level of η at ~0.08 is related to the minimum number of motors required for saturating an entire bundle, implying that the plateau level would be higher if each motor is longer.”
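
The two numbers quoted in this response can be reproduced with a short calculation. The sketch below assumes that the quoted η of ~0.023 is simply the reciprocal of the number of motors needed to span the bundle; the response does not state this explicitly, so the second function is an interpretation, and both function names are illustrative.

```python
# Back-of-envelope check of the motor count and efficiency quoted in the response.
BUNDLE_LENGTH_NM = 20_000.0  # bundle length, 20 um (from the response)
MOTOR_LENGTH_NM = 462.0      # L_M under the reference condition (from the response)


def motors_to_cover(bundle_length_nm, motor_length_nm):
    """Number of motors laid end to end that span the bundle (unrounded)."""
    return bundle_length_nm / motor_length_nm


def reciprocal_efficiency(n_motors):
    """Assumed interpretation: efficiency equals 1 / (number of motors)."""
    return 1.0 / n_motors


n_cover = motors_to_cover(BUNDLE_LENGTH_NM, MOTOR_LENGTH_NM)
eta_cover = reciprocal_efficiency(n_cover)
print(round(n_cover, 1), round(eta_cover, 3))
```

Under this reading the calculation gives roughly 43 motors and η of about 0.023, matching the values in the response; the gap to the observed plateau near 0.08 is the part the authors leave as speculation.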

      Figure 5: Overlapping between motors seems to increase the total force applied by them because of cooperative effects. However, it is not abundantly clear why that should peak at a value of f = 0.06.

      As shown in Fig. 5B, smaller f always results in higher F<sub>tot</sub> due to higher level of cooperative overlap. The minimum value of f we tested in this study was 0.06, so F<sub>tot</sub> was maximal at f = 0.06.

      (5) Why is the network force expected to scale approximately as sqrt(N_M)? Is it because of the 2D geometry where the number of motors along the x or y-direction scale as sqrt(N_M)?

      We initially thought that the weaker dependence of the total force on N<sub>M</sub> was related to the random orientations of motors. However, if the network is fully saturated with motors, the inclusion of more motors will increase forces in both x and y directions almost linearly, resulting in the direct proportionality of F<sub>tot</sub> to N<sub>M</sub>. Our new hypothesis for weaker dependence is consistent with the reviewer’s speculation; the network is not fully saturated even with 1000 motors, so the entire regime shown in Fig. 7B corresponds to that with N<sub>M</sub> < 100 in Fig. 4A where similar weaker dependence on N<sub>M</sub> was observed. We added the following sentence in the result section of the revised manuscript to clarify this point:

      “the average number of motors in each direction which can experience the cooperative overlap would be ~. Maximal N<sub>M</sub> tested with the network was ~2,500, so the dependence of F<sub>tot</sub> on N<sub>M</sub> with the network is similar to that with N<sub>M</sub> < ~50 with the bundle (Fig. 4A).”

      (6) Figures 6 D and A: Figure 6D suggests that there is a more full overlap in the cases where there was a longer bare zone or larger spacing between motor arms. However, the quantification of the total force in A shows that the force is highest for the case where LM was increased by increasing the number of arms. Why do the authors think that is? I would expect from the explanation in Fig 6D that the Lsp and Lbz would be higher than Na in Fig 6A.

      Fig. 6D shows a difference in the level of the cooperative overlap () between two motors. As the reviewer pointed out, the case with more arms shows the lowest , resulting in the lowest as we showed in Fig. S2B. However, as shown in Eq. 7, the total force is a function of both N<sub>a</sub> and . Thus, due to higher N<sub>a</sub> and lower , the force in the case with different N<sub>a</sub> can be similar to that in the case with different L<sub>bz</sub>. In the original manuscript, we had the following sentence to explain how the force can be similar between the two cases: 

      “Thus, was higher (Fig. S2B, blue), resulting in higher F<sub>tot</sub> and η despite smaller N<sub>a</sub>.”

      Reviewer #2 (Public review):

      Summary:

      In this study, the authors use a mechanical model to investigate how the geometry and deformations of myosin II filaments influence their force generation. They introduce a force generation efficiency that is defined as the ratio of the total generated force and the maximal force that the motors can generate. By changing the architecture of the myosin II filaments, they study the force generation efficiency in different systems: two filaments, a disorganized bundle, and a 2D network. In the simple two-filament systems, they found that in the presence of actin crosslinking proteins motors cannot add up their force because of steric hindrances. In the disorganized bundle, the authors identified a critical overlap of motors for cooperative force generation. This overlap is also influenced by the arrangement of the motor on the filaments and influenced by the length of the bare zone between the motor heads.

      Strengths:

      The strength of the study is the identification of organizational principles in myosin II filaments that influence force generation. It provides a complementary mechanistic perspective on the operation of these motor filaments. The force generation efficiency and the cooperative overlap number are quantitative ways to characterize the force generation of molecular motors in clusters and between filaments. These quantities and their conceptual implications are most likely also applicable in other systems.

      Thank you for the comments about the strength of our study. 

      Weaknesses:

      The detailed model that the authors present relies on over 20 numerical parameters that are listed in the supplement. Because of this vast amount of parameters, it is not clear how general the findings are. On the other hand, it was not obvious how specific the model is to myosin II, meaning how well it can describe experimental findings or make measurable predictions. The model seems to be quantitative, but the interpretation and connection to real experiments are rather qualitative in my point of view.

      As the reviewer mentioned, all agent-based computational models for simulating the actin cytoskeleton inevitably involve such a large number of parameters. Some of the parameter values are not well known, so we have tuned our parameter values carefully by comparing our results with experimental observations in our previous studies since 2009. We were aware of the importance of a rigorous representation of the unbinding and walking rates of myosin motors, so we implemented the parallel cluster model, which can predict those rates with consideration of the mechanochemical rates of myosin II, into our model. Thus, we are confident that our motors represent myosin II.

      In our manuscript, our results were compared with prior observations in Ref. 4 (Thoresen et al., Biophys J, 2013) several times. In particular, larger force generation with more myosin heads per thick filament was consistent between the experiment and our simulations. 

      Our study can make various predictions. First, our study explains why non-muscle myosin II in stress fibers shows focal distributions rather than uniform distributions; if they stay closely, they can generate much larger forces in the stress fibers via the cooperative overlap. Our study also predicts a difference between bipolar structures (found in skeletal muscle myosins and nonmuscle myosins) and side polar structures (found in smooth muscle myosins) in terms of the likelihood of the cooperative overlap. As shown below, myosin filaments with the bipolar structure can add up their forces better than those with the side polar structure when their overlap level is the same.

      Author response image 1.

       

      It was often difficult for me to follow what parameters were changed and what parameters were set to what numerical values when inspecting the curve shown in the figures. The manuscript could be more specific by explicitly giving numbers. For example, in the caption for Figure 6, instead of saying "is varied by changing the number of motor arms, the bare zone length, the spacing between motor arms", the authors could be more specific and give the ranges: "is varied by changing the number of motor arms from ... to ..., the bare zone length from ... to ..., and the spacing between motor arms from ... to ...".

      This unspecificity is also reflected in the text: "We ran simulations with a variation in either L<sub>sp</sub> or L<sub>bz</sub>" What is the range of this variation? "When L<sub>M</sub> was similar" similar to what? "despite different N<sub>M</sub>." What are the different values for N<sub>M</sub>? These are only a few examples that show that the text could be way more specific and quantitative instead of qualitative descriptions.

      We appreciate the comment. In the revised manuscript, we specified the range of the variation in each parameter.

      In the text, after equation (2) the authors discuss assumptions about the binding of the motor to the actin filament. I think these model-related assumptions and explanations should be discussed not in the results section but rather in the "model overview" section.

      Thank you for pointing this out. In the original manuscript, we described all the details of the model in Supplementary Material. We feel that the assumptions about interactions between motors and actin filaments are too detailed to be included in the model overview section.

      The lines with different colors in Figure 2A are not explained. What systems and parameters do they represent?

      The different colors used in Fig. 2A were used for distinguishing 20 cases. We added the explanation about the colors in the figure caption in the revised manuscript.

      Reviewer #2 (Recommendations for the authors):

      To guarantee the reproducibility of the results, I recommend that the authors publish their simulation code on GitHub.

      We appreciate the reviewer’s suggestion. Following the suggestion, we prepared and posted the code on GitHub as mentioned in the Data Availability of the revised manuscript: “The source code of our model is available on GitHub: https://github.com/ktyman2/ThickFilament”

    1. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Given the importance that these coupling mechanisms have been given in theory, this is a timely and important contribution to the literature in terms of determining whether these theoretical assumptions hold true in human data.

      Thank you!

      I did not follow the logic behind including spindle amplitude in the meta-analysis. This is not a measure of SO-spindle coupling (which is the focus of the review), unless the authors were restricting their analysis of the amplitude of coupled spindles only. It doesn't sound like this is the case though. The effect of spindle amplitude on memory consolidation has been reviewed in another recent meta-analysis (Kumral et al, 2023, Neuropsychologia). As this isn't a measure of coupling, it wasn't clear why this measure was included in the present meta-analysis. You could easily make the argument that other spindle measures (e.g., density, oscillatory frequency) could also have been included, but that seems to take away from the overall goal of the paper which was to assess coupling.

      Indeed, spindle amplitude refers to all spindle events rather than only coupled spindles. This choice was made because we recognized the challenge of obtaining relevant data from each study—only 4 out of the 23 included studies performed their analyses after separating coupled and uncoupled spindles. This inconsistency strengthens the urgency and importance of this meta-analysis to standardize the methods and measures used for future analysis on SO-SP coupling and beyond. We agree that focusing on the amplitude of coupled spindles would better reveal their relations with coupling, and we have discussed this limitation in the manuscript.

      Nevertheless, we believe including spindle amplitude in our study remains valuable, as it served several purposes. First, SO-SP coupling involves the modulation between spindle amplitude and slow oscillation phase. Different studies have reported conflicting conclusions regarding how overall spindle amplitude was related to coupling as an indicator of oscillation strength overnight: some found significant correlations (e.g., Baena et al., 2023), while others did not (e.g., Roebber et al., 2022). This discrepancy highlights an indirect but potentially crucial insight into the role of spindle amplitude in coupling dynamics. Second, in studies related to SO-SP coupling, spindle amplitude is one of the most frequently reported measures along with other coupling measures that significantly correlated with overnight memory improvements (e.g., Kurz et al., 2023; Ladenbauer et al., 2021; Niknazar et al., 2015), so we believe that including this measure can provide a more comprehensive review of the existing literature on SO-SP coupling. Third, incorporating spindle amplitude allows for a direct comparison between the measurement of coupling and individual events alone in their contribution to memory consolidation, a question that has been extensively explored in recent research (e.g., Hahn et al., 2020; Helfrich et al., 2019; Niethard et al., 2018; Weiner et al., 2023). Finally, spindle amplitude was identified as the most important moderator for memory consolidation in Kumral et al.'s (2023) meta-analysis. By including it in our analysis, we sought to replicate their findings within a broader framework and introduce conceptual overlaps with existing reviews. Therefore, although we were not able to selectively include coupled spindles, there is still a unique relation between spindle amplitude and SO-SP coupling that other spindle measures do not have.

      Originally, we also intended to include coupling density or counts in the analysis, which seems more relevant to the coupling metrics. However, the lack of uniformity in methods used to measure coupling density posed a significant limitation. We hope that our study will encourage consistent reporting of all relevant parameters in future research, allowing future meta-analyses to incorporate these measures comprehensively. We have added this discussion to the revised version of the manuscript (p. 3) to further clarify these points.

      All other citations were referenced in the manuscript.

      At the end of the first paragraph of section 3.1 (page 13), the authors suggest their results "... further emphasise the role of coupling compared to isolated oscillation events in memory consolidation". This had me wondering how many studies actually test this. For example, in a hierarchical regression model, would coupled spindles explain significantly more variance than uncoupled spindles? We already know that spindle activity, independent of whether they are coupled or not, predicts memory consolidation (e.g., Kumral meta-analysis). Is the variance in overnight memory consolidation fully explained by just the coupled events? If both overall spindle density and coupling measures show an equal association with consolidation, then we couldn't conclude that coupling compared to isolated events is more important.

      While primary coupling measurements, including coupling phase and strength, showed strong evidence for their associations with memory consolidation, measures of spindles, including spindle amplitude, only exhibited limited evidence (or “non-significant” effect) for their association with consolidation. These results are consistent with multiple empirical studies using different techniques (e.g., Hahn et al., 2020; Helfrich et al., 2019; Niethard et al., 2018; Weiner et al., 2023), which reported that coupling metrics are more robust predictors of consolidation and synaptic plasticity than spindle or slow oscillation metrics alone. However, we agree with the reviewer that we did not directly separate the effect between coupled and uncoupled spindles, and a more precise comparison would involve contrasting the “coupling of oscillation events” with “individual oscillation events” rather than coupling versus isolated events.

      We recognized that Kumral and colleagues’ meta-analysis reported a moderate association between spindle measures and memory consolidation (e.g., for spindle amplitude-memory association they reported an effect size of approximately r = 0.30). However, one of the advantages of our study is that we actively cooperated with the authors to obtain a large number of unreported and insignificant data relevant to our analysis, as well as separated data that were originally reported under mixed conditions. This approach decreases the risk of false positives and selective reporting of results, making the effect size more likely to approach the true value. In contrast, we found only a weak effect size of r = 0.07 with minimal evidence for spindle amplitude-memory relation. However, we agree with the reviewer that using a more conservative term in this context would be a better choice since we did not measure all relevant spindle metrics including the density.

      To improve clarity in our manuscript, we have revised the statement to: “Together with other studies included in the review, our results suggest a crucial role of coupling but did not support the role of spindle events alone in memory consolidation,” and provide relevant references (p. 13). We believe this can more accurately reflect our findings and the existing literature to address the reviewer’s concern.

      It was very interesting to see that the relationship between the fast spindle coupling phase and overnight consolidation was strongest in the frontal electrodes. Given this, I wonder why memory promoting fast spindles shows a centro-parietal topography? Surely it would be more adaptive for fast spindles to be maximally expressed in frontal sites. Would a participant who shows a more frontal topography of fast spindles have better overnight consolidation than someone with a more canonical centro-parietal topography? Similarly, slow spindles would then be perfectly suited for memory consolidation given their frontal distribution, yet they seem less important for memory.

      Regarding the topography of fast spindles and their relationship to memory consolidation, we agree this is an intriguing issue. We have already made significant progress on this topic in our ongoing work and have found evidence that participants with a more frontal topography of fast spindles show better overnight consolidation. These findings will be presented in our future publications. We share a few relevant observations: First, there are significant discrepancies in the definition of “slow spindle” in the field. Some studies defined slow spindles as 9-12 Hz (e.g., Mölle et al., 2011; Kurz et al., 2021), while others performed the event detection within a range of 11-13/14 Hz and found a frontal-dominated topography (e.g., Barakat et al., 2011; D'Atri et al., 2018). Compounding this issue, individual and age differences in spindle frequency are often overlooked, leading to challenges in reliably distinguishing between slow and fast spindles. Some studies have reported difficulty in clearly separating the two types of spindles altogether (e.g., Hahn et al., 2020). Moreover, a critical factor often ignored in past research is the propagating nature of both slow oscillations and spindles across the cortex, where spindles are coupled with significantly different phases of slow oscillations (see Figure 5). In addition, the frontal region has the strongest and most active SOs as its origin site, which may contribute to the role of frontal coupling. In contrast, not all SOs propagate from PFC to centro-parietal sites. The reviewer also raised an interesting idea that slow spindles would be perfectly suited for memory consolidation given their frontal distribution. We propose that one possible explanation is that if SOs couple exclusively with slow SPs, they may lose their ability to coordinate inter-area activity between centro-parietal and frontal regions, which could play a critical role in long-range memory transmission across hippocampus, thalamus, and prefrontal cortex. This hypothesis requires investigation in future studies. We believe a better understanding of coupling in the context of the propagation of these waves will help us better understand the observed frontal relationship with consolidation. Therefore, we believe this result supports our conclusion that coupling precision is more important than intensity, and we have addressed this in the revised manuscript (pp. 15-16).

      The authors rightly note the issues with multiple comparisons in sleep physiology and memory studies. Multiple comparison issues arise in two ways in this literature. First are comparisons across multiple electrodes (many studies now use high-density systems with 64+ channels). Second are multiple comparisons across different outcome variables (at least 3 ways to quantify coupling (phase, consistency, occurrence) x 2 spindle types (fast, slow)). Can the authors make some recommendations here in terms of how to move the field forward, as this issue has been raised numerous times before (e.g., Mantua 2018, Sleep; Cox & Fell 2020, Sleep Medicine Reviews for just a couple of examples). Should researchers just be focusing on the coupling phase? Or should researchers always report all three metrics of coupling, and correct for multiple comparisons? I think the use of pre-registration would be beneficial here, and perhaps could be noted by the authors in the final paragraph of section 3.5, where they discuss open research practices.

      There are indeed multiple methods that we can discuss, including cluster-based and non-parametric methods, to correct for multiple comparisons in EEG data with spatiotemporal structures. In addition, encouraging the reporting of all tested but insignificant results, at least in supplementary materials, is an important practice that helps readers understand the findings with reduced bias. We agree with the reviewer’s suggestions and have added more information in sections 3.4-3.5 (p. 17) to advocate for a standardized “template” used to report effect sizes and correct multiple comparisons in future research.

      We advocate for the standardization of reporting all three coupling metrics: phase, strength, and prevalence (density, count, and/or percentage coupled). Each coupling metric captures a distinct property of the coupling process and may interact with the others (Weiner et al., 2023). Therefore, we believe it is essential to report all three metrics to comprehensively explore their different roles in the “how, what, and where” of long-distance communication and consolidation of memory. As we advance toward a deeper understanding of the relationship between memory and sleep, we hope this work establishes a standard for the standardization, transparency, and replication of relevant studies.
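
      As an illustration of correcting across outcome variables, the sketch below applies a Holm step-down adjustment to a family of six tests (3 coupling metrics x 2 spindle types). Holm's procedure is used here only because it is simple to show; it is a stand-in, not the cluster-based or non-parametric methods mentioned above, and the p-values and the `holm_adjust` helper are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order."""
    m = len(p_values)
    # Indices of the tests sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values for 3 coupling metrics x 2 spindle types (not real data).
tests = {
    ("phase", "fast"): 0.004,
    ("phase", "slow"): 0.300,
    ("strength", "fast"): 0.020,
    ("strength", "slow"): 0.450,
    ("prevalence", "fast"): 0.030,
    ("prevalence", "slow"): 0.600,
}
adjusted = dict(zip(tests, holm_adjust(list(tests.values()))))
for key, p_adj in adjusted.items():
    print(key, round(p_adj, 3))
```

      Reporting the full table of adjusted values, including the non-significant ones, is in line with the reporting practice advocated above.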

      Reviewer #2 (Public review):

      Regarding the Moderator of Age: Although the authors discuss the limited studies on the analysis of children and elders regarding age as a moderator, the figure shows a significant gap between the ages of 40 and 60. Furthermore, there are only a few studies involving participants over the age of 60. Given the wide distribution of effect sizes from studies with participants younger than 40, did the authors test whether removing studies involving participants over 60 would still reveal a moderator effect?

      We agree that there is an age gap between younger and older adults, as current studies often focus on contrasting newly matured and fully aged populations to amplify the effect, while neglecting the gradual changes in memory consolidation mechanisms across the aging spectrum. We suggest that a non-linear analysis of age effects would be highly valuable, particularly when additional child and older adult data become available.

      In response to the reviewer’s suggestion, we re-tested the moderation effect of age after excluding effect sizes from older adults. The results revealed a decrease in the strength of evidence for phase-memory association due to increased variability, but were consistent for all other coupling parameters. The mean estimations also remained consistent (coupling phase-memory relation: -0.005 [-0.013, 0.004], BF10 = 5.51, the strength of evidence reduced from strong to moderate; coupling strength-memory relation: -0.005 [-0.015, 0.008], BF10 = 4.05, the strength of evidence remained moderate). These findings align with prior research, which typically observed a weak coupling-memory relationship in older adults during aging (Ladenbauer et al., 2021; Weiner et al., 2023) but not during development (Hahn et al., 2020; Kurz et al., 2021; Kurz et al., 2023). Therefore, this result is not surprising to us, and there are still observable moderate patterns in the data. We have reported these additional results in the revised manuscript (pp. 6, 11), and interpret “the moderator effect of age in the phase-memory association becomes less pronounced during development after excluding the older adult data”. We believe the original findings including the older adult group remain meaningful after cautious interpretation, given that the older adult data were derived from multiple studies and different groups, and they represent the aging effects.

      Reviewer #3 (Public review):

      First, the authors conclude that "SO-SP coupling should be considered as a general physiological mechanism for memory consolidation". However, the reported effect sizes are smaller than what is typically considered a "small effect".

      While we acknowledge the concern about the small effect sizes reported in our study, it is important to contextualize these findings within the field of neuroscience, particularly memory research. Even in individual studies, small effect sizes are not uncommon due to the inherent complexity of the mechanisms involved and the multitude of confounding variables. This is an important factor to be considered in meta-analyses where we synthesize data from diverse populations and experimental conditions. For example, the relationship between SO-slow SP coupling and memory consolidation in older adults is expected to be insignificant.

      As Funder and Ozer (2019) concluded in their highly cited paper, an effect size of r = 0.3 in psychological and related fields should be considered large, with r = 0.4 or greater likely representing an overestimation and rarely found in a large sample or a replication. Therefore, we believe r = 0.1 should not be considered as a lower bound of the small effect. Bakker et al. (2019) also advocate for a contextual interpretation of the effect size. This is particularly important in meta-analyses, where the results are less prone to overestimation compared to individual studies, and we cooperated with all authors to include a large number of unreported and insignificant results. In this context, small correlations may contain substantial meaningful information to interpret. Although we agree that effect sizes reported in our study are indeed small at the overall level, they reflect a rigorous analysis that incorporates robust evidence across different levels of moderators. Our moderator analyses underscore the dynamic nature of coupling-memory relationships, with stronger associations observed in moderator subgroups that have historically exhibited better memory performance, particularly after excluding slow spindles and older adults. For example, both the coupling phase and strength of frontal fast spindles with slow oscillations exhibited "moderate-to-large" correlations with the consolidation of different types of memory, especially in young adults, with r values ranging from 0.18 to 0.32 (see Tables S9.1-9.4). We have included discussion about the influence of moderators and hierarchical structures on the dynamics of coupling-memory associations (pp. 17, 20). In addition, we have updated the conclusion to be “SO-fast SP coupling should be considered as a general physiological mechanism for memory consolidation” (p. 1).

      Second, the study implements state-of-the-art Bayesian statistics. While some might see this as a strength, I would argue that it is the greatest weakness of the manuscript. A classical meta-analysis is relatively easy to understand, even for readers with only a limited background in statistics. A Bayesian analysis, on the other hand, introduces a number of subjective choices that render it much less transparent.

      This kind of analysis seems not to be made to be intelligible to the average reader. It follows a recent trend of using more and more opaque methods. Where we had to trust published results a decade ago because the data were not openly available, today we must trust the results because the methods can no longer be understood with reasonable effort.

      This becomes obvious in the forest plots. It is not immediately apparent to the reader how the distributions for each study represent the reported effect sizes (gray dots). Presumably, they depend on the Bayesian priors used for the analysis. The use of these priors makes the analyses unnecessarily opaque, eventually leading the reader to question how much of the findings depend on subjective analysis choices (which might be answered by an additional analysis in the supplementary information).

      We appreciate the reviewer for sharing this viewpoint and we value the opportunity to clarify some key points. To address the concern about clarity, we have included more details in the methods section explaining how to interpret Bayesian statistics including priors, posteriors, and Bayes factors, making our results more accessible to those less familiar with this approach.

      On the use of Bayesian models, we believe there may have been a misunderstanding. Bayesian methods, far from being "opaque" or overly complex, are increasingly valued for their ability to provide nuanced, accurate, and transparent inferences (Sutton & Abrams, 2001; Hackenberger, 2020; van de Schoot et al., 2021; Smith et al., 1995; Kruschke & Liddell, 2018). They had been applied in more than 1,200 meta-analyses as of 2020 (Hackenberger, 2020). In our study, we used priors that assume no effect (mean set to 0, which aligns with the null) while allowing for a wide range of variation to account for large uncertainties. This approach reduces the risk of overestimation or false positives and demonstrates much-improved performance over traditional methods in handling variability (Williams et al., 2018; Kruschke & Liddell, 2018). In addition, priors can also increase transparency, since all assumptions are formally encoded and open to critique or sensitivity analysis. In contrast, frequentist methods often rely on hidden or implicit assumptions such as homogeneity of variance, fixed-effects models, and independence of observations that are not directly testable. Sensitivity analyses reported in the supplemental material (Tables S9.1-9.4) confirmed the robustness of our choices of priors: our results did not vary when different priors were set.
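
      To make the idea of a prior sensitivity check concrete, the following minimal sketch uses a conjugate normal model with a zero-centred prior and known standard errors. It is a deliberate simplification (a common-effect model solved in closed form), not the hierarchical model fitted in Stan for the manuscript, and the effect sizes, standard errors, and the `posterior_normal` helper are hypothetical.

```python
def posterior_normal(effects, ses, prior_mean=0.0, prior_sd=1.0):
    """Posterior mean and SD of a common effect under a normal prior."""
    prior_prec = 1.0 / prior_sd**2
    # Each study contributes precision 1 / se^2 (standard errors assumed known).
    data_prec = sum(1.0 / s**2 for s in ses)
    post_prec = prior_prec + data_prec
    weighted_sum = prior_mean * prior_prec + sum(
        y / s**2 for y, s in zip(effects, ses)
    )
    return weighted_sum / post_prec, post_prec**-0.5


# Hypothetical study-level effect sizes and standard errors (not real data).
effects = [0.12, 0.05, 0.20, -0.03]
ses = [0.10, 0.08, 0.15, 0.12]

# Refit under progressively wider zero-centred priors and compare estimates.
sensitivity = {sd: posterior_normal(effects, ses, prior_sd=sd) for sd in (0.5, 1.0, 10.0)}
for sd, (mean, spread) in sensitivity.items():
    print(f"prior SD {sd}: posterior mean {mean:.4f}, posterior SD {spread:.4f}")
```

      When the posterior estimates barely move across such a range of priors, as in this toy example, the conclusions can be regarded as driven by the data rather than by the prior.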

      As Kruschke and Liddell (2018) described, “shrinkage (pulling extreme estimates closer to group averages) helps prevent false alarms caused by random conspiracies of rogue outlying data,” a well-known advantage of Bayesian over traditional approaches. This explains the observed differences between the distributions and grey dots in the forest plots, which is an advantage of Bayesian models in handling heterogeneity. Unlike p-values, which can be overestimated with a large sample size and underestimated with a small sample size, Bayesian methods make assumptions explicit, enabling others to challenge or refine them, an approach aligned with open science principles (van de Schoot et al., 2021). For example, a credible interval in a Bayesian model can be interpreted as “there is a 95% probability that the parameter lies within the interval,” while a confidence interval in a frequentist model means “in repeated experiments, 95% of the confidence intervals will contain the true value.” We believe the former is much more straightforward and convincing for readers to interpret. We will ensure our justification for using Bayesian models is more clearly presented in the manuscript (pp. 21-23).
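
      The shrinkage behaviour and the reading of a credible interval can be sketched with a simple normal-normal calculation. This is an illustration under strong assumptions: the group mean `mu` and between-study SD `tau` are treated as fixed and known, whereas a full hierarchical model estimates them jointly. All numbers and helper names are hypothetical.

```python
from statistics import NormalDist


def shrink(y, se, mu, tau):
    """Posterior mean and SD of one study's effect, given group mean and SD."""
    # Weight on the study's own estimate: near 1 if precise, near 0 if noisy.
    w = tau**2 / (tau**2 + se**2)
    post_mean = w * y + (1.0 - w) * mu
    post_sd = (1.0 / (1.0 / tau**2 + 1.0 / se**2)) ** 0.5
    return post_mean, post_sd


def credible_interval(mean, sd, level=0.95):
    """Central credible interval of a normal posterior."""
    dist = NormalDist(mean, sd)
    tail = (1.0 - level) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


mu, tau = 0.07, 0.10  # assumed group-level mean and between-study SD

# A noisy, extreme study is pulled strongly toward the group mean ...
noisy_mean, noisy_sd = shrink(y=0.60, se=0.30, mu=mu, tau=tau)
# ... while a precise study stays close to its reported value.
precise_mean, precise_sd = shrink(y=0.10, se=0.05, mu=mu, tau=tau)

low, high = credible_interval(noisy_mean, noisy_sd)
print(f"noisy study: 0.60 -> {noisy_mean:.3f}, 95% CrI [{low:.3f}, {high:.3f}]")
print(f"precise study: 0.10 -> {precise_mean:.3f}")
```

      In a forest plot, the reported value (grey dot) corresponds to `y`, and the plotted distribution corresponds to the shrunken posterior, which is why the two need not coincide.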

      We acknowledge that even with these justifications, different researchers may still differ in their preferences for Bayesian and frequentist models. To further support transparent reporting, we have also reported the traditional frequentist meta-analysis results in Supplemental Material 10 to justify the robustness of our analysis, which suggested non-significant differences between Bayesian and frequentist models. We have included clearer references in the updated version of the manuscript to direct readers to the figures that report the statistics provided by traditional models.

      However, most of the methods are not described in sufficient detail for the reader to understand the proceedings. It might be evident for an expert in Bayesian statistics what a "prior sensitivity test" and a "posterior predictive check" are, but I suppose most readers would wish for a more detailed description. However, using a "Markov chain Monte Carlo (MCMC) method with the no-U-turn Hamiltonian Monte Carlo (HMC) sampler" and checking its convergence "through graphical posterior predictive checks, trace plots, and the Gelman and Rubin Diagnostic", which should then result in something resembling "a uniformly undulating wave with high overlap between chains" is surely something only rocket scientists understand. Whether this was done correctly in the present study cannot be ascertained because it is only mentioned in the methods and no corresponding results are provided. 

      We appreciate the reviewer’s concerns about accessibility and potential complexity in our descriptions of Bayesian methods. Our decision to provide a detailed account serves to enhance transparency and guide readers interested in replicating our study. We acknowledge that some terms may initially seem overwhelming. These steps, such as checking the MCMC chain convergence and robustness checks, are standard practices in Bayesian research and are analogous to “linearity”, “normality” and “equal variance” checks in frequentist analysis. In addition, Hamiltonian Monte Carlo (HMC) is the default algorithm Stan (the software we used to fit Bayesian models) uses to sample from the posterior distribution in Bayesian models. It is a type of MCMC method designed to be faster and more efficient than traditional sampling algorithms, especially for complex or high-dimensional models. We have added exemplary plots in the supplemental material S4.1-4.3 and the method section (pp. 21-22) to explain the results and interpretation of these convergence checks. We hope this will help address any concerns about methodological rigor.
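
      For readers unfamiliar with the Gelman and Rubin diagnostic, the sketch below computes the classic (non-split) version of the statistic from simulated chains. It is an illustration only: the chains are synthetic draws rather than output from our models, `gelman_rubin` is a hypothetical helper, and software such as Stan typically reports refined variants of this statistic.

```python
import random
import statistics


def gelman_rubin(chains):
    """Classic (non-split) R-hat for equal-length chains of one parameter."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    # W: average within-chain variance; B: between-chain variance scaled by n.
    w = statistics.fmean([statistics.variance(c) for c in chains])
    b = n * statistics.variance(chain_means)
    var_hat = (n - 1) / n * w + b / n
    return (var_hat / w) ** 0.5


rng = random.Random(0)
# Four hypothetical chains sampling the same distribution (well mixed).
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Four hypothetical chains stuck around different values (poorly mixed).
stuck = [[rng.gauss(m, 1.0) for _ in range(1000)] for m in (0.0, 0.0, 3.0, 3.0)]

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(round(rhat_mixed, 3), round(rhat_stuck, 3))
```

      Values close to 1 indicate that the chains agree with one another, which is the numerical counterpart of trace plots that overlap; values clearly above 1 signal that the chains have not converged to the same distribution.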

      In one point the method might not be sufficiently justified. The method used to transform circular-linear r (actually, all references cited by the authors for circular statistics use r² because there can be no negative values) into "Z_r", seems partially plausible and might be correct under the H0. However, Figure 12.3 seems to show that under the alternative Hypothesis H1, the assumptions are not accurate (peak Z_r=~0.70 for r=0.65). I am therefore, based on the presented evidence, unsure whether this transformation is valid. Also, saying that Z_r=-1 represents the null hypothesis and Z_r=1 the alternative hypothesis can be misinterpreted, since Z_r=0 also represents the null hypothesis and is not half way between H0 and H1.

      First, we realized that in the titles of Figures 12.2 and 12.3, “true r = 0.35” and “true r = 0.65” should be corrected to “true r_z” (note that we use r_z instead of Z_r in the revised manuscript per your suggestion). The method we used here is to first generate an underlying population that has a null (0), moderate (0.35), or large (0.65) r_z correlation, then test whether the sampling distribution drawn from each population follows a normal distribution across varying sample sizes. Nevertheless, the reviewer correctly noticed discrepancies between the reported true r_z and its sampling distribution peak. This discrepancy arises because, when generating large population data, achieving exact values close to a strong correlation like r_z = 0.65 is unlikely. We loop through simulations to generate population data and ensure their r_z values fall within a threshold. For moderate effect sizes (e.g., r_z = 0.35), this is straightforward using a narrow range (0.34 < r_z < 0.35). However, for larger effect sizes like r_z = 0.65, a wider range (0.6 < r_z < 0.7) is required; therefore, the population from which we drew the sample sometimes has an r_z that deviates slightly from 0.65. This remains reasonable since the main point of this analysis is to ensure that a large r_z still has a normal sampling distribution, rather than to achieve exactly r_z = 0.65.

      We acknowledge that this variability of the range used was not clearly explained in supplemental material 12 and it is not accurate to report “true r_z = 0.65”. In the revised version, we have addressed this issue by adding vertical lines to each subplot to indicate the r_z of the population we used to draw samples, making it easier to check if it aligns with the sampling peak. In addition, we have revised the title to “Sampling distributions of r_z drawn from strong correlations (r_z = 0.6-0.7)”. We confirmed that population r_z and the peak of their sampling distribution remain consistent under both H0 and H1 in all sample sizes with n > 25, and we hope this explanation can fully resolve your concern.

      We agree with the reviewer that claiming r_z = -1 represents the null hypothesis is not accurate. The circlin r_z = 0 is more closely analogous to Pearson’s r = 0, since both represent the mean drawn from the population under the null hypothesis. In contrast, the mean effect size under the null will be positive in the raw circlin r, which is one of the important reasons for the transformation. To provide a more accurate interpretation, we updated Table 6 to describe the following strength levels of evidence: no effect (r < 0), null (r = 0), small (r = 0.1), moderate (r = 0.3), and large (r = 0.5). We thank the reviewer again for their valuable feedback.
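
      The positive bias of the raw circular-linear correlation under the null can be illustrated with a small simulation. The sketch below uses the standard circular-linear correlation built from Pearson correlations with the sine and cosine of the angle; it does not implement the r_z transformation described in the manuscript, and the sample size, number of simulations, and helper names are hypothetical choices.

```python
import math
import random


def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def circlin_r(theta, x):
    """Circular-linear correlation between angles theta (radians) and x."""
    c = [math.cos(t) for t in theta]
    s = [math.sin(t) for t in theta]
    r_xc, r_xs, r_cs = pearson(x, c), pearson(x, s), pearson(c, s)
    r2 = (r_xc**2 + r_xs**2 - 2 * r_xc * r_xs * r_cs) / (1 - r_cs**2)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(r2, 0.0))


rng = random.Random(1)
n, n_sim = 30, 2000
null_rs = []
for _ in range(n_sim):
    # Phase and score are drawn independently, so the true association is zero.
    theta = [rng.uniform(0.0, 2 * math.pi) for _ in range(n)]
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    null_rs.append(circlin_r(theta, x))

mean_null_r = sum(null_rs) / n_sim
print(round(mean_null_r, 3))
```

      Because the statistic cannot be negative, its average under the null sits above zero even though no association exists, which is the bias a transformation centred on zero is meant to remove.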

      Reviewer #2 (Recommendations for the authors):

      (1) There is an extra space in the Notes of Figure 1. "SW R sharp-wave ripple.".

      We thank the reviewer for pointing this out. We have confirmed that the "extra space" is not an actual error but a result of how italicized Times New Roman font is rendered in the LaTeX format. We believe that the journal’s formatting process will resolve this issue.

      (2) In the introduction, slow oscillations (SO) are defined with a frequency of 0.16-4 Hz, sleep spindles (SP) at 8-16 Hz, and sharp-wave ripples (SWR) at 80-300 Hz. The term "fast oscillation" (FO) is first introduced with the clarification "SPs in our case." However, on page 2, the authors state, "SO-FO coupling involving SWRs, SPs, and SOs..." There seems to be a discrepancy in the definition of FO; does it consistently refer to SPs and SWRs throughout the article?

      We appreciate the reviewer’s observation regarding the potential ambiguity of the term "FO." In our manuscript, "FO" is used as a general term to describe the interaction of a "relatively faster oscillation" with a "relatively slower oscillation" in the phase-amplitude coupling mechanism; it is therefore not intended to refer exclusively to SPs or SWRs. For example, it usually describes SO–SP–SWR couplings in sleep memory studies, but Theta–Alpha–Gamma couplings in wakeful memory studies. To address this confusion, we removed the phrase "SPs in our case" and explicitly use "SPs" when referring to spindles. In addition, we have replaced "fast oscillation" with "faster oscillation" to emphasize that it is used in a relative sense (p. 1), rather than to refer to a specific oscillation. We also retained the term “FO” only when introducing the PAC mechanism.

      (3) On page 2, the first paragraph contains the phrase: "...which occur in the precise hierarchical temporal structure of SO-FO coupling involving SWRs, SPs, and SOs ..." Since "SO-FO" refers to slow and fast oscillations, it is better to maintain the order of frequencies, suggesting it as: SOs, SPs, and SWRs.

      We sincerely thank the reviewer for their valuable suggestion. We have updated the sentence to maintain the correct order from the lowest to the highest frequencies in the revised version (p. 2).

      (4) References should be provided:

      a. "Studies using calcium imaging after SP stimulation explained the significance of the precise coupling phase for synaptic plasticity."

      b. "Electrophysiology evidence indicates that the association between memory consolidation and SO-SP coupling is influenced by a variety of behavioral and physiological factors under different conditions."

      c. "Since some studies found that fast SPs predominate in the centroparietal region, while slow SPs are more common in the frontal region, a significant amount of studies only extracted specific types of SPs from limited electrodes. Some studies even averaged all electrodes to estimate coupling..."

      This is a great point. These have been referenced as follows:

      a. Rephrased: “Studies using calcium imaging and SP stimulation explained the significance of the precise coupling phase for synaptic plasticity.” We changed “after” to “and” to reflect that these were conducted as two separate experiments. This is a summary statement, with relevant citations provided in the following two sentences of the paragraph, including Niethard et al., 2018, and Rosanova et al., 2005. (p. 2)

      b. Included diverse sources of evidence: “Electrophysiology evidence from studies included in our meta-analysis (e.g. Denis et al., 2021; Hahn et al., 2020; Mylonas et al., 2020) and others (e.g. Bartsch et al., 2019; Muehlroth et al., 2019; Rodheim et al., 2023) reported that the association between memory consolidation and SO-SP coupling is influenced by a variety of behavioral and physiological factors under different conditions.” (p. 3)

      c. Added references and more details: “Since some studies found that fast SPs predominate in the centroparietal region, while slow SPs are more common in the frontal region, a significant amount of studies selectively extracted specific types of SPs from limited electrodes (e.g. Dehnavi et al., 2021; Perrault et al., 2019; Schreiner et al., 2021). Some studies even averaged all electrodes in their spectral and/or time-series analysis to estimate metrics of oscillations and their couplings (e.g. Denis et al., 2022; Mölle et al., 2011; Nicolas et al., 2022).” (p. 4)

      Reviewer #3 (Recommendations for the authors):

      There are a number of terms that are not clearly defined or used:

      (1) SP amplitude. Does this mean only the amplitude of coupled spindles or of spindles in general?

      This refers to the amplitude of spindles in general. We clarified this in the revised text (and see response to reviewer #1, point #1).

      (2) The definition of a small effect

      We thank the reviewer again for raising this important question. As we responded in the public review, small effect sizes are common in neuroscience and meta-analyses due to the complexity of the underlying mechanisms and the presence of numerous confounding variables and hierarchical levels. To help readers better interpret effect sizes, we changed rigid ranges to widely accepted benchmarks for effect size levels in neuroscience research: small (r=0.1), moderate (r=0.3), and large (r=0.5; Cohen, 1988). We also noted that an evidence and context-based framework will provide a more practical way to interpret the observed effect sizes compared to rigid categorizations.

      (3) Can a BF10 based on experimental evidence actually be "infinite" and a probability actually be 1.00?

      We appreciate the reviewer for highlighting this potential confusion. The formula used to calculate BF10 is P(data | H1) / P(data | H0). In the experimental setting with an informative prior, an ‘infinite’ BF10 value indicates that all posterior samples are overwhelmingly compatible with H1 given the data and assumptions (Cox et al., 2023; Heck et al., 2023; Ly et al., 2016). In such cases, the denominator P(data | H0) becomes vanishingly small, leading BF10 to converge to infinity. This scenario occurs when the probability of H1 converges to 1 (e.g., 0.9999999999…).

      It is a well-established convention in Bayesian statistics to report the Bayes factor as "infinity" when the evidence is overwhelmingly strong and BF10 exceeds the numerical limits of the computational tools, becoming effectively infinite. To address this ambiguity, we added a footnote in the revised version of the manuscript to clarify the interpretation of an 'infinite' BF10 (p. 8).

      (4) Z_r should be renamed to r_z or similar. These are not Z values (-inf..+inf), but r values (-1..1).

      We thank the reviewers for their suggestions. We agree that r_z would provide a clearer and more accurate interpretation, while z is more appropriate for referring to Fisher's z-transformed r (see point (5)). We have updated the notation accordingly.

      (5) Also, it remains quite unclear at which points in the analyses, "r" values or "Fisher's z transformed r" values are used. Assumptions of normality should only apply to the transformed values. However, the formulas for the random effects model seem to assume normality for r values.

      The correlation values were z-transformed during preprocessing to ensure normality and the correct estimation of sampling variances before running the models. The outputs were then back-transformed to raw r values only when reporting the results to help readers interpret the effect size. We mentioned this in Section 5.5.1, therefore the normality assumptions are not a concern. We have updated the notation r to z (-inf..+inf) in the formula of the random and mixed effect models in the revised version of the manuscript (p. 22).
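
      As an illustration of the transform-then-back-transform workflow described in the response to point (5), a minimal sketch follows. It is not the authors' code: the function names are hypothetical, the pooling shown is a simple fixed-effect inverse-variance average rather than the random- and mixed-effects models used in the manuscript, and the sampling variance 1/(n - 3) is the standard approximation for Pearson correlations, which may not carry over unchanged to circular-linear correlations.

```python
import math


def fisher_z(r):
    """Map a correlation r in (-1, 1) to Fisher's z in (-inf, +inf)."""
    return math.atanh(r)


def inverse_fisher_z(z):
    """Back-transform Fisher's z to the correlation scale."""
    return math.tanh(z)


def pooled_r(rs, ns):
    """Pool correlations on the z scale, then report the result as r.

    Assumption: fixed-effect weights w = n - 3, i.e. the inverse of the
    approximate sampling variance 1 / (n - 3) of Fisher's z.
    """
    zs = [fisher_z(r) for r in rs]
    weights = [n - 3 for n in ns]
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    return inverse_fisher_z(z_bar)


# Hypothetical study-level correlations and sample sizes.
print(pooled_r([0.10, 0.30, 0.50], [20, 35, 50]))
```

      Averaging and variance estimation happen only on the z scale; the r scale is used solely for reporting, which is the separation the response describes.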

      Language

      (1) Frequency. In the introduction, the authors use "frequency" when they mean something like the incidence of spindles.

      We agree that the term "frequency" has been used inconsistently to describe both the incidence of events and the frequency bands of oscillations. We have replaced "frequency" with "prevalence" to refer to the incidence of coupling events where applicable (p. 3).

      (2) Moderate and mediate. These two terms are usually meant to indicate two different types of causal influences.

      We thank the reviewer for this suggestion. We agree that "moderate" is more appropriate to describe moderators in this study since it does not directly imply causality. We have replaced "mediate" with "moderate" in the relevant contexts.

      (3) "the moderate effect of memory task is relatively weak": "moderator effect" or "moderate effect"?

      We appreciate the reviewer for pointing out this mistake. We have updated the term to "moderator effect" in Section 2.2.2 (p. 6).

      (4) "in frontal regions we found a latest coupled but most precise and strong SO-fast SP coupling" Meaning?

      We thank the reviewer for bringing this concern of clarity to our attention. By 'latest,' we refer to the delayed phase of SO-fast SP coupling observed in the frontal regions compared to the central and parietal regions (see Figure 5). "Precise and strong" describes the high precision and strength of phase-locking between the SO up-state and the fast SP peak in these regions. To improve clarity, we have rephrased this sentence as: “We found that SO-fast SP coupling in the frontal region occurred at the latest phase observed across all regions, characterized by the highest precision and strength of phase-locking.” (p. 9).

      (5) Figure 5 and others contain angles in degrees and radians.

      We appreciate the reviewer pointing out this inconsistency. We have updated the manuscript and supplementary material to consistently use radians throughout.

    1. Reviewer #2 (Public review):

      The revised manuscript by Altan et al. includes some real improvements to the visualizations and explanations of the authors' thesis statement with respect to fMRI measurements of pRF sizes. In particular, the deposition of the paper's data has allowed me to probe and refine several of my previous concerns. While I still have major concerns about how the data are presented in the current draft of the manuscript, my skepticism about data quality overall has been much alleviated. Note that this review focuses almost exclusively on the fMRI data as I was satisfied with the quality of the psychophysical data and analyses in my previous review.

      Major Concerns

      (I) Statistical Analysis

      In my previous review, I raised the concern that the small sample size combined with the noisiness of the fMRI data, a lack of clarity about some of the statistics, and a lack of code/data likely combine to make this paper difficult or impossible to reproduce as it stands. The authors have since addressed several aspects of this concern, most importantly by depositing their data. However their response leaves some major questions, which I detail below.

      First of all, the authors claim in their response to the previous review that the small sample size is not an issue because large samples are not necessary to obtain "conclusive" results. They are, of course, technically correct that a small sample size can yield significant results, but the response misses the point entirely. In fact, small samples are more likely than large samples to erroneously yield a significant result (Button et al., 2013, DOI:10.1038/nrn3475), especially when noise is high. The response by the authors cites Schwarzkopf & Huang (2024) to support their methods on this front. After reading the paper, I fail to see how it is at all relevant to the manuscript at hand or the criticism raised in the previous review. Schwarzkopf & Huang propose a statistical framework that is narrowly tailored to situations where one is already certain that some phenomenon (like the adaptation of pRF size to spatial frequency) either always occurs or never occurs. Such a framework is invalid if one cannot be certain that, for example, pRF size adapts in 98% of people but not the remaining 2%. Even if the paper were relevant to the current study, the authors don't cite this paper, use its framework, or admit the assumptions it requires in the current manuscript. The observation that a small dataset can theoretically lead to significance under a set of assumptions not appropriate for the current manuscript is not a serious response to the concern that this manuscript may not be reproducible.

      To overcome this concern, the authors should provide clear descriptions of their statistical analyses and explanations of why these analyses are appropriate for the data. Ideally, source code should be published that demonstrates how the statistical tests were run on the published data. (I was unable to find any such source code in the OSF repository.) If the effects in the paper were much stronger, this level of rigor might not be strictly necessary, but the data currently give the impression of being right near the boundary of significance, and the manuscript's analyses need to reflect that. The descriptions in the text were helpful, but I was only able to approximately reproduce the authors' analyses based on these descriptions alone. Specifically, I attempted to reproduce the Mood's median tests described in the second paragraph of section 3.2 after filtering the data based on the criteria described in the final paragraph of section 3.1. I found that 7/8 (V1), 7/8 (V2), 5/8 (V3), 5/8 (V4), and 4/8 (V3A) subjects passed the median test when accounting for the (40) multiple comparisons. These results are reasonably close to those reported in the manuscript and might just differ based on the multiple comparisons strategy used (which I did not find documented in the manuscript). However, Mood's median test does not test the direction of the difference, only whether the medians are different, so I additionally required that the median sigma of the high-adapted pRFs be greater than that of the low-adapted pRFs. Surprisingly, in V1 and V3, one subject each (not the same subject) failed this part of the test, meaning that they had significant differences between conditions but in the wrong direction. This leaves 6/8 (V1), 7/8 (V2), 4/8 (V3), 5/8 (V4), and 4/8 (V3A) subjects that appear to support the authors' conclusions.
As the authors mention, however, this set of analyses runs the risk of comparing different parts of cortex, so I also performed Wilcoxon signed-rank tests on the (paired) vertex data for which both the high-adapted and low-adapted conditions passed all the authors' stated thresholds. These results largely agreed with the median test (only 5/8 subjects significant in V1 but 6/8 in V3A, other areas the same, though the two tests did not always agree which subjects had significant differences). These analyses were of course performed by a reviewer with a reviewer's time commitment to the project and shouldn't be considered a replacement for the authors' expertise with their own data. If the authors think that I have made a mistake in these calculations, then the best way to refute them would be to publish the source code they used to threshold the data and to perform the same tests.
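
      The two-part criterion described above (a significant Mood's median test plus a check on the direction of the difference) can be sketched as follows. This is an illustration only, not the code used for the review or by the authors: the function names are hypothetical, the test uses a plain Pearson chi-square without continuity correction, and the Bonferroni threshold over 40 comparisons is an assumption, since the correction strategy is not documented.

```python
import math
import statistics


def moods_median_test(high, low):
    """Two-sample Mood's median test; returns (chi2, p) with 1 degree of freedom."""
    grand_median = statistics.median(list(high) + list(low))
    # 2x2 table: counts above / not above the grand median per condition.
    a = sum(v > grand_median for v in high)
    b = sum(v > grand_median for v in low)
    c = len(high) - a
    d = len(low) - b
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # degenerate table: no evidence of a difference
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 df.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


def supports_hypothesis(high, low, alpha=0.05, n_comparisons=40):
    """True only if the medians differ significantly AND high > low."""
    _, p = moods_median_test(high, low)
    significant = p < alpha / n_comparisons  # assumed Bonferroni correction
    right_direction = statistics.median(high) > statistics.median(low)
    return significant and right_direction
```

      A subject whose medians differ significantly but with the low-adapted median larger would pass `moods_median_test` yet fail `supports_hypothesis`, which is the situation reported for one subject each in V1 and V3.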

      Setting aside the precise values of the relevant tests, we should also consider whether 5 of 8 subjects showing a significant effect (as they report for V3, for example) should count as significant evidence of the effect. If one assumes, as a null hypothesis, that there is no difference between the two conditions in V3 and that all differences are purely noise, then a binomial test across subjects would be appropriate. Even if 6 of 8 subjects show the effect, however (and ignoring multiple comparisons), the p-value of a one-sided binomial test is not significant at the 0.05 level (7 of 8 subjects is barely significant). Of course, a more rigorous way to approach this question could be something like an ANOVA, and the authors use an ANOVA analysis of the medians in the paragraph following their use of Mood's median test. However, ANOVA assumes normality, and the authors state in the previous paragraph that they employed Mood's median test because "the distribution of the pRF sizes is zero-bounded and highly skewed" so this choice does not make sense. The Central Limit Theorem might be applied to the medians in theory, but with only 8 subjects and with an underlying distribution of pRF sizes that is non-negative, the relevant data will almost certainly not be normally distributed. These tests should probably be something like a Kruskal-Wallis ANOVA on ranks.
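
      The binomial argument above can be checked with a few lines of arithmetic. The sketch below assumes a per-subject chance probability of 0.5 under the null (an assumption made here for illustration; the review does not state the value), and the function name is hypothetical.

```python
from math import comb


def binomial_p_one_sided(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the one-sided binomial test p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# One-sided p-values for 5 to 8 of 8 subjects showing the effect.
for k in range(5, 9):
    print(k, binomial_p_one_sided(k, 8))
```

      Under this assumption, 6 of 8 gives 37/256 (about 0.145) and 7 of 8 gives 9/256 (about 0.035), which matches the statement that 6 of 8 is not significant at the 0.05 level while 7 of 8 barely is.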

      All of the above said, my intuition about the data is currently that there are significant changes to the adapted pRF size in V2. I am not currently convinced that the effects in other visual areas are significant, and I suspect that the paper would be improved if authors abandoned their claims that areas other than V2 show a substantial effect. Importantly, I don't think this causes the paper to lose any impact; in fact, if the authors agree with my assessments, then the paper might be improved by focusing on V2. Specifically, the authors already discuss psychophysical work related to the perception of texture on pages 18 and 19 and link it to their results. V2 is also implicated in the perception of texture (see, for example, Freeman et al., 2013; DOI:10.1038/nn.3402; Ziemba et al., 2016, DOI:10.1073/pnas.1510847113; Ziemba et al., 2019; DOI:10.1523/JNEUROSCI.1743-19.2019) and so would naturally be the part of the visual cortex where one might predict that spatial frequency adaptation would have a strong effect on pRF size. This neatly connects the psychophysical and imaging sides of this project and could make a very nice story out of the present work.

      (II) Visualizations

      The manuscript's visual evidence regarding the pRF data also remains fairly weak (but I found the pRF size comparisons in the OSF repository and Figure S1 to be better evidence; more in the next paragraph). The first line of the Results section still states, "A visual inspection on the pRF size maps in Figure 4c clearly shows a difference between the two conditions, which is evident in all regions." As I mentioned in my previous review, I don't agree with this claim (specifically, that it is clear). My impression when I look at these plots is of similarity between the maps, and, where there is dissimilarity, of likely artifacts. For example, the splotch of cortex near the upper vertical meridian (ventral boundary) of V1 that shows up in yellow in the upper plot but not the lower plot also has a weirdly high eccentricity and a polar angle near the opposite vertical meridian: almost certainly not the actual tuning of that patch of cortex. If this is the clearest example subject in the dataset, then the effect looks to me to be very small and inconsistently distributed across the visual areas. That said, I'm not convinced that the problem here is the data; rather, I think it's just very hard to communicate a small difference in parameter tuning across a visual area using this kind of side-by-side figure. I think that Figure S2, though noisy (as pRF maps typically are), is more convincing than Figure 4c, personally. For what it's worth, when looking at the data myself, I found that plotting log(𝜎(H) / 𝜎(L)), which will be unstable when noise causes 𝜎(H) or 𝜎(L) to approach zero, was less useful than plotting (𝜎(H) - 𝜎(L)) / (𝜎(H) + 𝜎(L)). This latter quantity will be constrained between -1 and 1 and shows something like a proportional change in the pRF size (and thus should be more comparable across eccentricity).
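
      The difference between the two plotted quantities can be seen in a small numerical sketch. The function names and the example sigma values are hypothetical, and both functions assume strictly positive pRF sizes.

```python
import math


def log_ratio(sigma_h, sigma_l):
    """log(sigma_H / sigma_L); unbounded as either sigma approaches zero."""
    return math.log(sigma_h / sigma_l)


def contrast_index(sigma_h, sigma_l):
    """(sigma_H - sigma_L) / (sigma_H + sigma_L); bounded within (-1, 1)."""
    return (sigma_h - sigma_l) / (sigma_h + sigma_l)


# Hypothetical vertices: a typical pair, then pairs where noise drives
# one of the two estimates toward zero.
for sigma_h, sigma_l in [(1.2, 1.0), (1.2, 0.01), (0.01, 1.0)]:
    print(sigma_h, sigma_l, log_ratio(sigma_h, sigma_l),
          contrast_index(sigma_h, sigma_l))
```

      Both quantities are antisymmetric when the two conditions are swapped, but only the contrast index stays on a fixed scale when one estimate collapses toward zero, so a few noisy vertices cannot dominate the color range of a map.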

      In my opinion, the inclusion of the pRF size comparison plots in the OSF repository and Figure S1 made a stronger case than any of the plots of the cortical surface. I would suggest putting these on log-log plots since the distribution of pRF size (like eccentricity) is approximately exponential on the cortical surface. As-is, it's clear in many plots that there is a big splotch of data in the compressed lower left corner, but it's hard to get a sense for how these should be compared to the upper right expanse of the plots. It is frequently hard to tell whether there is a greater concentration of points above or below the line of equality in the lower left corner as well, and this is fairly central to the paper's claims. My intuition is that the upper right is showing relatively little data (maybe 10%?), but these data are very emphasized by the current plots.
The authors might even want to consider putting a collection of these scatter-plots (or maybe just subject 007, or possibly all subjects' pRFs on a single scatter-plot) in the main paper and using these visualizations to provide intuitive support for the main conclusions about the fMRI data (where the manuscript currently uses Figure 4c for visual intuition).

      Minor Comments

      (1) Although eLife does not strictly require it, I would like to see more of the authors' code deposited along with the data (especially the code for calculating the statistics that were mentioned above). I do appreciate the simulation code that the authors added in the latest submission (largely added in response to my criticism in the previous reviews), and I'll admit that it helped me understand where the authors were coming from, but it also contains a bug and thus makes a good example of why I'd like to see more of the authors' code. If we set aside the scientific question of whether the simulation is representative of an fMRI voxel (more in Minor Comment 5, below), Figures 1A and the "AdaptaionEffectSimulated.png" file from the repository (https://osf.io/d5agf) imply that only small RFs were excluded in the high-adapted condition and only large RFs were excluded in the low-adapted condition. However, the script provided (SimlatePrfAdaptation.m: https://osf.io/u4d2h) does not do this. Lines 7 and 8 of the script set the small and large cutoffs at the 30th and 70th percentiles, respectively, then exclude everything greater than the 30th percentile in the "Large RFs adapted out" condition (lines 19-21) and exclude anything less than the 70th percentile in the "Small RFs adapted out" condition (lines 27-29). So the figures imply that they are representing 70% of the data but they are in fact representing only the most extreme 30% of the data. (Moreover, I was unable to run the script because it contains hard-coded paths to code in someone's home directory.) Just to be clear, these kinds of bugs are quite common in scientific code, and this bug was almost certainly an honest mistake.

      (2) I also noticed that the individual subject scatter-plots of high versus low adapted pRF sizes on the OSF seem to occasionally have a large concentration of values on the x=0 and y=0 axes. This isn't really a big deal in the plots, but the manuscript states that "we denoised the pRF data to remove artifactual vertices where at least one of the following criteria was met: (1) sigma values were equal to or less than zero ..." so I would encourage the authors to double-check that the rest of their analysis code was run with the stated filtering.

      (3) The manuscript also says that the median test was performed "on the raw pRF size values". I'm not really sure what the "raw" means here. Does this refer to pRF sizes without thresholding applied?

      (4) The eccentricity data are much clearer now with the additional comments from the authors and the full set of maps; my concerns about this point have been met.

      (5) Regarding the simulation of RFs in a voxel (setting aside the bug), I will admit both to hoping for a more biologically-grounded situation and to nonetheless understanding where the authors are coming from based on the provided example. What I mean by biologically-grounded: something like, assume a 2.5-mm isotropic voxel aligned to the surface of V1 at 4° of eccentricity; the voxel would span X to Y degrees of eccentricity, and we predict Z neurons with RFs in this voxel with a distribution of RF sizes at that eccentricity from [reference], etc. eventually demonstrating a plausible pRF size change commensurate to the paper's measurements. I do think that a simulation like this would make the paper more compelling, but I'll acknowledge that it probably isn't necessary and might be beyond the scope here.
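
      The thresholding issue described in Minor Comment (1) can be illustrated with a short sketch. This is a Python reconstruction based only on the description in that comment, not a port of the authors' MATLAB script: the function names, the `as_described_in_review` flag, and the lognormal RF sizes are all hypothetical.

```python
import random
import statistics


def percentile_cutoffs(sizes, low_pct=30, high_pct=70):
    """Return the low_pct-th and high_pct-th percentiles of the RF sizes."""
    cuts = statistics.quantiles(sizes, n=100, method="inclusive")
    return cuts[low_pct - 1], cuts[high_pct - 1]


def surviving_rfs(sizes, adapted_out, as_described_in_review=False):
    """RFs left after adapting out the 'large' or 'small' ones.

    With as_described_in_review=True the cutoffs are swapped, reproducing
    the behavior reported in the review (only the extreme 30% survive).
    """
    small_cut, large_cut = percentile_cutoffs(sizes)
    if adapted_out == "large":
        cut = small_cut if as_described_in_review else large_cut
        return [s for s in sizes if s <= cut]
    if adapted_out == "small":
        cut = large_cut if as_described_in_review else small_cut
        return [s for s in sizes if s >= cut]
    raise ValueError("adapted_out must be 'large' or 'small'")


random.seed(0)
rf_sizes = [random.lognormvariate(0.0, 0.5) for _ in range(10_000)]
for flag in (False, True):
    kept = surviving_rfs(rf_sizes, "large", as_described_in_review=flag)
    print(flag, len(kept) / len(rf_sizes))
```

      With the cutoffs applied as the figures imply, about 70% of the simulated RFs survive in each condition; with the swapped cutoffs, only about 30% do.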

    1. Author Response:

      Public Reviews:

      Reviewer #1 (Public review):

      Weaknesses:

      (1) It remains unclear how this stimulation protocol is proposed to enhance memory. Memories are believed to be stored by precise inputs to specific neurons and highly tuned changes in synaptic strengths. It remains unclear whether proposed neural activity generated by the stimulation reflects the activation of specific memories or generally increased activity across all classes of neurons.

      Thank you for raising the important issue of the actual neurophysiological effects of non-invasive brain stimulation. Unfortunately, invasive neurophysiological recordings in humans for this type of study are not feasible due to ethical constraints, while studies on cadavers or rodents would not fully resolve our question. Indeed, the authors of the cited study (Mihály Vöröslakos et al., Nature Communications, 2018) highlight the impossibility of drawing definitive conclusions about the exact voltage required in the in-vivo human brain due to significant differences between rats and humans, as well as the in-vivo human brain and cadavers due to alterations in electrical conductivity that occur in postmortem tissue.

      We acknowledge that further exploration of this aspect would be highly valuable, and we agree that it is worth discussing both as a technical limitation and as a potential direction for future research; we will therefore modify the manuscript accordingly. However, to address the challenge of in vivo recordings, we conducted Experiments 3 and 4, which respectively examined the neurophysiological and connectivity changes induced by the stimulation in a non-invasive manner. The observed changes in brain oscillatory activity (increased gamma oscillatory activity), cortical excitability (enhanced posteromedial parietal cortex reactivity), and brain connectivity (strengthened connections between the precuneus and hippocampi) provided evidence of the effects of our non-invasive brain stimulation protocol, further supporting the behavioral data.

      Additionally, we carefully considered the issue of stimulation distribution and, in response, performed a biophysical modeling analysis and E-field calculation using the parameters employed in our study (see Supplementary Materials).

      (2) The claim that effects directly involve the precuneus lacks strong support. The measurements shown in Figure 3 appear to be weak (i.e., Figure 3A top and bottom look similar, and Figure 3C left and right look similar). The figure appears to show a more global brain pattern rather than effects that are limited to the precuneus. Related to this, it would perhaps be useful to show the different positions of the stimulation apparatus. This could perhaps show that the position of the stimulation matters and could perhaps illustrate a range of distances over which position of the stimulation matters.

      Thank you for your feedback. We will improve the clarity of the manuscript to better address this important aspect. Our assumption that the precuneus plays a key role in the observed effects is based on several factors:

      (1) The non-invasive stimulation protocol was applied to an individually identified precuneus for each participant. Given existing evidence on TMS propagation, we can reasonably assume that the precuneus was at least a mediator of the observed effects (Ridding & Rothwell, Nature Reviews Neuroscience 2007). For further details about target identification and TMS and tACS propagation, please refer to the MRI data acquisition section in the main text and Biophysical modeling and E-field calculation section in the supplementary materials.

      (2) To investigate the effects of the neuromodulation protocol on cortical responses, we conducted a whole-brain analysis using multiple paired t-tests comparing each data point between different experimental conditions. To minimize the type I error rate, data were permuted with the Monte Carlo approach and significant p-values were corrected with the false discovery rate method (see the Methods section for details). The results identified the posterior-medial parietal areas as the only regions showing significant differences across conditions.

      (3) To control for potential generalized effects, we included a control condition in which TMS-EEG recordings were performed over the left parietal cortex (adjacent to the precuneus). This condition did not yield any significant results, reinforcing the cortical specificity of the observed effects.

      However, as stated in the Discussion, we do not claim that precuneus activity alone accounts for the observed effects. As shown in Experiment 4, stimulation led to connectivity changes between the precuneus and hippocampus, a network widely recognized as a key contributor to long-term memory formation (Bliss & Collingridge, Nature 1993). These connectivity changes suggest that precuneus stimulation triggered a ripple effect extending beyond the stimulation site, engaging the broader precuneus-hippocampus network.

      Regarding Figure 3A, it represents the overall expression of oscillatory activity detected by TMS-EEG. Since each frequency band has a different optimal scaling, the figure reflects a graphical compromise. A more detailed representation of the significant results is provided in Figure 3B. The effect sizes for gamma oscillatory activity in the delta T1 and T2 conditions were 0.52 and 0.50, respectively, which correspond to a medium effect based on Cohen’s d interpretation.

      (3) Behavioral results showing an effect on memory would substantiate claims that the stimulation approach produces significant changes in brain activity. However, placebo effects can be extremely powerful and useful, and this should probably be mentioned. Also, in the behavioral results that are currently presented, there are several concerns:

      a) There does not appear to be a significant effect on the STMB task.

      b) The FNAT task is minimally described in the supplementary material. Experimental details that would help the reader understand what was done are not described. Experimental details are missing for: the size of the images, the duration of the image presentation, the degree of image repetition, how long the participants studied the images, whether the names and occupations were different, genders of the faces, and whether the same participant saw different faces across the different stimulation conditions. Regarding the latter point, if the same participant saw the same faces across the different stimulation conditions, then there could be memory effects across different conditions that would need to be included in the statistical analyses. If participants saw different faces across the different stimulus conditions, then it would be useful to show that the difficulty was the same across the different stimuli.

      Thank you for pointing out the gaps in the description of the FNAT task. We will add all the required information to the manuscript.

      In the meantime, we provide the answers to your questions here. The images measured 19 x 15 cm. They were presented for 8 seconds each in the learning phase and in the immediate recall, while in the delayed recall they were shown (after the face recognition phase) until the subject answered. The learning phase, in which the name and occupation were shown together with the faces, lasted around 2 minutes including the instructions. We used a different set of stimuli for each stimulation condition, for a total of 3 parallel task forms balanced across conditions and session order. Each parallel form comprised 6 male and 6 female faces; for each sex there were 2 young adults (aged around 30 years), 2 middle-aged adults (aged around 50 years), and 2 older adults (aged around 70 years). Before the experiments, we ran a pilot study to ensure there were no differences between the parallel forms of the task. We can provide the task and its parallel forms upon request. The chance level in the immediate and delayed recall is not quantifiable, since the participants had to freely recall the name and the occupation without multiple-choice options. In the recognition phase, the chance level was around 33% (since there were 3 possible answers).

      c) Also, if I understand FNAT correctly, the task is based on just 12 presentations, and each point in Figure 2A represents a different participant. How the performance of individual participants changed across the conditions is unclear with the information provided. Lines joining performance measurements across conditions for each participant would be useful in this regard. Because there are only 12 faces, the results are quantized in multiples of 100/12 % in Figure 3A. While I do not doubt that the authors did their homework in terms of the statistical analyses, it seems as though these 12 measurements do not correspond to a large effect size. For example, in Figure 3A for the immediate condition (total), it seems that, on average, the participants may remember one more face/name/occupation.

      We will add another graph to the manuscript with lines connecting each participant's performance. Unfortunately, we were not able to incorporate it in the box-and-whisker plot.

      We apologize for the lack of clarity in the description of the FNAT. As you correctly pointed out, we used the percentage based on the single association between face, name and occupation (12 in total). However, each association consisted of three items, resulting in a total of 36 items to learn and associate – we will make it more explicit in the manuscript.

      In the example you mentioned, participants were, on average, able to recall three more items compared to the other conditions. While this difference may not seem striking at first glance, it is important to consider that we assessed memory performance after a single, three-minute stimulation session. Similar effects are typically observed only after multiple stimulation sessions (Koch et al., NeuroImage, 2018; Grover et al., Nature Neuroscience, 2022).

      d) Block effects. If I understand correctly, the experiments were conducted in blocks. This is potentially problematic. An example study that articulates potential problems associated with block designs is described in Li et al (TPAMI 2021, https://ieeexplore.ieee.org/document/9264220). It is unclear if potential problems associated with block designs were taken into consideration.

      Thank you for the interesting reference. According to this paper, in a block design, EEG or fMRI recordings are performed in response to different stimuli of a given class presented in succession. If this is the case, it does not correspond to our experimental design where both TMS-EEG and fMRI were conducted in a resting state on different days according to the different stimulation conditions.

      e) In the FNAT portion of the paper, some results are statistically significant, while others are not. The interpretation of this is unclear. In Figure 3A, it seems as though the authors claim that iTBS+gtACS > iTBS+sham-tACS, but iTBS+gtACS ~ sham+sham. The interpretation of such a result is unclear. Results are also unclear when separated by name and occupation. There is only one condition that is statistically significant in Figure 3A in the name condition, and no significant results in the occupation condition. In short, the statistical analyses, and accompanying results that support the authors’ claims, should be explained more clearly.

      Thank you again for your feedback. We will work on making the large amount of data we reported easier to interpret.

      Hoping to have thoroughly addressed your initial concerns in our previous responses, we now move on to your observations regarding the behavioral results, assuming you were referring to Figure 2A. The main finding of this study is the improvement in long-term memory performance, specifically the ability to correctly recall the association between face, name, and occupation (total FNAT), which was significantly enhanced in both Experiments 1 and 2. However, we also aimed to explore the individual contributions of name and occupation separately to gain a deeper understanding of the results. Our analysis revealed that the improvement in total FNAT was primarily driven by an increase in name recall rather than occupation recall. We understand that this may have caused some confusion. Therefore we will clarify this in the manuscript and consider presenting the name and occupation in a separate plot.

      Regarding the stimulation conditions, your concerns about the performance pattern (iTBS+gtACS > iTBS+sham-tACS, but iTBS+gtACS ~ sham+sham) are understandable. However, this new protocol was developed precisely in response to the variability observed in behavioral outcomes following non-invasive brain stimulation, particularly when used to modulate memory functions (Corp et al., 2020; Pabst et al., 2022). As discussed in the manuscript, it is intended as a boost to conventional non-invasive brain stimulation protocols, leveraging the mechanisms outlined in the Discussion section.

      Reviewer #2 (Public review):

      Weaknesses:

      (1) The study did not include a condition where γtACS was applied alone. This was likely because a previous work indicated that a single 3-minute γtACS did not produce significant effects, but this limits the ability to isolate the specific contribution of γtACS in the context of this target and memory function

      Thank you for your comments. As you pointed out, we did not include a condition where γtACS was applied alone. This decision was based on the findings of Guerra et al. (Brain Stimulation 2018), who investigated the same protocol and reported no aftereffects. Given the substantial burden of the experimental design on patients and our primary goal of demonstrating an enhancement of effects compared to the standalone iTBS protocol, we decided to leave out this condition. However, we agree that investigating the effects of γtACS alone is an interesting and relevant aspect worthy of further exploration. In line with these observations, we will expand the discussion on this point in the study’s limitations section.

      (2) The authors applied stimulation for 3 minutes, which seems to be based on prior tACS protocols. It would be helpful to present some rationale for both the duration and timing relative to the learning phase of the memory task. Would you expect additional stimulation prior to recall to benefit long-term associative memory?

      Thank you for your comment and for raising this interesting point. As you correctly noted, the protocol we used has a duration of three minutes, a choice based on previous studies demonstrating its greater neurophysiological efficacy compared with single stimulation. Specifically, these studies have shown that the combined stimulation enhanced gamma-band oscillations and increased cortical plasticity (Guerra et al., Brain Stimulation 2018; Maiella et al., Scientific Reports 2022). Given that the precuneus (Brodt et al., Science 2018; Schott et al., Human Brain Mapping 2018), gamma oscillations (Osipova et al., Journal of Neuroscience 2006; Deprés et al., Neurobiology of Aging 2017; Griffiths et al., Trends in Neurosciences 2023), and cortical plasticity (Brodt et al., Science 2018) are all associated with encoding processes, we decided to apply the co-stimulation immediately before encoding to enhance its efficacy.

      Regarding the question of whether stimulation could also benefit recall, the answer is yes. We can speculate that repeating the stimulation before recall might provide an additional boost. This is supported by evidence showing that both the precuneus and gamma oscillations are involved in recall processes (Flanagin et al., Cerebral Cortex 2023; Griffiths et al., Trends in Neurosciences 2023). Furthermore, previous research suggests that reinstating the same brain state as during encoding can enhance recall performance (Javadi et al., The Journal of Neuroscience 2017).

      We will expand the study rationale and include these considerations in the future directions section.

      (3) How was the burst frequency of theta iTBS and gamma frequency of tACS chosen? Were these also personalized to subjects' endogenous theta and gamma oscillations? If not, were increases in gamma oscillations specific to patients' endogenous gamma oscillation frequencies or the tACS frequency?

      The stimulation protocol was chosen based on previous studies (Guerra et al., Brain Stimulation 2018; Maiella et al., Scientific Reports 2022). The gamma tACS sinusoidal waveform was set at 70 Hz, while iTBS consisted of ten bursts of three pulses at 50 Hz lasting 2 s, repeated every 10 s with an 8 s pause between consecutive trains, for a total of 600 pulses lasting 190 s (see iTBS+γtACS neuromodulation protocol section). In particular, the theta iTBS was inspired by protocols used in animal models to elicit LTP in the hippocampus (Huang et al., Neuron 2005). Consequently, neither the theta burst frequency of iTBS nor the gamma frequency of tACS was personalized. The increase in gamma oscillations was measured relative to the patient’s baseline and did not correspond to the administered tACS frequency.
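
      The pulse counts quoted above can be checked with simple arithmetic. The sketch below is illustrative only: the constant names are hypothetical, and reading the 190 s figure as the onset of the final train (rather than the end of its last pulse) is our assumption, not something stated in the response.

```python
# Illustrative arithmetic for the iTBS timing described in the response.
# All constant names are hypothetical; the values are those quoted in the text.
PULSES_PER_BURST = 3        # three pulses at 50 Hz per burst
BURSTS_PER_TRAIN = 10       # ten bursts per train
TRAIN_DURATION_S = 2        # each train lasts 2 s
INTER_TRAIN_PAUSE_S = 8     # pause between consecutive trains
TOTAL_PULSES = 600          # total pulses delivered

pulses_per_train = PULSES_PER_BURST * BURSTS_PER_TRAIN      # 30 pulses
n_trains = TOTAL_PULSES // pulses_per_train                 # 20 trains
cycle_s = TRAIN_DURATION_S + INTER_TRAIN_PAUSE_S            # one train every 10 s
burst_rate_hz = BURSTS_PER_TRAIN / TRAIN_DURATION_S         # 5 Hz, i.e. theta range

# Assumption: no pause follows the final train.
last_train_onset_s = (n_trains - 1) * cycle_s               # 190 s
last_train_end_s = last_train_onset_s + TRAIN_DURATION_S    # 192 s

print(f"{n_trains} trains of {pulses_per_train} pulses, bursts at {burst_rate_hz:.0f} Hz")
print(f"final train starts at {last_train_onset_s} s and ends at {last_train_end_s} s")
```

      Under these assumptions the protocol comprises 20 trains, and the stated duration of about 190 s is consistent with the stated total of 600 pulses.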

      (4) The authors do a thorough job of analyzing the increase in gamma oscillations in the precuneus through TMS-EEG; however, the authors may also analyze whether theta oscillations were also enhanced through this protocol due to the iTBS potentially targeting theta oscillations. This may also be more robust than gamma oscillations increases since gamma oscillations detected on the scalp are very low amplitude and susceptible to noise and may reflect activity from multiple overlapping sources, making precise localization difficult without advanced techniques.

      Thank you for the suggestion. We analyzed theta oscillations and found no changes.

      (5) Figure 4: Why are connectivity values pre-stimulation for the iTBS and sham tACS stimulation condition so much higher than the dual stimulation? We would expect baseline values to be more similar.

      We acknowledge that the pre-stimulation connectivity values for the iTBS and sham tACS conditions appear higher than those for the dual stimulation condition. However, as noted in our statistical analyses, there were no significant differences at baseline between conditions (p-FDR= 0.3514), suggesting that any apparent discrepancy is due to natural variability rather than systematic bias. One potential explanation for these differences is individual variability in baseline connectivity measures, which can fluctuate due to factors such as intrinsic neural dynamics, participant state, or measurement noise. Despite these variations, our statistical approach ensures that any observed post-stimulation effects are not confounded by pre-existing differences.

      (6) Figure 2: How are total association scores significantly different between stimulation conditions, but individual name and occupation associations are not? Further clarification of how the total FNAT score is calculated would be helpful.

      We apologize for any lack of clarity. The total FNAT score reflects the ability to correctly recall all the information associated with a person—specifically, the correct pairing of the face, name, and occupation. Participants received one point for each triplet they accurately recalled. The scores were then converted into percentages, as detailed in the Face-Name Associative Task Construction and Scoring section in the supplementary materials.
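
      The scoring rule just described can be sketched in a few lines. This is a minimal illustration, not the authors' scoring script: the `Recall` type, the `fnat_scores` function, and the example participant are hypothetical, and we assume that the separate name and occupation scores count each item regardless of whether the other item of the triplet was recalled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recall:
    """Outcome for one face (hypothetical structure for illustration)."""
    name_correct: bool
    occupation_correct: bool


def fnat_scores(recalls):
    """Return total, name, and occupation scores as percentages.

    Total: one point per face only when both name and occupation are correct.
    Name / occupation: one point per correctly recalled item (assumption).
    """
    n = len(recalls)
    total = sum(r.name_correct and r.occupation_correct for r in recalls)
    name = sum(r.name_correct for r in recalls)
    occupation = sum(r.occupation_correct for r in recalls)
    return {
        "total": 100.0 * total / n,
        "name": 100.0 * name / n,
        "occupation": 100.0 * occupation / n,
    }


# Hypothetical participant: 12 faces, 5 full triplets, 2 name-only,
# 1 occupation-only, 4 with nothing recalled.
recalls = (
    [Recall(True, True)] * 5
    + [Recall(True, False)] * 2
    + [Recall(False, True)] * 1
    + [Recall(False, False)] * 4
)
scores = fnat_scores(recalls)
print({key: round(value, 2) for key, value in scores.items()})
```

      Because the total score requires both items of a triplet to be correct, it can differ between conditions even when the separate name and occupation scores change only modestly.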

      Total FNAT was the primary outcome measure. However, we also analyzed name and occupation recall separately to better understand their individual contributions. Our analysis revealed that the improvement in total FNAT was primarily driven by an increase in name recall rather than occupation recall.

      We acknowledge that this distinction may have caused some confusion. To improve clarity, we will revise the manuscript accordingly and consider presenting name and occupation recall in separate plots.

      Reviewer #3 (Public review):

      Weaknesses:

      I want to state clearly that I think the strengths of this study far outweigh the concerns I have. I still list some points that I think should be clarified by the authors or taken into account by readers when interpreting the presented findings.

      I think one of the major weaknesses of this study is the overall low sample size in all of the experiments (between n = 10 and n = 20). This is, as I mentioned when discussing the strengths of the study, partly mitigated by the within-subject design and individualized stimulation parameters. The authors mention that they performed a power analysis but this analysis seemed to be based on electrophysiological readouts similar to those obtained in experiment 3. It is thus unclear whether the other experiments were sufficiently powered to reliably detect the behavioral effects of interest. That being said, the authors do report significant effects, so they were per definition powered to find those. However, the effect sizes reported for their main findings are all relatively large and it is known that significant findings from small samples may represent inflated effect sizes, which may hamper the generalizability of the current results. Ideally, the authors would replicate their main findings in a larger sample. Alternatively, I think running a sensitivity analysis to estimate the smallest effect the authors could have detected with a power of 80% could be very informative for readers to contextualize the findings. At the very least, however, I think it would be necessary to address this point as a potential limitation in the discussion of the paper.

      Thank you for the observation. As you mentioned, our power analysis was based on our previous study investigating the same neuromodulation protocol with a corresponding experimental design. The relatively small sample could be considered a possible limitation of the study, which we will add to the discussion. A fundamental future step will be to replicate these results in a larger population. However, to strengthen our results, we performed the sensitivity analysis you suggested.

      In detail, we performed a sensitivity analysis for repeated-measures ANOVA with α=0.05 and power (1-β)=0.80, with no sphericity correction. For experiment 1, a sensitivity analysis with 1 group and 3 measurements showed a minimal detectable effect size of f=0.524 with 20 participants. In our paper, the ANOVA on total FNAT immediate performance revealed an effect size of η²=0.274, corresponding to f=0.614; the ANOVA on FNAT delayed performance revealed an effect size of η²=0.236, corresponding to f=0.556. For experiment 2, a sensitivity analysis for total FNAT immediate performance (1 group and 3 measurements) showed a minimal detectable effect size of f=0.797 with 10 participants. In our paper, the ANOVA on total FNAT immediate performance revealed an effect size of η²=0.448, corresponding to f=0.901. The sensitivity analysis for total FNAT delayed performance (1 group and 6 measurements) showed a minimal detectable effect size of f=0.378 with 10 participants. In our paper, the ANOVA on total FNAT delayed performance revealed an effect size of η²=0.484, corresponding to f=0.968. Thus, the sensitivity analysis showed that both experiments were sufficiently powered to detect the minimum effect size computed in the power analysis. We have now added this information to the manuscript and we thank the reviewer for her/his suggestion.
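
      The η² and f values quoted above are related by the standard conversion f = sqrt(η² / (1 − η²)). The short check below applies that conversion to the four reported effect sizes; the function name is hypothetical, and we assume (consistent with the numbers, though not stated in the response) that this is the conversion the authors used.

```python
import math


def cohens_f_from_eta_squared(eta_sq: float) -> float:
    """Convert (partial) eta squared to Cohen's f: f = sqrt(eta^2 / (1 - eta^2))."""
    if not 0.0 <= eta_sq < 1.0:
        raise ValueError("eta squared must lie in [0, 1)")
    return math.sqrt(eta_sq / (1.0 - eta_sq))


# (eta squared, f) pairs as quoted in the response.
reported = [(0.274, 0.614), (0.236, 0.556), (0.448, 0.901), (0.484, 0.968)]

for eta_sq, f_reported in reported:
    f_computed = cohens_f_from_eta_squared(eta_sq)
    print(f"eta^2 = {eta_sq:.3f} -> f = {f_computed:.3f} (reported {f_reported:.3f})")
```

      All four reported f values agree with this conversion to three decimal places, and each exceeds the corresponding minimal detectable effect size.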

      It seems that the statistical analysis approach differed slightly between studies. In experiment 1, the authors followed up significant effects of their ANOVAs by Bonferroni-adjusted post-hoc tests whereas it seems that in experiment 2, those post-hoc tests were "exploratory", which may suggest those were uncorrected. In experiment 3, the authors use one-tailed t-tests to follow up their ANOVAs. Given some of the reported p-values, these choices suggest that some of the comparisons might have failed to reach significance if properly corrected. This is not a critical issue per se, as the important test in all these cases is the initial ANOVA but non-significant (corrected) post-hoc tests might be another indicator of an underpowered experiment. My assumptions here might be wrong, but even then, I would ask the authors to be more transparent about the reasons for their choices or provide additional justification. Finally, the authors sometimes report exact p-values whereas other times they simply say p < .05. I would ask them to be consistent and recommend using exact p-values for every result where p >= .001.

      Thank you again for the suggestions. Your observations are correct; we used a slightly different statistical approach depending on our hypotheses. Here are the details:

      In experiment 1, we used a repeated-measures ANOVA with one factor, “stimulation condition” (iTBS+γtACS; iTBS+sham-tACS; sham-iTBS+sham-tACS). Following the significant effect of this factor, we performed post-hoc analysis with Bonferroni correction.

      In experiment 2, we used a repeated-measures ANOVA with two factors, “stimulation condition” and “time”. As expected, we observed a significant effect of condition, confirming the result of experiment 1, but not of time. This means that the neuromodulatory effect was present regardless of the time point. However, to explore whether the effects of stimulation condition were present at each time point, we performed exploratory t-tests with no correction for multiple comparisons, since this was only an exploratory analysis.

      In experiment 3, we used the same approach as in experiment 1. However, since we had a specific hypothesis on the direction of the effect already observed in our previous study, i.e. an increase in spectral power (Maiella et al., Scientific Reports 2022), our tests were one-tailed.
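
      To make the practical difference between these choices concrete, the sketch below shows how Bonferroni adjustment and one-tailed testing change a p-value. It is an illustration only: the function names are hypothetical, the p-values are invented and are not taken from the study, and the halving rule assumes a symmetric test statistic such as t.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def one_tailed_p(two_tailed_p, effect_in_predicted_direction):
    """Convert a two-tailed p-value to a one-tailed one (symmetric statistic)."""
    half = two_tailed_p / 2.0
    return half if effect_in_predicted_direction else 1.0 - half


# Hypothetical two-tailed p-values for three pairwise comparisons
# (invented for illustration, NOT results from the study).
raw = [0.004, 0.030, 0.400]
adjusted = bonferroni_adjust(raw)

for p_raw, p_adj in zip(raw, adjusted):
    print(f"raw p = {p_raw:.3f} -> Bonferroni-adjusted p = {p_adj:.3f}")

# A two-tailed p of 0.08 becomes 0.04 one-tailed when the effect
# lies in the predicted direction.
print(f"one-tailed p = {one_tailed_p(0.08, True):.3f}")
```

      With three comparisons, a raw p-value of 0.030 no longer falls below 0.05 after Bonferroni adjustment, which is the kind of case the reviewer's comment is concerned with.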

      For the p-values, we will revise the manuscript to report the exact values for every result.

      While the authors went to great lengths trying to probe the neural changes likely associated with the memory improvement after stimulation, it is impossible from their data to causally relate the findings from experiments 3 and 4 to the behavioral effects in experiments 1 and 2. This is acknowledged by the authors and there are good methodological reasons for why TMS-EEG and fMRI had to be collected in separate experiments, but it is still worth pointing out to readers that this limits inferences about how exactly dual iTBS and γtACS of the precuneus modulate learning and memory.

      Thank you for your comment. We fully agree with your observation, which is why this aspect has been considered in the study's limitations. To address your concern, we will further emphasize the fact that our findings do not allow precise inferences regarding the specific mechanisms by which dual iTBS and γtACS of the precuneus modulate learning and memory.

      There were no stimulation-related performance differences in the short-term memory task used in experiments 1 and 2. The authors argue that this demonstrates that the intervention specifically targeted long-term associative memory formation. While this is certainly possible, the STM task was a spatial memory task, whereas the LTM task relied (primarily) on verbal material. It is thus also possible that the stimulation effects were specific to a stimulus domain instead of memory type. In other words, could it be possible that the stimulation might have affected STM performance if the task taxed verbal STM instead? This is of course impossible to know without an additional experiment, but the authors could mention this possibility when discussing their findings regarding the lack of change in the STM task.

      Thank you for your insightful observation. We argue that the intervention primarily targeted long-term associative memory formation, as our findings demonstrated effects only on FNAT. However, as you correctly pointed out, we cannot exclude the possibility that the stimulation may also influence short-term verbal associative memory. We will acknowledge this potential effect when discussing the absence of significant findings in the STM task.

      While the authors discuss the potential neural mechanisms by which the combined stimulation conditions might have helped memory formation, the psychological processes are somewhat neglected. For example, do the authors think the stimulation primarily improves the encoding of new information or does it also improve consolidation processes? Interestingly, the beneficial effect of dual iTBS and γtACS on recall performance was very stable across all time points tested in experiments 1 and 2, as was the performance in the other conditions. Do the authors have any explanation as to why there seems to be no further forgetting of information over time in either condition when even at immediate recall, accuracy is below 50%? Further, participants started learning the associations of the FNAT immediately after the stimulation protocol was administered. What would happen if learning started with a delay? In other words, do the authors think there is an ideal time window post-stimulation in which memory formation is enhanced? If so, this might limit the usability of this procedure in real-life applications.

      Thank you for your comment and for raising these important points.

      We hypothesized that co-stimulation would enhance encoding processes. Previous studies have shown that co-stimulation can enhance gamma-band oscillations and increase cortical plasticity (Guerra et al., Brain Stimulation 2018; Maiella et al., Scientific Reports 2022). Given that the precuneus (Brodt et al., Science 2018; Schott et al., Human Brain Mapping 2018), gamma oscillations (Osipova et al., Journal of Neuroscience 2006; Deprés et al., Neurobiology of Aging 2017; Griffiths et al., Trends in Neurosciences 2023), and cortical plasticity (Brodt et al., Science 2018) have all been associated with encoding processes, we decided to apply co-stimulation before the encoding phase, to boost it.

      We applied the co-stimulation immediately before the learning phase to maximize its potential effects. While we observed a significant increase in gamma oscillatory activity lasting up to 20 minutes, we cannot determine whether the behavioral effects we observed would have been the same with a co-stimulation applied 20 minutes before learning. Based on existing literature, a reduction in the efficacy of co-stimulation over time could be expected (Huang et al., Neuron 2005; Thut et al., Brain Topography 2009). However, we hypothesize that multiple stimulation sessions might provide an additional boost, helping to sustain the effects over time (Thut et al., Brain Topography 2009; Koch et al., Neuroimage 2018; Koch et al., Brain 2022).

      Regarding the absence of further forgetting in both stimulation conditions, we think that the clinical and demographic characteristics of the sample (i.e. young and healthy subjects) explain the near absence of forgetting after one week.

    1. Author response:

      Point-by-point description of the revisions

      Reviewer #1 (Evidence, reproducibility and clarity):

      The study is well-executed and provides many interesting leads for further experimental studies, which makes it very important. One of the significant hypotheses in this context is metazoan Wnt Lipocone domain interactions with lipids, which remain to be explored.

      The manuscript is generally navigable for interesting reading despite being content-rich. Overall, the figures are easy to follow.

      We thank the reviewer for the thoughtful and favorable assessment.

      Major comments:

      I urge the authors to consider creating a first figure summarizing the broad approach and process involved in discovering the lipocone superfamily. This would help the average reader easily follow the manuscript.

      It will be helpful to have the final model/synthesis figure, which provides a take-home message that combines the main deductions from Fig 1c, Fig 4, Fig 5, and Fig 6 to provide an eagle's eye view (also translating the arguments on Page 38 last para into this potential figure).

      We have generated a two-part figure that synthesizes these two requests, also in line with the recommendations made by Reviewer 3. Depending on the accepting Review Commons journal, we plan to either submit this as a graphical abstract/TOC figure (as suggested by Reviewer 3) or as a single figure. We prefer starting with the first approach as it will keep our figure count the same.

      Minor comments:

      Fig 1C: The authors should provide a statistical estimate of the difference in transmembrane tendency scores between the "membrane" and "globular" versions of the Lipocone domains.

      To address this, we calculated group-wise differences using the Kruskal-Wallis nonparametric test, followed by Dunn’s test with Bonferroni correction for a more stringent evaluation. The results are presented as a critical difference diagram in the new Supplementary Figure S3. The analysis is explained in the Methods section of the revised manuscript, and the statistically significant difference is mentioned in the text. This analysis identifies three groups of significantly different Lipocone families based on their transmembrane tendency: those predicted (or known) to associate with the prokaryotic membranes, those predicted to be diffusible, and a small number of families residing in eukaryotic ER membranes or bacterial outer membranes.
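
      For readers unfamiliar with the first step of this analysis, the Kruskal-Wallis H statistic can be computed from pooled ranks as sketched below. This is a simplified illustration rather than the authors' pipeline: the function names, group labels, and scores are hypothetical, no tie correction is applied, the p-value (from a chi-square distribution with k − 1 degrees of freedom) is not computed, and Dunn's post-hoc test is not shown. In practice a library routine such as `scipy.stats.kruskal` would normally be used.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1), without tie correction."""
    pooled = [x for group in groups for x in group]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    weighted_sum = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_sum += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted_sum - 3.0 * (n_total + 1)


# Hypothetical transmembrane-tendency scores for three groups of families
# (invented values, for illustration only).
scores = {
    "membrane": [1.9, 2.4, 2.1, 2.8],
    "diffusible": [-0.6, -0.2, 0.1, -0.9],
    "ER_or_outer_membrane": [0.7, 1.1, 0.9],
}
h_stat = kruskal_wallis_h(list(scores.values()))
print(f"Kruskal-Wallis H = {h_stat:.3f}")
```

      A large H relative to the chi-square reference distribution indicates that at least one group differs, which is what motivates the pairwise Dunn's comparisons described above.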

      Reviewer #2 (Evidence, reproducibility and clarity):

      This is a remarkable study, one of a kind. The authors trace the entire huge superfamily containing Wnt proteins, whose origins remained obscure before this work. Even more amazingly, they show that Wnts originated from transmembrane enzymes. The work is masterfully executed and presented. The conclusions are strongly supported by multiple lines of evidence. Illustrations are beautifully crafted. This is an exemplary work of how modern sequence and structure analysis methods should be used to gain unprecedented insights into protein evolution and origins.

      We thank the reviewer for the positive evaluation of our work.

      Minor comments.

      (1) In fig 1, VanZ structure looks rather different from the rest and is a more tightly packed helical bundle. It might be useful for the readers to learn more about the arguments why authors consider this family to be homologous with the rest, and what caused these structural changes in packing of the helices.

      First, the geometry of an α-helix can be approximated as a cylinder, resulting in contact points that are relatively small. Fewer contact constraints can lead to structural variation in the angular orientations between the helices of an all α-helical domain, resulting in some dispersion in space of the helical axes. As a result, some of the views can be a bit confounding when presented as static 2D images. Second, of the two VanZ clades, the characteristic structure similar to the other superfamily members is more easily seen in the VanZ-2 clade (as illustrated in supplementary Figure S2).

      Importantly, the membership of the VanZ domains was recovered via significant hits in our sequence analysis of the superfamily. Moreover, when the sequence alignments of the active site are compared (Figure 2), VanZ retains the conserved active site residue positions, which are predicted to reside spatially in the same location and project into an equivalent active site pocket as seen in the other families in the superfamily. Further, this sequence relationship is captured by the edges in the network in Figure 1B: multiple members of the superfamily show edges indicating significant relationships with the two VanZ families (e.g., HHSearch hits of probability greater than 90%; p<0.0001 are observed between VanZ-1 and Skillet-DUF2809, Skillet-1, Skillet-4, YfiM-1, YfiM-DUF2279, Wok, pPTDSS, and cpCone-1). Thus, they occupy relatively central locations in the sequence similarity network, indicating a consistent sequence similarity connection to multiple other families.

      (2) Fig. 4 color bars before names show a functional role. How does the blue bar "described for the first time" fit into this logic? Maybe some other way to mark this (an asterisk?) could be better to resolve this semantic inconsistency.

      We have replaced the blue bars with asterisks following the family names, as now stated in the updated legend.

      Reviewer #3 (Evidence, reproducibility and clarity):

      The manuscript by Burroughs et al. uses informatic sequence analysis and structural modeling to define a very large, new superfamily which they dub the Lipocone superfamily, based on its function on lipid components and cone-shaped structure. The family includes known enzymatic domains as well as previously uncharacterized proteins (30 families in total). Support for the superfamily designation includes conserved residues located on the homologous helical structures within the fold. The findings include analyses that shed light on important evolutionary relationships including a model in which the superfamily originated as membrane proteins where one branch evolved into a soluble version. Their mechanistic proposals suggest possible functions for enzymes currently unassigned. There is also support for the evolutionary connection of this family with the human immune system. The work will be of interest to those in the broad areas of bioinformatics, enzyme mechanisms, and evolution. The work is technically well performed and presented.

      We appreciate the positive evaluation of our work by the reviewer.

      Referees cross-commenting

      All the comments seem useful to me. I like Reviewer 1's suggestion for a flowchart showing the methodology. I think the summarizing figure suggested could be a TOC abstract, which many journals request.

      To accommodate this comment (along with Reviewer 1’s comments), we have generated a two-part figure containing the methodology flowchart and the summary of findings. Combining the two provides some before-and-after symmetry to a TOC figure, while also avoiding further inflation of the figure count, which would likely be an issue at one or more of the Review Commons journals.

      The authors may wish to consider the following points (page numbers from PDF for review):

      (1) It would be useful in Fig 1A, either in main text or the supporting information, to also have an accompanying topology diagram. I like the coloring of the helices to show the homology, but the connections between them are hard to follow.

      We acknowledge the reviewer’s concern as one shared by ourselves. We have placed such a topology diagram in Figure 1A, and now refer to it at multiple points in the manuscript text.

      (2) Page: 6- In the paragraph marked as an example- please call out Fig1A when the family mentioned is described (I believe SAA is described as one example)

      We have added these pointers in the text, where appropriate.

      (3) Page: 7- The authors state "these 'hydrophobic families' often evince a deeper phyletic distribution pattern than the less-hydrophobic families (Figure S1), implying that the ancestral version of the superfamily was likely a TM domain" there should be more explanation or information here - I am not certain from looking at FigS1 what a deeper phyletic distribution pattern means. Perhaps explaining for a single example? I also see that this important point is discussed in the conclusions- it is useful to point to the conclusion here.

      Our use of ‘deeper’ in this context is meant to convey the concept that more widely conserved families/clades (both across and within lineages) suggest an earlier emergence. In the Lipocone superfamily, this phylogenetic reasoning supports an evolutionary scenario where the membrane-inserted versions generally emerged early, while the solubilized versions, which are found in relatively fewer lineages, emerged later.

      To address this objectively, we have calculated a simple phyletic distribution metric that combines the phyletic spread of a Lipocone clade with its depth within individual lineages, which is then plotted as a bar graph (Supplemental Figure S1). Briefly, this takes the width of the bar as the phyletic spread across the number of distinct taxonomic lineages and its height as a weighted mean of occurrence within each lineage (depth). The latter helps dampen the effects of sampling bias. In the resulting graph, lineages with a lower height and width are likely to have been derived later than those with a greater height and width. A detailed description clarifying this has been added to the Methods section of the revised manuscript. The results support two statements that are made in the text: 1) that the Wok and VanZ clades are the most widely and deeply represented clades in the superfamily, and 2) that the predicted transmembrane versions tend to be more widely and deeply distributed. We have also added a statement in the Results with a pointer to Figure S1 to clarify this point raised by the referee.
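      For readers who want a concrete picture of the two quantities described above, the following is a minimal sketch only, not the procedure from the Methods. It assumes that "spread" is the count of lineages in which the clade occurs at least once, and that "depth" is a weighted mean of the fraction of sampled genomes per lineage that encode the clade. The function name, the equal default weights, and the lineage counts are all hypothetical.

```python
def phyletic_profile(hits, sampled, weights=None):
    """Return (spread, depth) for one clade.

    hits[lineage]    -> number of genomes in that lineage encoding the clade
    sampled[lineage] -> number of genomes sampled from that lineage
    weights[lineage] -> optional per-lineage weight (ASSUMPTION: equal
                        weights when omitted; the real weighting scheme is
                        whatever the manuscript's Methods define)
    """
    # Lineages in which the clade is found at least once.
    present = [lineage for lineage, n in hits.items() if n > 0]
    spread = len(present)  # bar width
    if spread == 0:
        return 0, 0.0
    if weights is None:
        weights = {lineage: 1.0 for lineage in present}
    total_weight = sum(weights[lineage] for lineage in present)
    # Using per-lineage fractions rather than raw counts keeps heavily
    # sampled lineages from dominating the height of the bar.
    depth = sum(
        weights[lineage] * hits[lineage] / sampled[lineage] for lineage in present
    ) / total_weight
    return spread, depth


# Hypothetical counts, for illustration only.
example_hits = {"lineage_A": 50, "lineage_B": 10, "lineage_C": 0}
example_sampled = {"lineage_A": 100, "lineage_B": 40, "lineage_C": 20}
spread, depth = phyletic_profile(example_hits, example_sampled)
```

      Under these assumptions a clade found in many lineages, and in a large fraction of genomes within each, yields a wide and tall bar.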

      (4) For figure 3 I would suggest instead of coloring by atom type- to color the leaving group red and the group being added blue so the reader can see where the moieties start and end in substrates and products

      We have retained the atom type coloring in the figure for ease of visualizing the atom types. However, to address the reviewer’s concern, we have added dashed colored circles to highlight attacking and leaving groups in the reactions. The legend has been updated accordingly.

      (5) Page: 13- The authors state "While the second copy in these versions is catalytically inactive, the H1' from the second duplicate displaces the H1 from the first copy," So this results in a "sort of domain swap" correct? It may be more clear to label both copies in Figure 3 upper right so it is easier for the reader to follow.

      We have added these labels to the updated Figure S4 (formerly S3).

      (6) The authors state "In addition to the fusion to the OMP β-barrel, the YfiM-DUF2279 family (Figure 5H) shows operonic associations with a secreted MltG-like peptidoglycan lytic transglycosylase (127,128), a lipid anchored cytochrome c heme-binding domain (129), a phosphoglucomutase/phosphomannomutase enzyme (130), a GNAT acyltransferase (131), a diaminopimelate (DAP) epimerase (132), and a lysozyme like enzyme (133). In a distinct operon, YfiM-DUF2279 is combined with a GT-A glycosyltransferase domain (79), a further OMP β-barrel, and a secreted PDZ-like domain fused to a ClpP-like serine protease (134,135) (Figure 5H)." this combination of enzymes sounds like those in the pathways for oligosaccharide synthesis which is cytoplasmic but the flippase acts to bring the product to the periplasm. Please make sure it is clear that these enzymes may act at different faces of the membrane.

      We have made that point explicit in the revised manuscript in the paragraph following the above-quoted statement.

      (7) Page: 21- the authors should remove the unpublished observations on other RDD domain or explain or cite them

      The analysis of the RDD domain is part of a distinct study whose manuscript we are currently preparing, and explaining its many ramifications would be outside the scope of this manuscript. Moreover, placing even an account of it in this manuscript would break its flow and take the focus away from the Lipocone superfamily. Further, inclusion of the RDD story would substantially increase the size of the manuscript. However, the RDD domain is commonly fused to the Lipocone domain; hence, we would be remiss to remove all reference to it. Accordingly, we retain a brief account of the RDD-fused Lipocone domains in the revised manuscript that is just sufficient to make the relevant functional case.

      (8) Page: 34- The authors state "For instance, the emergence of the outer membrane in certain bacteria was potentially coupled with the origin of the YfiM and Griddle clades (Figure 4)." I don't see origin point indicated in figure 4 (emergence of outer membrane- this may be helpful to indicate in some way- also I am not certain what the dashed circles in Fig 4 are indicating- its not in the legend?

      This annotation has been added to the revised Figure 4, and the point of recruitment is indicated with an “X” sign, along with a clarification in the legend regarding the dashed circles.

      (9) In terms of the hydrophobicity analysis, it would be good to mark on the plot (Fig 1C) one or two examples of lipocone members with known structure that are transmembrane proteins as a positive control

      We have added these markers (colored triangles and squares) for these families to the plot.

      Grammar, typos

      Page: 3- abstract severance is an odd word to use for hydrolysis or cleavage

      We have changed it to “cleavage”.

      Page: 5- "While the structure of Wnt was described over a decade prior" should read "Although the structure of ..."

      Page 7 - "One family did not yield a consistent prediction for orientation"- please state which family

      Page: 8 "While the ancestral pattern is noticeably degraded in the metazoan Wnt (Met-Wnt) family, it is strongly preserved in the prokaryotic Min-Wnt family." Should read "Although the ancestral..."

      throughout- please replace solved with experimentally determined to be clear and avoid jargon

      Please replace "TelC severs the link" with "TelC cleaves the bond "

      We have made the above changes.

      Page: 19- the authors state "a lipobox-containing synaptojanin superfamily phosphoesterase (125) and a secreted R-P phosphatase (126) (see Figure 6, Supplementary Data)" I was uncertain if the authors meant Fig S6 or they meant see Fig 6 and something else in supplementary data. Please fix.

      In this pointer, we intended to flag the relevant gene neighborhoods in both Figures 5H and 6, as well as highlight the additional examples contained in the Supplementary Data. We have updated the point

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      As stated by the authors in the introduction, the RNA-binding protein Sxl is foundational to understanding sex determination in Drosophila. Sxl has been extensively studied as the master regulator of female sex determination in the soma, where it is known to initiate an alternative splicing cascade leading to the expression of DsxF. Additionally, Sxl has been shown to be responsible for keeping X chromosome dosage compensation off in females, while males hyperactivate their X chromosome. While these roles have been well defined, the authors explore an aspect of Sxl that is quite separate from its role as master regulator of female fate. They describe Sxl-RAC, a Sxl isoform that is expressed in the male and female nervous system. Using several genomic techniques, the authors conclude that the Sxl-RAC isoform associates with chromatin in a similar pattern to the RNA polymerase II/III subunit, Polr3E, and Sxl depends on Polr3E for chromatin-association. Further, neuronal loss of Sxl causes changes in lifetime and geotaxis in a similar manner as loss of Polr3E. The work is thorough and significant and should be appropriate for publication if a few issues can be addressed.

      Major Concerns:

      1) How physiological is the Sxl chromatin-association assay? As binding interactions are concentration-dependent, how similar is Sxl-DAM expression to wt Sxl expression in neurons? In addition, does the Sxl-DAM protein function as a wt Sxl protein? Does UAS-Sxl-DAM rescue any Sxl loss phenotypes?

      Author response:

      As Reviewer 3 correctly notes, Targeted DamID relies on ribosomal re-initiation (codon slippage) to produce only trace amounts of the Dam-fusion protein. By design, this results in expression levels that are significantly lower than those of the endogenous protein. As such, the experiment can be interpreted within a near–wild-type context, rather than as an overexpression model. The primary aim of this experiment was to determine whether Sxl associates with chromatin, and our dataset provides clear evidence supporting such binding.

      2) Is Polr3E chromatin-association also dependent on Sxl? They should do the reciprocal experiment to their examination of Sxl chromatin-association in Polr3E knockdown. This might also help address point 1-if wt Sxl is normally required for aspects of Polr3E chromatin binding, then concerns about whether the Sxl-DAM chromatin-association is real or artifactual would be assuaged.

      Author response:

      This is an interesting thought; however, if Sxl were required for Polr3E recruitment to RNA Pol III, then, in most male Drosophila melanogaster cells, Polr3E would not be incorporated, and males would not be viable (as it is essential for Pol III activity). While it is possible that there could be a subtle effect on Polr3E recruitment, such an experiment would not alter the central conclusion of our study - that Sxl is recruited to chromatin (accessory to the Pol III complex) via Polr3E.

      Minor concerns:

      The observed Sxl loss of function phenotypes are somewhat subtle (although perhaps any behavior phenotype at all is a plus). Did they try any other behaviour assays-courtship, learning/memory, anything else at all to test nervous system function?


      Author response:

      Given the exploratory nature of this study, we focused on broader behavioural and transcriptional assays.

      While well written, it is sometimes difficult to understand how the experiment was performed or what genotypes were used without looking into the methods sections. One example is they should describe the nature of the Sxl-DAM fusion protein clearly in the results.

      Author response:

      We will revise these sections to improve clarity and ensure there is no confusion.

      Reviewer #1 (Significance (Required)):

      This manuscript represents a dramatic change in our thinking about the action of the Sex-lethal protein. Previously, Sxl was known as the master regulator of both sex determination and dosage compensation, and performed these roles as an RNA-binding protein affecting RNA splicing and translational regulation. Here, the authors describe a sex-non-specific role of Sxl in the male and female nervous system. Further, this activity appears independent of Sxl's RNA binding activity and instead Sxl functions as a chromatin-associating protein working with the RNA pol2/3 factor Polr3E to regulate gene expression. Thus, this represents a highly significant finding.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Summary: In this paper, the authors report on an unexpected activity for Sex lethal (Sxl) (a known splicing regulator that functions in sex determination and dosage compensation) in binding to chromatin. They show, using DamID, that Sxl binds to approximately the same chromatin regions as Polr3E (a subunit of RNA Pol III). They show that this binding to chromatin is unaffected by mutations in the RNA binding domains or by deletions of either N or C terminal regions of the Sxl protein. This leads the authors to conclude that Sxl must bind to chromatin through some interacting protein working through the central region of the Sxl protein. They show that Sxl binding is dependent on Polr3E function. They show that male-specific neuronal knockdown of Sxl gives similar phenotypes to knockdown of Polr3E in terms of lethality and improved negative geotaxis. They show gene expression changes with knockdown of Sxl in male adult neurons - mainly that metabolic and pigmentation genes go down in expression. They also show that expression of a previously discovered male adult specific form of Sxl (that does not have splicing activity) in the same neurons also leads to changes in gene expression, including more upregulated than downregulated tRNAs. But they don't see (or don't show) that the same tRNA genes are down with knockdown of Sxl. Nonetheless, based on these findings, they suggest that Sxl plays an important role in regulating Pol III activity through the Polr3E subunit.

      Major comments:

      To be honest, I'm not convinced that the conclusions drawn from this study are correct. The fact that every mutant form of Sxl shows the same result from the DamID labelling is a little concerning. I would like to see independent evidence of the SxlRac protein binding chromatin.

      Do antibodies against this form (or any form) of Sxl bind chromatin in salivary gland polytene chromosomes, for example? Does Sxl from other insects where Sxl has no role in sex determination bind chromatin?


      Author Response:

      Regarding the reviewer’s overall concerns about the legitimacy of the Sxl binding data:

      i) The fold differences between Dam-Sxl-mutants and the Dam-only control are very robust (up to a log2 fold change of 9, i.e. roughly 500-fold), which is higher than what we observe with most transcription factors using Targeted DamID.
      ii) We observed that Sxl binding was significantly reduced upon knockdown of Polr3E, confirming that the signal we observe is biologically specific and not due to technical noise or background.
      iii) If the concern relates to potential Sxl binding in non-neuronal tissues such as salivary glands, we would like to clarify that all DamID constructs were expressed under elav-GAL4, a pan-neuronal driver. Furthermore, dissections were performed to isolate larval brains, with salivary glands carefully removed. This ensures that chromatin profiles were derived from neuronal tissue exclusively.
      iv) Salivary gland polytene chromosome staining with a Sxl antibody in a closely related species (Drosophila virilis) shows binding of Sxl to chromatin in both sexes (Bopp et al., 1996).

      We will include more text in the revised manuscript to emphasise these points.
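      As a quick check on the fold-change figure quoted in point i), a log2 fold change converts to a linear fold change as 2 raised to that value. The snippet below is purely illustrative arithmetic; the function name is hypothetical and is not part of any DamID analysis pipeline.

```python
def log2fc_to_fold(log2_fold_change):
    """Convert a log2 fold change into a linear fold change."""
    return 2.0 ** log2_fold_change


# A log2 fold change of 9 corresponds to 2**9 = 512, i.e. roughly 500-fold.
fold_at_nine = log2fc_to_fold(9)
```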

      Do antibodies against this form (or any form) of Sxl bind chromatin in salivary gland polytene chromosomes, for example? Does Sxl from other insects where Sxl has no role in sex determination bind chromatin?

      Author Response:

      Prior work in Drosophila virilis (where Sxl is also required for sex determination and Sxl-RAC is conserved) has already demonstrated Sxl-chromatin association (using a full-length Sxl antibody) in salivary glands using polytene chromosome spreads (Bopp et al., 1996). Binding is observed in both sexes and across the genome, reflecting our observations. We will incorporate this into the revised discussion to support the chromatin-binding role of Sxl across species.

      There is a clear and long-overlooked precedent for Sxl's alternative, sex-independent roles, findings that have been largely overshadowed by the gene’s canonical function. Our study not only validates and extends these observations but also brings much-needed attention to this understudied aspect of fundamental biology.

      Bopp D, Calhoun G, Horabin JI, Samuels M, Schedl P. Sex-specific control of Sex-lethal is a conserved mechanism for sex determination in the genus Drosophila. Development. 1996 Mar;122(3):971-82. doi: 10.1242/dev.122.3.971. PMID: 8631274.

      I would like to see independent evidence of the SxlRac protein binding chromatin.

      Author Response:

      We do not believe this is necessary:

      i) Our data demonstrated that a large N-terminal truncation of Sxl (removing far more of the N-terminal region than is absent in Sxl-RAC) does not impair chromatin binding.
      ii) Our deletion experiments show that it is the central domain of Sxl that is required for chromatin association (as removal of the N- or C-terminal domain has no effect). This central domain is unaffected in Sxl-RAC.
      iii) Independent Y2H experiments have shown that it is exclusively the RBD-1 (RNA binding domain 1) of the central domain of Sxl that interacts with Polr3E (Dong et al., 1999). Sxl-RAC contains this region and will therefore be recruited by Polr3E.
      iv) Reviewer 3 also believes that this is not necessary (see cross-review below) and highlights the robustness of the Y2H experiments performed by Dong et al., 1999.

      Also, given that their DamID experiments reveal that Sxl binds half of the genes encoded in the Drosophila genome, finding that it binds around half of the tRNA genes is perhaps not surprising.


      Author Response:

      Our data show that Sxl binds to a range of Pol III-transcribed loci, and this binding pattern supports the proposed model that Sxl plays a broader regulatory role in Pol III activity. Within these Pol III targets, tRNA genes represent a specific and biologically relevant subset. The emphasis on tRNAs is not to suggest they are the exclusive or primary targets of Sxl, but rather to highlight a functionally important class of Pol III-transcribed elements that aligns with the model we are proposing. We will revise the text to better reflect this framing and avoid any confusion regarding the scope of Sxl’s binding profile.

      I would like to see evidence beyond citing a 1999 yeast two-hybrid study that Sxl and Polr3E directly interact with one another.


      Author response:

      We do not believe this is necessary (these points were also mentioned above):

      i) The Dong et al., 1999 study was highly comprehensive in its characterisation of Sxl binding to Polr3E.
      ii) Our DamID data provide strong complementary evidence for this interaction: knockdown of Polr3E robustly reduces Sxl’s recruitment to chromatin, strongly supporting the relevance of the interaction in vivo.
      iii) Reviewer 3 highlights the robustness of the Y2H experiments performed by Dong et al., 1999.

      In my opinion, the differences in lethality observed with loss of Sxl versus control are unlikely to be meaningful given the different genetic backgrounds. The similar defects in negative geotaxis could be meaningful, but I'm unsure how often this phenotype is observed. What other class of genes affect negative geotaxis? It's a little unclear why having reduced expression of metabolic and pigment genes or of tRNAs would improve neuronal function.


      Author response:

      While the differences in survival were indeed subtle, they were statistically significant and thus warranted inclusion. Our primary aim in this section was to demonstrate that knockdown of Sxl or Polr3E results in comparable behavioural and transcriptional phenotypes, suggesting overlapping functional roles. In this context, we believe the data were presented transparently and effectively support our interpretation.

      Regarding the negative geotaxis phenotype, we appreciate the reviewer’s interest and agree that it is both intriguing and atypical. For this reason, we performed the assay multiple times, particularly in Polr3E knockdowns, to confirm the robustness of the result. To address potential confounding variables, we carefully selected control lines that account for genetic background and transgene insertion site, including KK controls and attP40-matched lines. We also employed multiple independent RNAi lines targeting Sxl to validate the phenotype across different genetic backgrounds.

      Although the observed improvement in climbing is unexpected, it is not without precedent in the RNA polymerase III field. Notably, Malik et al. (2024) demonstrated that heterozygous Polr3DEY/+ mutants exhibit a significantly delayed decline in climbing ability with age. We allude to this in the discussion and will revise the text to emphasise this connection more explicitly.

      Finally, while we recognise that negative geotaxis is a relatively broad assay and thus does not pinpoint the precise cellular mechanisms involved, we interpret the phenotype as suggesting a neural basis and a functional role for Sxl in the nervous system.

      One would expect that not just the same classes of genes would be affected by loss and overexpression of Sxl, but the same genes would be affected - are the same genes changing in opposite directions in the two experiments or just the same classes of genes. Likewise, are the same genes changing expression in the same direction with both Sxl and the Polr3E loss? Also, why are tRNA genes not also affected with Sxl loss. Finally, they describe the changes in gene expression as being in male adult neurons, but the sequencing was done of entire heads - so no way of knowing which cell type is showing differential gene expression.

      Author response:

      While we do examine gene classes, our approach also includes pairwise correlation analyses of gene expression changes between specific genotypes. Notably, we observed a significant positive correlation between Polr3E knockdowns and Sxl knockdowns, and a significant negative correlation between Sxl-RAC–expressing flies and Sxl knockdowns. Furthermore, we examined Sxl-DamID target genes within our RNA-seq datasets and found a consistent relationship between Sxl targets and genes differentially expressed in Polr3E knockdowns.
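      To make the idea of a pairwise correlation between genotypes concrete, the sketch below computes a Pearson correlation over the log2 fold changes of genes shared by two comparisons. It is an illustration under stated assumptions, not the analysis used in the manuscript: the function names, gene labels, and fold-change values are all hypothetical, and the choice of Pearson (rather than, say, a rank-based coefficient) is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


def correlate_fold_changes(lfc_a, lfc_b):
    """Correlate log2 fold changes over genes present in both comparisons."""
    shared_genes = sorted(set(lfc_a) & set(lfc_b))
    return pearson_r(
        [lfc_a[gene] for gene in shared_genes],
        [lfc_b[gene] for gene in shared_genes],
    )


# Hypothetical log2 fold changes keyed by placeholder gene labels.
knockdown_one = {"gene1": -1.0, "gene2": 0.5, "gene3": 2.0}
knockdown_two = {"gene1": -2.0, "gene2": 1.0, "gene3": 4.0, "gene4": 0.3}
overexpression = {"gene1": 1.0, "gene2": -0.5, "gene3": -2.0}

r_same_direction = correlate_fold_changes(knockdown_one, knockdown_two)
r_opposite_direction = correlate_fold_changes(knockdown_one, overexpression)
```

      A coefficient near +1 indicates that the same genes move in the same direction in both comparisons, and a coefficient near -1 that they move in opposite directions, which is the gene-level question the reviewer raises.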

      Regarding the Pol III qPCR results, we note that tRNA expression changes may require a longer duration of RNAi induction (e.g., beyond 4 days) to become apparent, especially given that phenotypic effects such as changes in lifespan and negative geotaxis only emerge after 20 days or more. It is also plausible that Sxl knockdown leads to a partial reduction in Pol III efficiency, which may not be readily detectable through bulk Pol III qPCRs. We are willing to repeat Pol III qPCRs at later timepoints to further investigate this trend.

      Finally, we infer that gene expression changes observed in our RNA-seq data are of neuronal origin, as all knockdown and overexpression constructs used in this study were driven pan-neuronally using elav-/nSyb-GAL4. While we acknowledge that bulk RNA-seq does not provide cell-type resolution, tissue-specific assumptions are widely used in the field when driven by a relevant promoter.

      I'm also not sure what I'm supposed to be seeing in panel 5F (or in the related supplemental figure) and if it has any meaning - If they are using the Sxl-T2A-Gal4 to drive mCherry, I think one would expect to see expression since Sxl transcripts are made in both males and in females. Also, one would expect to see active protein expression (OPP staining) in most cells of the adult male brain and I think that is what is observed, but again, I'm not sure what I'm supposed to be looking at given the absence of any arrows or brackets in the figures.

      Author Response:

      Due to the presence of the T2A tag and the premature stop codon in exon 3 of early male Sxl transcripts, GAL4 expression is not expected in males unless the head-specific SxlRAC isoform is produced. The aim of panel 5F is to demonstrate the spatial overlap between SxlRAC expression (as we are examining male brains) and regions of elevated protein synthesis, as detected by OPP staining.

      To quantitatively assess this relationship, we performed colocalisation analysis using ImageJ, which showed a positive correlation between Sxl and OPP signal intensity, supporting this interpretation. It is also evident from our images that regions with lower levels of protein synthesis (such as the neuropil, as shown independently by Villalobos-Cantor et al., 2023) concurrently lack Sxl-related signal. We have highlighted regions in Fig. 5 exhibiting higher/lower levels of Sxl/OPP signal to better illustrate this relationship. We can also test the effects of knockdown/overexpression on general protein synthesis if required.

      Villalobos-Cantor S, Barrett RM, Condon AF, Arreola-Bustos A, Rodriguez KM, Cohen MS, Martin I. Rapid cell type-specific nascent proteome labeling in Drosophila. Elife. 2023 Apr 24;12:e83545. doi: 10.7554/eLife.83545. PMID: 37092974; PMCID: PMC10125018.

      Minor comments:

      Line 223 - 225 - I believe that it is expected that Sxl transcripts would be broadly expressed in the male and female adult, given that it is only the spliced form of the transcript that is female specific in expression.

      As explained above, the only isoform that will be ‘trapped’ by the T2A-GAL4 in males is the Sxl-RAC isoform (as the other isoforms contain premature stop codons). Our immunohistochemistry data indicate that Sxl-RAC is expressed in the male brain, specifically in neurons. Therefore, knockdown experiments in males will reduce all mRNA isoforms, of which Sxl-RAC is the only one that produces a protein.

      Line 236 - 238 - Sentence doesn't make sense.

      We have addressed and clarified this.

      Reviewer #2 (Significance (Required)):

      It would be significant to discover that a gene previously thought to function in only sex determination and dosage compensation also moonlights as a regulator of RNA polymerase III activity. Unfortunately, I am not convinced by the work presented in this study that this is the case.

      My expertise is in Drosophila biology, including development, transcription, sex determination, morphogenesis, genomics, transcriptomics, DNA binding

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      Storer, McClure and colleagues use genome-wide DNA-protein binding assays, transcriptomics, and genetics to work out that Drosophila Sxl, widely known as an RNA-binding protein which functions as a splicing factor to determine sex identity in Drosophila and related species, is also a chromatin factor that can stimulate transcription by Pol III and Pol II of genes involved with metabolism and protein homeostasis, specifically some encoding tRNAs.

      The evidence for the tenet of the paper -- that Sxl acts as a chromatin regulator with Polr3E, activating at least some of its targets with either Pol III or Pol II -- is logical and compelling, the paper is well written and the figures well presented. Of course, more experiments could always be wished for and proposed, but I think this manuscript could be published in many journals with just a minor revision not involving additional experiments. I have a few specific comments below, all minor.

      Scientific points: - The approach taken for the evaluation of Sxl DNA-binding activity in Fig2 is not entirely clear. I assume these are crosses of elav-Gal4 x different UAS- lines, then using males or females for UAS-Sxl-Full-Length. But what about the others? Were the experiments done in males only? This is hinted at in the main text but not explicitly indicated in the figure or the methods (at least, that I could easily find). And is this approach extended to all other experiments? Longevity? Climbing assays? Considering the role of Sxl, it may be helpful to be fastidiously systematic with this.


      Author Response:

      We have revised the wording to ensure greater clarity. Males were used for all survival and behavioural experiments (as only males can be leveraged for knocking down Sxl-RAC without affecting the canonical Sxl-F isoform).

      - In the discussion, lines 360-61, the authors say: Indeed, knockdown of Polr3E leads to a loss of Sxl binding to chromatin, suggesting a cooperative mechanism. Maybe I am misunderstanding the authors, but when I read "cooperation" in this context I think of biochemical cooperative binding. This is possible, but I do not think a simple 'requirement' test can suggest specifically that this mechanistic feature of biochemical binding is at play. I would expect, for starters, a reciprocal requirement for binding (which is not tested), and some quantitative features that would be difficult to evaluate in vivo. I do not think cooperative binding needs to be invoked anyway, as the authors do not make any specific point or prediction about it. But if they do think this is going on, I think it would need to be referred to as a speculation.


      Author Response:

      We appreciate that the original wording may have been unclear and will revise the text to more accurately reflect a functional relationship, rather than implying direct cooperation.

      - In lines 428-432, the authors discuss the ancestral role of Sxl and make a comparison with ELAV, in the context of an RNA-binding protein that has molecular functions beyond those of a splicing factor, considering the functions of ELAV in RNA stability and translation, and finishing with "suggesting that similar regulatory mechanisms may be at play". I do not understand this latter sentence. Which mechanisms are these? Are the authors referring to the molecular activities of ELAV and SXL? But what would be the similarity? SXL seems to have a dual capacity to bind RNA and protein interactors, which allows it to work both in chromatin-level regulation as well as post-transcriptionally in splicing; but ELAV seems rather to take advantage of its RNA binding function to make it work in multiple RNA-related contexts, all post-transcriptional. I do not see an obvious parallel beyond the fact that RNA binding proteins can function at different levels of gene expression regulation -- but I would not say this parallel are "similar regulatory mechanisms", so I find the whole comparison a bit confusing.


      Author Response:

      We have reduced this section, as it is largely speculative and intended to highlight potential, though indirect, links in higher organisms. Our goal was primarily to illustrate the possibility that Sxl may have an ancestral role distinct from its well-characterised function, and to suggest a potential avenue for future research into ELAV2’s involvement in chromatin or Pol III regulation.

      - One aspect of the work that I find is missing in the discussion is the possibility that the simultaneous capacity of Sxl for RNA binding and Polr3E binding: are these mutually exclusive? if so, are they competitive or hierarchical? how would they be coordinated anyway?


      Author Response:

      This is an interesting point, and we have expanded on it further in the Discussion section.

      - The only aspect of the paper where I found that one could make an experimental improvement is the claim that Sxl induces the expression of genes that have the overall effect of stimulating protein synthesis. The OPP experiment shows a correlation between the expression of Sxl and the rate of protein synthesis initiation. However, a more powerful experiment would be, rather obviously, to introduce Sxl knock-down in the same experiment, and observe whether in Sxl-expressing neurons the incorporation of OPP is reduced. I put this forth as a minor point because the tenet of the paper would not be affected by the results (though the perception of importance of the newly described function could be reinforced).

      Author Response:

      This could be a valid experiment and we are prepared to perform it if required.

      - In a similar way, it would be interesting to know whether the recruitment of Polr3E and Sxl to chromatin is co-dependent or Sxl follows Polr3E. This is also a minor point because this would possibly refine the mechanism of recruitment but does not alter the main discovery.

      Author Response:

      We have addressed a similar point for Reviewer 2 (see below) and will include a Discussion point for this:

      If Sxl were required for Polr3E recruitment to RNA Pol III, then, in most male Drosophila melanogaster cells, Polr3E would not be incorporated, and males would not be viable (as it is essential for Pol III activity). While it is possible that there could be a subtle effect on Polr3E recruitment, such an experiment would not alter the central conclusion of our study - that Sxl is recruited to chromatin (accessory to the Pol III complex) via Polr3E.

      Figures and reporting:

      • In Figure 2, it would be helpful to see the truncation coordinate for the N and C truncations.

      • In Figure 3D, genomic coordinates are missing.

      • In Figure 3E, the magnitude in the Y axis is not entirely clear (at least not to me). How is the amount of binding across the genome quantified? Is this the average amplitude of normalised TaDa signal across the genome? Or only within binding intervals?

      • Figure S3E-F: it would be interesting to show the degree of overlap between the downregulated genes that are also binding targets (regardless of the outcome).

      • Figure 5C-E: similarly to Figure S3, it would be interesting to know how the transcriptional effects compare with the binding targets.

      • Authors use Gehan-Breslow-Wilcoxon to test survival, which is a bit unusual, as it gives more weight to the early deaths (which are rare in most Drosophila longevity experiments). Is there any rationale behind this? It may even favour their null hypothesis.


      Author response:

      Thank you for the detailed feedback on our figures. We have incorporated the suggested changes.

      We agree that examining the overlap between Sxl binding sites and transcriptional changes is valuable, and we aimed to highlight this in the pie charts shown in Figures S3 and S5. If the reviewer is suggesting a more explicit quantification of the proportion of Sxl-Dam targets with significant transcriptomic changes, we are happy to include this analysis in the final version of the manuscript.

      As noted in the Methods, both Gehan–Breslow–Wilcoxon (GBW) and Kaplan–Meier tests were used. The significance in Figure 4a is specific to the GBW test, which we indicated by describing the effect as mild. Our focus here is not on the magnitude of survival differences, but on the consistent trends observed in both Polr3e and Sxl knockdowns.

      Writing and language:

      • Introduction finishes without providing an outline of the findings (which is fine by me if that is what the authors wanted).

      • In lines 361-5, the authors say "We speculate that this interaction not only facilitates Pol III transcription but may also influence chromatin architecture and RNA Pol II-driven transcription as observed with Pol III regulation in other organisms". "This interaction" refers to Polr3E-Sxl-DNA interaction and with "Pol III transcription" I presume the authors refer to transcription executed by Pol III. I am not clear about the meaning of the end of the sentence "as observed with Pol III regulation in other organisms". What is the observation, exactly? That Pol III modifies chromatin in Pol II regulated loci, or that Pol III interactors change chromatin architecture?

      • DPE abbreviation is not introduced (and only used once).

      • A few typos: Line 41 ...splicing of the Sxl[late] transcripts, which is [ARE?] constitutively transcribed (Keyes et al.,... Line 76 ...sexes but appears restricted to the nervous system [OF] male pupae and adults (Cline et Line 289 ...and S41). To assess any effect [ON] translational output, O-propargyl-puromycin (OPP)o Line 323 ...illustrating that the majority (72%) changes in tRNA levels [ARE] due to upregulation...hi Line 402 ...it was discovered [WE DISCOVERED] Line 792 ...Sxl across chromosomes X, 2 L/R, 3 L/R and 4. The y-axis represents the log[SYMBOL] ratio... This happens in other figure legends as well.


      Author response:

      Thank you for the detailed feedback, we have clarified and incorporated the suggested changes.

      Referee Cross-commenting

      Reviewer 1 asks how physiological the Sxl chromatin-association assay is. I think the loss of association in Polr3E knock-down and the lack of association of other splicing factors go a long way towards answering this question. It is true that having positive binding data specifically for Sxl-RAC and negative binding data for a deletion mutant of the RRM domain would provide more robust conclusions (see below), but I am not sure it is completely necessary -- though this will depend on which journal the authors want to send the paper to.

      I think that the comment of reviewer 1 about the levels of expression of Sxl-DAM does not apply here because of the way TaDa works - it relies on codon slippage to produce minimal amounts of the DAM fusion protein, so by construction it will be expressed at much lower levels than the endogenous protein.

      Reviewer 1 also asks whether Polr3E chromatin-association is also dependent on Sxl, to round up the model and also as a way to address whether Sxl association to chromatin is real. While I agree with this on the former aim (this would be a nice-to-have), I think I disagree on the latter; there is no need for Polr3E recruitment to depend on Sxl for Sxl association to chromatin to be physiologically relevant. Polr3E is a peripheral component of Pol III and unlikely to depend on a factor of restricted expression like Sxl to interact with chromatin. The recruitment of Sxl could well be entirely 'hierarchical' and subject to Polr3E.

      Reviewer 2 is concerned with the fact that every mutant form of Sxl shows the same result from the DamID labelling. I have to agree with this to a point. A deletion mutant of RRM domains would address this. Microscopy evidence in salivary glands would be nice, certainly, but the system may not lend itself to this particular interaction, which might be short-lived and/or weak. I do not immediately see the relevance of the chromatin binding capacity of non-Drosophilidae Sxl -- though it might indicate that the impact of the discovery is less likely to go beyond this group.

      Reviewer 2 does not find it surprising that some tRNA genes (less than half) are regulated by Sxl. I think the value of that observation is just qualitative, as tRNAs are Pol III-produced transcripts, but their point is correct. A hypergeometric test could settle this.

      Reviewer 2 is concerned that the evidence of direct interaction between Sxl and Polr3E is a single 1999 two-hybrid study. But that paper also contains GST pull-downs that narrow down the specific domains that mediate binding, and performs the binding in competitive salt conditions. I think it is enough. The author team, I think, are not biochemists, so finding the right collaborators and performing these experiments would take time that I am not sure is warranted.

      Reviewer 2 is also concerned that the longevity assays may not be meaningful due to the difference in genetic backgrounds. This is a very reasonable concern (which I would extend to the climbing assays - any quantitative phenotype is sensitive to genetic background). However, I think the authors here may have already designed the experiment with this in mind - the controls express untargeted RNAi constructs, but I lose track of which one is control of which. This should be clarified in Methods.

      Other comments are in line, I think, with what I have pointed out and I generally agree with everything else that has been said.

      Reviewer #3 (Significance (Required)):

      Drosophila Sxl is widely known as an RNA-binding protein which functions as a splicing factor to determine sex identity in Drosophila and related species. It is a favourite example of how splicing factors and alternative splicing can have a profound influence in biology and be used cleverly in the molecular circuitry of the cell to enact elegant regulatory decisions.

      In this work, Storer, McClure and colleagues use genome-wide DNA-protein binding assays, transcriptomics, and genetics to work out that Sxl is also a chromatin factor with a sex-independent, neuron-specific role in stimulating transcription by Pol III and Pol II, of genes involved with metabolism and protein homeostasis, including some encoding tRNAs.

      This opens a large number of interesting biological questions that range from biochemistry, gene regulation or neurobiology to evolution. How is the simultaneous capacity of binding RNA and chromatin (with the same protein domain, RRM) regulated/coordinated? How did this dual activity evolve and which one is the ancestral one? How many other RRM-containing RNA-binding proteins can also bind chromatin? How is Sxl recruited to chromatin to both Pol II and Pol III targets and are they functionally related? If so, how is the coordination of cellular functions activated through different RNA polymerases taking place and what is the role of Sxl in this? What are the functional consequences to neuronal biology? Does this affect similarly all Sxl-expressing neurons?

      The evidence for the central tenet of the paper -- that Sxl acts as a chromatin regulator with Polr3E, activating at least some of its targets with either Pol III or Pol II -- is logical and compelling, the paper is well written and the figures well presented. Of course, more experiments could always be wished for and proposed, but I think this manuscript could be published in many journals with just a minor revision not involving additional experiments.

      Reviewer #4 (Evidence, reproducibility and clarity (Required)):

      The convincing analysis demonstrates a role for the Drosophila sex-determining gene Sex-lethal in controlling aspects of transcription in the nervous system independent of its role in splicing. Interaction with an RNA Pol III subunit mediating Sxl association with chromatin and similar knockdown phenotypes strongly support the role of Sxl in the regulation of neuronal metabolism. Given that Sxl is an evolutionarily recent acquisition for sex determination, the study may reveal an ancestral role for Sxl.

      The conclusions are well justified by the datasets presented and I have no issues with the study or the interpretation. Throughout the work is well referenced, though perhaps the authors might take a look at Zhang et al (2014) (PMID: 24271947) for an interesting evolutionary perspective for the discussion.

      Author Response:

      Thank you for the thoughtful suggestion. We will be sure to incorporate the findings from Zhang et al. regarding the evolution of the sex determination pathway.

      I have some minor comments for clarification:

      There is no Figure 2b; the figure should be labelled 2, or the TaDa plots labelled as 2b.

      Clarify if Fig 2 data are larval or adult.

      Larval

      Fig 3d - are these replicates or female and male?

      Please elaborate on tub-GAL80[ts] developmental defects

      Fig 4e, are transcriptomics done with the VDRC RNAi line? The VDRC and BDSC RNAi lines exhibit different behaviours - the former has "better" survival and better negative geotaxis, the latter seems to have poorer survival but little geotaxis effect?

      Fig S3 - volcano plot for Polr3E?

      Fig S4a - legend says downregulated genes?

      The discussion should at least touch on the fact that Sxl amorphs (i.e. Sxl[fP7B0]) are male viable and fertile, emphasising that the newly uncovered role is not essential.

      Author Response:

      We agree with the suggestions outlined in the comments and have made the appropriate revisions.

      Reviewer #4 (Significance (Required)):

      A nonessential role for Sxl in the nervous system independent of sex-determination contributes to better understanding a) the evolution of sex determining mechanisms, b) the role of RNA PolIII in neuronal homeostasis and c) more widely to the neuronal aging field. I think this well-focused study reveals a hitherto unsuspected role for Sxl.

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.



      Referee #3

      Evidence, reproducibility and clarity

      Storer, McClure and colleagues use genome-wide DNA-protein binding assays, transcriptomics, and genetics to work out that Drosophila Sxl, widely known as an RNA-binding protein which functions as a splicing factor to determine sex identity in Drosophila and related species, is also a chromatin factor that can stimulate transcription by Pol III and Pol II of genes involved with metabolism and protein homeostasis, specifically some encoding tRNAs.

      The evidence for the tenet of the paper -- that Sxl acts as a chromatin regulator with Polr3E, activating at least some of its targets with either Pol III or Pol II -- is logical and compelling, the paper is well written and the figures well presented. Of course, more experiments could always be wished for and proposed, but I think this manuscript could be published in many journals with just a minor revision not involving additional experiments. I have a few specific comments below, all minor.

      Scientific points:

      • The approach taken for the evaluation of Sxl DNA-binding activity in Fig2 is not entirely clear. I assume these are crosses of elav-Gal4 x different UAS- lines, then using males or females for UAS-Sxl-Full-Length. But what about the others? Were the experiments done in males only? This is hinted at in the main text but not explicitly indicated in the figure or the methods (at least, that I could easily find). And is this approach extended to all other experiments? Longevity? Climbing assays? Considering the role of Sxl, it may be helpful to be fastidiously systematic with this.
      • In the discussion, lines 360-61, the authors say: "Indeed, knockdown of Polr3E leads to a loss of Sxl binding to chromatin, suggesting a cooperative mechanism." Maybe I am misunderstanding the authors, but when I read "cooperation" in this context I think of biochemical cooperative binding. This is possible, but I do not think a simple 'requirement' test can suggest specifically that this mechanistic feature of biochemical binding is at play. I would expect, for starters, a reciprocal requirement for binding (which is not tested), and some quantitative features that would be difficult to evaluate in vivo. I do not think cooperative binding needs to be invoked anyway, as the authors do not make any specific point or prediction about it. But if they do think this is going on, I think it would need to be referred to as a speculation.
      • In lines 428-432, the authors discuss the ancestral role of Sxl and make a comparison with ELAV, in the context of an RNA-binding protein that has molecular functions beyond those of a splicing factor, considering the functions of ELAV in RNA stability and translation, and finishing with "suggesting that similar regulatory mechanisms may be at play". I do not understand this latter sentence. Which mechanisms are these? Are the authors referring to the molecular activities of ELAV and SXL? But what would be the similarity? SXL seems to have a dual capacity to bind RNA and protein interactors, which allows it to work both in chromatin-level regulation as well as post-transcriptionally in splicing; but ELAV seems rather to take advantage of its RNA binding function to make it work in multiple RNA-related contexts, all post-transcriptional. I do not see an obvious parallel beyond the fact that RNA binding proteins can function at different levels of gene expression regulation -- but I would not say this parallel amounts to "similar regulatory mechanisms", so I find the whole comparison a bit confusing.
      • One aspect of the work that I find missing from the discussion is the simultaneous capacity of Sxl for RNA binding and Polr3E binding: are these mutually exclusive? If so, are they competitive or hierarchical? How would they be coordinated anyway?
      • The only aspect of the paper where I found that one could make an experimental improvement is the claim that Sxl induces the expression of genes that have the overall effect of stimulating protein synthesis. The OPP experiment shows a correlation between the expression of Sxl and the rate of protein synthesis initiation. However, a more powerful experiment would be, rather obviously, to introduce Sxl knock-down in the same experiment, and observe whether in Sxl-expressing neurons the incorporation of OPP is reduced. I put this forth as a minor point because the tenet of the paper would not be affected by the results (though the perception of importance of the newly described function could be reinforced).
      • In a similar way, it would be interesting to know whether the recruitment of Polr3E and Sxl to chromatin is co-dependent or Sxl follows Polr3E. This is also a minor point because this would possibly refine the mechanism of recruitment but does not alter the main discovery.

      Figures and reporting:

      • In Figure 2, it would be helpful to see the truncation coordinate for the N and C truncations.
      • In Figure 3D, genomic coordinates are missing.
      • In Figure 3E, the magnitude in the Y axis is not entirely clear (at least not to me). How is the amount of binding across the genome quantified? Is this the average amplitude of normalised TaDa signal across the genome? Or only within binding intervals?
      • Figure S3E-F: it would be interesting to show the degree of overlap between the downregulated genes that are also binding targets (regardless of the outcome).
      • Figure 5C-E: similarly to Figure S3, it would be interesting to know how the transcriptional effects compare with the binding targets.
      • Authors use Gehan-Breslow-Wilcoxon to test survival, which is a bit unusual, as it gives more weight to the early deaths (which are rare in most Drosophila longevity experiments). Is there any rationale behind this? It may even favour their null hypothesis.
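The weighting point in the last bullet can be illustrated with a minimal sketch. It assumes the standard weighted log-rank form, in which each event time contributes (observed minus expected deaths) multiplied by a weight: 1 for the log-rank test, and the number of animals still at risk for Gehan-Breslow-Wilcoxon. The function name `weighted_logrank_chi2`, the no-censoring simplification, and the toy lifespans are all hypothetical and are not taken from the manuscript.

```python
def weighted_logrank_chi2(deaths_a, deaths_b, gehan=False):
    """Chi-square statistic of a two-group weighted log-rank test.

    deaths_a, deaths_b: lists of death times (days); every animal is
    assumed to be followed until death (no censoring), for simplicity.
    gehan=False -> weight 1 at each event time (log-rank test)
    gehan=True  -> weight n_j, the number still at risk (Gehan-Breslow)
    """
    u = 0.0  # weighted sum of (observed - expected) deaths in group A
    v = 0.0  # variance of that sum under the null hypothesis
    for t in sorted(set(deaths_a) | set(deaths_b)):
        n_a = sum(1 for x in deaths_a if x >= t)  # at risk in A just before t
        n_b = sum(1 for x in deaths_b if x >= t)  # at risk in B just before t
        n = n_a + n_b
        d_a = deaths_a.count(t)                   # deaths in A at t
        d = d_a + deaths_b.count(t)               # deaths in both groups at t
        w = n if gehan else 1
        u += w * (d_a - d * n_a / n)
        if n > 1:
            v += w ** 2 * d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    return u * u / v


# Hypothetical lifespans: the two groups differ only in two early deaths.
group_a = [5, 6, 50, 51, 52, 53]
group_b = [48, 49, 50, 51, 52, 53]

print("log-rank chi2:", weighted_logrank_chi2(group_a, group_b))
print("Gehan chi2:   ", weighted_logrank_chi2(group_a, group_b, gehan=True))
```

Because the at-risk count is largest at the start of the experiment, the Gehan weighting amplifies early differences between the curves and discounts late ones, which is why the choice of test matters when early deaths are rare.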

      Writing and language:

      • Introduction finishes without providing an outline of the findings (which is fine by me if that is what the authors wanted).
      • In lines 361-5, the authors say "We speculate that this interaction not only facilitates Pol III transcription but may also influence chromatin architecture and RNA Pol II-driven transcription as observed with Pol III regulation in other organisms". "This interaction" refers to Polr3E-Sxl-DNA interaction and with "Pol III transcription" I presume the authors refer to transcription executed by Pol III. I am not clear about the meaning of the end of the sentence "as observed with Pol III regulation in other organisms". What is the observation, exactly? That Pol III modifies chromatin in Pol II regulated loci, or that Pol III interactors change chromatin architecture?
      • DPE abbreviation is not introduced (and only used once).
      • A few typos: Line 41 ...splicing of the Sxl[late] transcripts, which is [ARE?] constitutively transcribed (Keyes et al.,... Line 76 ...sexes but appears restricted to the nervous system [OF] male pupae and adults (Cline et Line 289 ...and S41). To assess any effect [ON] translational output, O-propargyl-puromycin (OPP)o Line 323 ...illustrating that the majority (72%) changes in tRNA levels [ARE] due to upregulation...hi Line 402 ...it was discovered [WE DISCOVERED] Line 792 ...Sxl across chromosomes X, 2 L/R, 3 L/R and 4. The y-axis represents the log[SYMBOL] ratio... This happens in other figure legends as well.

      Referee Cross-commenting

      Reviewer 1 asks how physiological the Sxl chromatin-association assay is. I think the loss of association in Polr3E knock-down and the lack of association of other splicing factors go a long way towards answering this question. It is true that having positive binding data specifically for Sxl-RAC and negative binding data for a deletion mutant of the RRM domain would provide more robust conclusions (see below), but I am not sure it is completely necessary -- though this will depend on which journal the authors want to send the paper to.

      I think that the comment of reviewer 1 about the levels of expression of Sxl-DAM does not apply here because of the way TaDa works - it relies on codon slippage to produce minimal amounts of the DAM fusion protein, so by construction it will be expressed at much lower levels than the endogenous protein.

      Reviewer 1 also asks whether Polr3E chromatin-association is also dependent on Sxl, to round up the model and also as a way to address whether Sxl association to chromatin is real. While I agree with this on the former aim (this would be a nice-to-have), I think I disagree on the latter; there is no need for Polr3E recruitment to depend on Sxl for Sxl association to chromatin to be physiologically relevant. Polr3E is a peripheral component of Pol III and unlikely to depend on a factor of restricted expression like Sxl to interact with chromatin. The recruitment of Sxl could well be entirely 'hierarchical' and subject to Polr3E.

      Reviewer 2 is concerned with the fact that every mutant form of Sxl shows the same result from the DamID labelling. I have to agree with this to a point. A deletion mutant of RRM domains would address this. Microscopy evidence in salivary glands would be nice, certainly, but the system may not lend itself to this particular interaction, which might be short-lived and/or weak. I do not immediately see the relevance of the chromatin binding capacity of non-Drosophilidae Sxl -- though it might indicate that the impact of the discovery is less likely to go beyond this group.

      Reviewer 2 does not find it surprising that some tRNA genes (less than half) are regulated by Sxl. I think the value of that observation is just qualitative, as tRNAs are Pol III-produced transcripts, but their point is correct. A hypergeometric test could settle this.
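The hypergeometric test mentioned above could be sketched as follows. This is a minimal illustration rather than the authors' analysis: the function name `hypergeom_sf` and every gene count below are hypothetical placeholders, and the real values would have to come from the binding and transcriptomic datasets.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for a hypergeometric variable X.

    population: all genes considered (e.g. all expressed genes)
    successes:  genes in the category of interest (e.g. tRNA genes)
    draws:      genes in the selected set (e.g. Sxl-regulated genes)
    k:          observed overlap between the category and the selected set
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only.
n_genes = 14000      # genes in the background set
n_trna = 300         # tRNA genes in the background set
n_regulated = 1000   # genes called as Sxl-regulated
n_overlap = 120      # tRNA genes among the Sxl-regulated genes

p_value = hypergeom_sf(n_overlap, n_genes, n_trna, n_regulated)
print(f"P(overlap >= {n_overlap}) = {p_value:.3g}")
```

A small p-value would indicate that tRNA genes are over-represented among the regulated genes relative to chance, which is the quantitative question behind Reviewer 2's comment. The result depends strongly on the choice of background set, so that choice should be reported alongside the test.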

      Reviewer 2 is concerned that the evidence of direct interaction between Sxl and Polr3E is a single 1999 two-hybrid study. But that paper also contains GST pull-downs that narrow down the specific domains that mediate binding, and performs the binding in competitive salt conditions. I think it is enough. The author team, I think, are not biochemists, so finding the right collaborators and performing these experiments would take time that I am not sure is warranted.

      Reviewer 2 is also concerned that the longevity assays may not be meaningful due to the difference in genetic backgrounds. This is a very reasonable concern (which I would extend to the climbing assays - any quantitative phenotype is sensitive to genetic background). However, I think the authors here may have already designed the experiment with this in mind - the controls express untargeted RNAi constructs, but I lose track of which one is control of which. This should be clarified in Methods.

      Other comments are in line, I think, with what I have pointed out and I generally agree with everything else that has been said.

      Significance

      Drosophila Sxl is widely known as an RNA-binding protein which functions as a splicing factor to determine sex identity in Drosophila and related species. It is a favourite example of how splicing factors and alternative splicing can have a profound influence in biology and be used cleverly in the molecular circuitry of the cell to enact elegant regulatory decisions.

      In this work, Storer, McClure and colleagues use genome-wide DNA-protein binding assays, transcriptomics, and genetics to work out that Sxl is also a chromatin factor with a sex-independent, neuron-specific role in stimulating transcription by Pol III and Pol II, of genes involved with metabolism and protein homeostasis, including some encoding tRNAs.

      This opens a large number of interesting biological questions that range from biochemistry, gene regulation or neurobiology to evolution. How is the simultaneous capacity of binding RNA and chromatin (with the same protein domain, RRM) regulated/coordinated? How did this dual activity evolve and which one is the ancestral one? How many other RRM-containing RNA-binding proteins can also bind chromatin? How is Sxl recruited to chromatin to both Pol II and Pol III targets and are they functionally related? If so, how is the coordination of cellular functions activated through different RNA polymerases taking place and what is the role of Sxl in this? What are the functional consequences to neuronal biology? Does this affect similarly all Sxl-expressing neurons?

      The evidence for the central tenet of the paper -- that Sxl acts as a chromatin regulator with Polr3E, activating at least some of its targets with either Pol III or Pol II -- is logical and compelling, the paper is well written and the figures well presented. Of course, more experiments could always be wished for and proposed, but I think this manuscript could be published in many journals with just a minor revision not involving additional experiments.

    1. I am sincerely grateful to the editors and peer reviewers at MetaROR for their detailed feedback and valuable comments and suggestions. I have addressed each point below.

      Handling Editor

      1. However, the article’s progression and arguments, along with what it seeks to contribute to the literature, need refinement and clarification. The argument for PRC is under-developed due to a lack of clarity about what the article means by scientific communication. Clarity here might make the endorsement of PRC seem like less of a foregone conclusion.

      The structure of the paper (and discussion) has changed significantly to address the feedback.

      2. I strongly endorse the main theme of most of the reviews, which is that the progression and underlying justifications for this article’s arguments need a great deal of work. In my view, this article’s main contribution seems to be the evaluation of the three peer review models against the functions of scientific communication. I say ‘seems to be’ because the article is not very clear on that and I hope you will consider clarifying what your manuscript seeks to add to the existing work in this field. In any case, if that assessment of the three models is your main contribution, that part is somewhat underdeveloped. Moreover, I never got the sense that there is clear agreement in the literature about what the tenets of scientific communication are. Note that scientific communication is a field in its own right.

      I have implemented a more rigorous approach to argumentation in response. “Scientific communication” was replaced by “scholarly communication.”

      3. I also agree that the paper is too strongly worded at times, with limitations and assumptions in the analysis minimised or not stated. For example, all of the typologies and categories drawn could easily be reorganised and there is a high degree of subjectivity in this entire exercise. Subjective choices should be highlighted and made salient for the reader. Note that greater clarity, rigour, and humility may also help with any alleged or actual bias.

      I have incorporated the conceptual framework and description of the research methodology. However, the Discussion section reflects my personal perspective in some points, which I have explicitly highlighted to ensure clarity.

      4. I agree with Reviewer 3 that the ‘we’ perspective is distracting.

      This has been fixed.

      5. The paragraph starting with ‘Nevertheless’ on page 2 is very long.

      The text was restructured.

      6. There are many points where language could be shortened for readability, for example:

      Page 3: ‘decision on publication’ could be ‘publication decision’.

      Page 5: ‘efficiency of its utilization’ could be ‘its efficiency’.

      Page 7: ‘It should be noted…’ could be ‘Note that…’.

      I have proofread the text.

      7. Page 7: ‘It should be noted that..’ – this needs a reference.

      This statement has been moved to the Discussion section, paraphrased, and reference added.

      “It should be also noted that peer review innovations pull in opposing directions, with some aiming to increase efficiency and reduce costs, while others aim to promote rigor and increase costs (Kaltenbrunner et al., 2022).”

      8. I’m not sure that registered reports reflect a hypothetico-deductive approach (page 6). For instance, systematic reviews (even non-quantitative ones) are often published as registered reports and Cochrane has required this even before the move towards registered reports in quantitative psychology.

      I have added this clarification.

      9. I agree that modular publishing sits uneasily as its own chapter.

      Modular publishing has been combined with registered reports into the deconstructed publication group of models, now Section 5.1.

      10. Page 14: ‘The "Publish-Review-Curate" model is universal that we expect to be the future of scientific publishing. The transition will not happen today or tomorrow, but in the next 5-10 years, the number of projects such as eLife, F1000Research, Peer Community in, or MetaROR will rapidly increase’. This seems overly strong (an example of my larger critique and that of the reviewers).

      This part of the text has been rewritten.

      Reviewer 1

      11. For example, although Model 3 has less chance of introducing bias to the readers, it also weakens the filtering function of the review system. Let’s just think about the dangers of machine-generated articles, paper-mills, p-hacked research reports and so on. Although the editors do some pre-screening for the submissions, in a world with only Model 3 peer review the literature could easily get loaded with even more ‘garbage’ than in a model where additional peers help the screening.

      I think that generated text is better detected by software tools. At the same time, I have tried to describe the pros and cons of the different models in a more balanced way in the concluding section.

      12. Compared to registered reports, other aspects can come into focus that Model 3 cannot cover. It’s the efficiency of researchers’ work. In the case of registered reports, Stage 1 review can still help researchers to modify or improve their research design or data collection method. Empirical work can be costly and time-consuming and post-publication review can only say that “you should have done it differently then it would make sense”.

      Thank you very much for this valuable contribution. I have added this statement on p. 11.

      13. Finally, the author puts openness as a strength of Model 3. In my eyes, openness is a separate question. All models can work very openly and transparently in the right circumstances. This dimension is not an inherent part of the models.

      I think that the model, providing peer reviews to all the submissions, ensures maximum transparency. However, I have made an effort to make the wording more balanced and distinguish my personal perspective from the literature.

      14. In conclusion, I would not pass a verdict on the models, but instead emphasize the different functions they can play in scientific communication.

      This idea has been reflected now in the concluding section.

      15. A minor comment: I found that a number of statements lack references in the Introduction. I would have found them useful for statements such as “There is a point of view that peer review is included in the implicit contract of the researcher.”

      Thank you for your feedback. I have implemented a more rigorous approach to argumentation in response.

      Reviewer 2

      16. The primary weakness of this article is that it presents itself as an 'analysis' from which they 'conclude' certain results such as their typology, when this appears clearly to be an opinion piece. In my view, this results in a false claim of objectivity which detracts from what would otherwise be an interesting and informative, albeit subjective, discussion, and thus fails to discuss the limitations of this approach.

      I have incorporated the conceptual framework and description of the research methodology. However, the Discussion section reflects my personal perspective in some points, which I have explicitly highlighted to ensure clarity.

      17. A secondary weakness is that the discussion is not well structured and there are some imprecisions of expression that have the potential to confuse, at least at first.

      The structure of the paper (and discussion) has changed significantly.

      18. The evidence and reasoning for claims made is patchy or absent. One instance of the former is the discussion of bias in peer review. There are a multitude of studies of such bias and indeed quite a few meta-analyses of these studies. A systematic search could have been done here but there is no attempt to discuss the totality of this literature. Instead, only a few specific studies are cited. Why are these ones chosen? We have no idea. To this extent I am not convinced that the references used here are the most appropriate.

      I have reviewed the existing references and incorporated additional sources. However, the study does not claim to conduct a systematic literature review; rather, it adopts an interpretative approach to literature analysis.

      19. Instances of the latter are the claim that "The most well-known initiatives at the moment are ResearchEquals and Octopus" for which no evidence is provided, the claim that "we believe that journal-independent peer review is a special case of Model 3" for which no further argument is provided, and the claim that "the function of being the "supreme judge" in deciding what is "good" and "bad" science is taken on by peer review" for which neither is provided.

      Thank you for your feedback. I have implemented a more rigorous approach to argumentation in response.

      20. A particular example of this weakness, which is perhaps of marginal importance to the overall paper but of strong interest to this reviewer is the rather odd engagement with history within the paper. It is titled "Evolution of Peer Review" but is really focussed on the contemporary state-of-play. Section 2 starts with a short history of peer review in scientific publishing, but that seems intended only to establish what is described as the 'traditional' model of peer review. Given that that short history had just shown how peer review had been continually changing in character over centuries - and indeed Kochetkov goes on to describe further changes - it is a little difficult to work out what 'traditional' might mean here; what was 'traditional' in 2010 was not the same as what was 'traditional' in 1970. It is not clear how seriously this history is being taken. Kochetkov has earlier written that "as early as the beginning of the 21st century, it was argued that the system of peer review is 'broken'" but of course criticisms - including fundamental criticisms - of peer review are much older than this. Overall, this use of history seems designed to privilege the experience of a particular moment in time, that coincides with the start of the metascience reform movement.

      While the paper addresses some aspects of peer review history, it does not provide a comprehensive examination of this topic. A clarifying statement to this effect has been included in the methodology section.

      “… this section incorporates elements of historical analysis, it does not fully qualify as such because primary sources were not directly utilized. Instead, it functions as an interpretative literature review, and one that is intentionally concise, as a comprehensive history of peer review falls outside the scope of this research”.

      21. Section 2 also demonstrates some of the second weakness described, a rather loose structure. Having moved from a discussion of the history of peer review to detail the first model, 'traditional' peer review, it then also goes on to describe the problems of this model. This part of the paper is one of the best, and best-evidenced. Given the importance of it to the main thrust of the discussion it should probably have been given more space as a Section all on its own.

      This section (now Section 4) has been extended, see also previous comment.

      22. Another example is Section 4 on Modular Publishing, in which Kochetkov notes "Strictly speaking, modular publishing is primarily an innovative approach for the publishing workflow in general rather than specifically for peer review."

      Kochetkov says "This is why we have placed this innovation in a separate category" but if it is not an innovation in peer review, the bigger question is 'Why was it included in this article at all?'.

      Modular publishing has been combined with registered reports into the deconstructed publication group of models, now Section 5.1.

      23. One example of the imprecisions of language is as follows. The author also shifts between the terms 'scientific communication' and 'science communication' but, at least in many contexts familiar to this reviewer, these are not the same things, the former denoting science-internal dissemination of results through publication (which the author considers), conferences and the like (which the author specifically excludes) while the latter denotes the science-external public dissemination of scientific findings to non-technical audiences, which is entirely out of scope for this article.

      Thank you for your remark. As a non-native speaker, I initially did not grasp the distinction between the terms. However, I believe the phrase ‘scholarly communication’ is the most universally applicable term. This adjustment has now been incorporated into the text.

      24. A final note is that Section 3, while an interesting discussion, seems largely derivative from a typology of Waltman, with the addition of a consideration of whether a reform is 'radical' or 'incremental', based on how 'disruptive' the reform is. Given that this is inherently a subjective decision, I wonder if it might not have been more informative to consider 'disruptiveness' on a scale and plot it accordingly. This would allow for some range to be imagined for each reform as well; surely reforms might be more or less disruptive depending on how they are implemented. Given that each reform is considered against each model, it is somewhat surprising that this is not presented in a tabular or graphical form.

      Ultimately, I excluded this metric due to its current reliance on purely subjective judgment. Measuring 'disruptiveness', e.g., through surveys or interviews remains a task for future research. 

      25. Reconceptualize this as an opinion piece. Where systematic evidence can be drawn upon to make points, use that, but don't be afraid to just present a discussion from what is clearly a well-informed author.

      I cannot definitively classify this work as an opinion piece. In fact, this manuscript synthesizes elements of a literature review, research article, and opinion essay. My idea was to integrate the strengths of all three genres.

      26. Reconsider the focus on history and 'evolution' if the point is about the current state of play and evaluation of reforms (much as I would always want to see more studies on the history and evolution of peer review).

      I have revised the title to better reflect the study’s scope and explicitly emphasize its focus on contemporary developments in the field.

      “Peer Review at the Crossroads”

      27. Consider ways in which the typology might be expanded, even if at subordinate level.

      I have updated the typology and introduced the third tier, where it is applicable (see Fig.2).

      Reviewer 3

      28. In my view, the biggest issue with the current peer review system is the low quality of reviews, but the manuscript only mentions this fleetingly. The current system facilitates publication bias, confirmation bias, and is generally very inconsistent. I think this is partly due to reviewers’ lack of accountability in such a closed peer review system, but I would be curious to hear the author’s ideas about this, more elaborately than they provide them as part of issue 2.

      I have elaborated on this issue in the footnote.

      29. I’m missing a section in the introduction on what the goals of peer review are or should be. You mention issues with peer review, and these are mostly fair, but their importance is only made salient if you link them to the goals of peer review. The author does mention some functions of peer review later in the paper, but I think it would be good to expand that discussion and move it to a place earlier in the manuscript.

      The functions of peer review are summarized in the first paragraph of Introduction.

      30. Table 1 is intuitive but some background on how the author arrived at these categorizations would be welcome.

      When is something incremental and when is something radical? Why are some innovations included but not others (e.g., collaborative peer review, see https://content.prereview.org/how-collaborative-peer-review-can-transform-scientific-research/)?

      Collaborative peer review, namely PREreview, was mentioned in the context of Model 3 (Publish-Review-Curate). However, I have extended this part of the paper.

      31. “Training of reviewers through seminars and online courses is part of the strategies of many publishers. At the same time, we have not been able to find statistical data or research to assess the effectiveness of such training.” (p. 5) There is some literature on this, although not recent. See work by Sara Schroter, for example (Schroter et al., 2004; Schroter et al., 2008).

      Thank you very much, I have added these studies and a few more recent ones.

      32. “It should be noted that most initiatives aimed at improving the quality of peer review simultaneously increase the costs.” (p. 7) This claim needs some support. Please explicate why this typically is the case and how it should impact our evaluations of these initiatives.

      I have moved this part to the Discussion section.

      33. I would rephrase “Idea of the study” in Figure 2 since the other models start with a tangible output (the manuscript). This is the same for registered reports where they submit a tangible report including hypotheses, study design, and analysis plan. In the same vein, I think study design in the rest of the figure might also not be the best phrasing. Maybe the author could use the terminology used by COS (Stage 1 manuscript, and Stage 2 manuscript, see Details & Workflow tab of https://www.cos.io/initiatives/registered-reports). Relatedly, “Author submits the first version of the manuscript” in the first box after the ‘Manuscript (report)’ node may be a confusing phrase because I think many researchers see the first version of the manuscript as the stage 1 report sent out for stage 1 review.

      Thank you very much. Stage 1 and Stage 2 manuscripts look like a suitable labelling solution.

      34. One pathway that is not included in Figure 2 is that authors can decide to not conduct the study when improvements are required. Relatedly, in the publish-review-curate model, is revising the manuscripts based on the reviews not optional as well? Especially in the case of 3a, authors can hardly be forced to make changes even though the reviews are posted on the platform.

      All four models imply a certain level of generalization; thus, I tried to avoid redundant details. However, I have added this choice to the PRC model (now, Model 4).

      35. I think the author should discuss the importance of ‘open identities’ more. This factor is now not explicitly included in any of the models, while it has been found to be one of the main characteristics of peer review systems (Ross-Hellauer, 2017).

      This part has been extended.

      36. More generally, I was wondering why the author chose these three models and not others. What were the criteria for inclusion in the manuscript? Some information on the underlying process would be welcome, especially when claims like “However, we believe that journal-independent peer review is a special case of Model 3 (“Publish-Review-Curate”).” are made without substantiation.

      The study included four generalized models of peer review that involved some level of abstraction.

      37. Maybe it helps to outline the goals of the paper a bit more clearly in the introduction. This helps the reader to know what to expect.

      The Introduction has been revised including the goal and objectives.

      38. The Modular Publishing section is not inherently related to peer review models, as you mention in the first sentence of that paragraph. As such, I think it would be best to omit this section entirely to maintain the flow of the paper. Alternatively, you could shortly discuss it in the discussion section but a separate paragraph seems too much from my point of view.

      Modular publishing has been combined with registered reports into the fragmented publishing group of models, now in Section 5.

      39. Labeling model 3 as post-publication review might be confusing to some readers. I believe many researchers see post-publication review as researchers making comments on preprints, or submitting commentaries to journals. Those activities are substantially different from the publish-review-curate model so I think it is important to distinguish between these types.

      The label was changed to the Publish-Review-Curate model.

      40. I do not think the conclusions drawn below Table 3 logically follow from the earlier text. For example, why are “all functions of scientific communication implemented most quickly and transparently in Model 3”? It could be that the entire process takes longer in Model 3 (e.g. because reviewers need more time), so that Model 1 and Model 2 lead to outputs quicker. The same holds for the following claim: “The additional costs arising from the independent assessment of information based on open reviews are more than compensated by the emerging opportunities for scientific pluralism.” What is the empirical evidence for this? While I personally do think that Model 3 improves on Model 1, emphatic statements like this require empirical evidence. Maybe the author could provide some suggestions on how we can attain this evidence. Model 2 does have some empirical evidence underpinning its validity (see Scheel, Schijen, Lakens, 2021; Soderberg et al., 2021; Sarafoglou et al. 2022) but more meta-research inquiries into the effectiveness and cost-benefit ratio of registered reports would still be welcome in general.

      The Discussion section has been substantially revised to address this point. While I acknowledge the current scarcity of empirical studies on innovative peer review models, I have incorporated a critical discussion of this methodological gap. I am grateful for the suggested literature on RRs, which I have now integrated into the relevant subsection.

      41. What is the underlying source for the claim that openness requires three conditions?

      I have made an effort to clarify within the text that this reflects my personal stance.

      42. “If we do not change our approach, science will either stagnate or transition into other forms of communication.” (p. 2) I don’t think this claim is supported sufficiently strongly. While I agree there are important problems in peer review, I think there would need to be a more in-depth and evidence-based analysis before claims like this can be made.

      The sentence has been rephrased.

      43. On some occasions, the author uses “we” while the study is single authored.

      This has been fixed.

      44. Figure 1: The top-left arrow from revision to (re-)submission is hidden

      I have updated Figure 1.

      45. “The low level of peer review also contributes to the crisis of reproducibility in scientific research (Stoddart, 2016).” (p. 4) I assume the author means the low quality of peer review.

      This has been fixed.

      46. “Although this crisis is due to a multitude of factors, the peer review system bears a significant responsibility for it.” (p. 4)

      This is also a big claim that is not substantiated.

      I have paraphrased this sentence as “While multiple factors drive this crisis, deficiencies in the peer review process remain a significant contributor.” and added a footnote.

      47. “Software for automatic evaluation of scientific papers based on artificial intelligence (AI) has emerged relatively recently” (p. 5) The author could add RegCheck (https://regcheck.app/) here, even though it is still in development. This tool is especially salient in light of the finding that preregistration-paper checks are rarely done as part of reviews (see Syed, 2023)

      Thank you very much, I have added this information.

      48. There is a typo in last box of Figure 1 (“decicion” instead of “decision”). I also found typos in the second box of Figure 2, where “screns” should be “screens”, and the author decision box where “desicion” should be “decision”

      This has been fixed.

      49. Maybe it would be good to mention results blinded review in the first paragraph of 3.2. This is a form of peer review where the study is already carried out but reviewers are blinded to the results. See work by Locascio (2017), Grand et al. (2018), and Woznyj et al. (2018).

      Thanks, I have added this (now section 5.2)

      50. Is “Not considered for peer review” in Figure 3b not the same as rejected? I feel that it is rejected in the sense that neither the manuscript nor the reviews will be posted on the platform.

      Changed into “Rejected”

      51. “In addition to the projects mentioned, there are other platforms, for example, PREreview, which departs even more radically from the traditional review format due to the decentralized structure of work.” (p. 11) For completeness, I think it would be helpful to add some more information here, for example why exactly decentralization is a radical departure from the traditional model.

      I have extended this passage.

      52. “However, anonymity is very conditional - there are still many “keys” left in the manuscript, by which one can determine, if not the identity of the author, then his country, research group, or affiliated organization.” (p.11) I would opt for the neutral “their” here instead of “his”, especially given that this is a paragraph about equity and inclusion.

      This has been fixed.

      53. “Thus, “closeness” is not a good way to address biases.” (p. 11) This might be a straw man argument because I don’t believe researchers have argued that it is a good method to combat biases. If they did, it would be good to cite them here. Alternatively, the sentence could be omitted entirely.

      I have omitted the sentence.

      54. I would start the Modular Publishing section with the definition as that allows readers to interpret the other statements better.

      Modular publishing has been combined with registered reports into the deconstructed publication group of models, now in Section 5, general definition added.

      55. It would be helpful if the Models were labeled (instead of using Model 1, Model 2, and Model 3) so that readers don’t have to think back what each model involved.

      All the models represent a kind of generalization, which is why non-detailed labels are used. The text labels may vary depending on the context.

      56. Table 2: “Decision making” for the editor’s role is quite broad, I recommend to specify and include what kind of decisions need to be made.

      Changed into “Making accept/reject decisions”

      57. Table 2: “Aim of review” – I believe the aim of peer review differs also within these models (see the “schools of thought” the author mentions earlier), so maybe a statement on what the review entails would be a better way to phrase this.

      Changed into “What does peer review entail?”

      58. Table 2: One could argue that the ‘object of the review’ in Registered Reports is also the manuscript as a whole, just in different stages. As such, I would phrase this differently.

      The current wording fits your remark:

      “Manuscript in terms of study design and execution”

      Reviewer 4

      59. Page 3: It’s hard to get a feel for the timeline given the dates that are described. We have peer review becoming standard after WWII (after 1945), definitively established by the second half of the century, an example of obligatory peer review starting in 1976, and in crisis by the end of the 20th century. I would consider adding examples that better support this timeline – did it become more common in specific journals before 1976? Was the crisis by the end of the 20th century something that happened over time or something that was already intrinsic to the institution? It doesn’t seem like enough time to get established and then enter crisis, but more details/examples could help make the timeline clear. Consider discussing the benefits of the traditional model of peer review.

      This section has been extended.

      60. Table 1 – Most of these are self-explanatory to me as a reader, but not all. I don’t know what a registered report refers to, and it stands to reason that not all of these innovations are familiar to all readers. You do go through each of these sections, but that’s not clear when I initially look at the table. Consider having a more informative caption. Additionally, the left column is “Course of changes” here but “Directions” in text. I’d pick one and go with it for consistency.

      Table 1 has been replaced by Figure 2. I have also extended text descriptions, added definitions.

      61. With some of these methods, there’s the ability to also submit to a regular journal. Going to a regular journal presumably would instigate a whole new round of review, which may or may not contradict the previous round of post-publication review and would increase the length of time to publication by going through both types. If someone has a goal to publish in a journal, what benefit would they get by going through the post-publication review first, given this extra time?

      Some of these platforms, e.g., F1000, Lifecycle Journal, replace conventional journal publishing. Modular publishing allows for step-by-step feedback from peers.

      An important advantage of RRs over other peer review models lies in their capacity to enhance research efficiency. By conducting peer review at Stage 1, researchers gain the opportunity to refine their study design or data collection protocols before empirical work begins.

      Other models of review can offer critiques such as "the study should have been conducted differently" without an actionable opportunity for improvement. The key motivation for having my paper reviewed in MetaROR is the quality of peer review – I have never received so many comments, frankly! Moreover, platforms such as MetaROR usually have partnering journals.

      62. There’s a section talking about institutional change (page 14). It mentions that openness requires three conditions – people taking responsibility for scientific communication, authors and reviewers, and infrastructure. I would consider adding some discussion of readers and evaluators. Readers have to be willing to accept these papers as reliable, trustworthy, and respectable to read and use the information in them.

      Evaluators such as tenure committees and potential employers would need to consider papers submitted through these approaches as evidence of scientific scholarship for the effort to be worthwhile for scientists.

      I have omitted these conditions and employed Moore’s Technology Adoption Life Cycle. Thank you very much for your comment!

      63. Based on this overview, which seems somewhat skewed towards the merits of these methods (conflict of interest, limited perspective on downsides to new methods/upsides to old methods), I am not quite ready to accept this effort as equivalent of a regular journal and pre-publication peer review process. I look forward to learning more about the approach and seeing this review method in action and as it develops.

      The Discussion section has been substantially revised to address this point. While I acknowledge the current scarcity of empirical studies on innovative peer review models, I have incorporated a critical discussion of this methodological gap.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This paper concerns mechanisms of foraging behavior in C. elegans. Upon removal from food, C. elegans first executes a stereotypical local search behavior in which it explores a small area by executing many random, undirected reversals and turns called "reorientations." If the worm fails to find food, it transitions to a global search in which it explores larger areas by suppressing reorientations and executing long forward runs (Hills et al., 2004). At the population level, the reorientation rate declines gradually. Nevertheless, about 50% of individual worms appear to exhibit an abrupt transition between local and global search, which is evident as a discrete transition from high to low reorientation rate (Lopez-Cruz et al., 2019). This observation has given rise to the hypothesis that local and global search correspond to separate internal states with the possibility of sudden transitions between them (Calhoun et al., 2014). The main conclusion of the paper is that it is not necessary to posit distinct internal states to account for discrete transitions from high to low reorientation rates. On the contrary, discrete transitions can occur simply because of the stochastic nature of the reorientation behavior itself.

      Strengths:

      The strength of the paper is the demonstration that a more parsimonious model explains abrupt transitions in the reorientation rate.

      Weaknesses:

      (1) Use of the Gillespie algorithm is not well justified. A conventional model with a fixed dt and an exponentially decaying reorientation rate would be adequate and far easier to explain. It would also be sufficiently accurate - given the appropriate choice of dt - to support the main claims of the paper, which are merely qualitative. In some respects, the whole point of the paper - that discrete transitions are an epiphenomenon of stochastic behavior - can be made with the authors' version of the model having a constant reorientation rate (Figure 2f).

      We apologize, but we are not sure what the reviewer means by “fixed dt”. If the reviewer means taking discrete steps in time (dt), and modeling whether a reorientation occurs, we would argue that the Gillespie algorithm is a better way to do this because it provides floating-point precision, rather than a time resolution limited by dt, which we hopefully explain in the updated text (Lines 107-192).
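
      The distinction drawn in this exchange can be illustrated with a minimal sketch. It assumes a constant reorientation rate purely for brevity (the model discussed here uses a time-varying rate), and the function names and parameter values are hypothetical, not taken from the manuscript. A fixed-dt scheme places events on a grid of spacing dt, whereas Gillespie-style sampling draws exponential waiting times, so event times are continuous.

```python
import random


def fixed_dt_events(rate, t_end, dt, rng):
    """Fixed-step scheme: one Bernoulli trial per step of length dt.

    Event times are multiples of dt, so the time resolution is limited by dt.
    The approximation is only reasonable when rate * dt is much less than 1.
    """
    times = []
    n_steps = int(round(t_end / dt))
    for i in range(1, n_steps + 1):
        if rng.random() < rate * dt:
            times.append(i * dt)
    return times


def gillespie_events(rate, t_end, rng):
    """Gillespie-style scheme for a constant rate (illustrative assumption).

    Waiting times between events are drawn from an exponential distribution,
    so event times are not tied to any time grid.
    """
    times = []
    t = rng.expovariate(rate)
    while t < t_end:
        times.append(t)
        t += rng.expovariate(rate)
    return times


if __name__ == "__main__":
    rng = random.Random(0)
    print(len(fixed_dt_events(0.5, 2000.0, 0.1, rng)))
    print(len(gillespie_events(0.5, 2000.0, rng)))
```

      Both schemes give a similar number of events when dt is small; the difference lies in the resolution of the event times and in the need to choose dt at all.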

      The reviewer is correct that discrete transitions are an epiphenomenon of stochastic behavior, as we show in Figure 2f. However, abrupt stochastic jumps that occur with a constant rate do not produce persistent changes in the observed rate because the rate is, by definition, constant. The theory that there are local and global searches is based on the observation that individual worms often abruptly change their reorientation rates. But this observation is only true for a fraction of worms. We argue that such abrupt changes are not observed for all, or even most, worms because they are the result of stochastic sampling, not a sudden change in search strategy.

      (2) In the manuscript, the Gillespie algorithm is very poorly explained, even for readers who already understand the algorithm; for those who do not it will be essentially impossible to comprehend. To take just a few examples: in Equation (1), omega is defined as reorientations instead of cumulative reorientations; it is unclear how (4) follows from (2) and (3); notation in (5), line 133, and (7) is idiosyncratic. Figure 1a does not help, partly because the notation is unexplained. For example, what do the arrows mean, what does "*" mean?

      We apologize for this; you are correct that 𝛀 is cumulative reorientations, and we have edited the text for clarity (Lines 107-192).

      We apologize for the arrow notation confusion. Arrow notation is commonly used in pseudocode to indicate variable assignment, and so we used it to indicate variable assignment updates in the algorithm.

      We added Figure 2a to help explain the Gillespie algorithm for people who are unfamiliar with it, but you are correct, some notation, like probabilities, was left unexplained. We have added more text to the figure legend. Hopefully this additional text, along with lines 105-190, provides better clarification.

      (3) In the model, the reorientation rate dΩ⁄dt declines to zero but the empirical rate clearly does not. This is a major flaw. It would have been easy to fix by adding a constant to the exponentially declining rate in (1). Perhaps fixing this obvious problem would mitigate the discrepancies between the data and the model in Figure 2d.

      You are correct that the model deviates slightly at longer times, but this result is consistent with Klein et al., who show a continuous decline of reorientations. However, we have added a constant to the model (b, Equation 2), since an infinite run length is likely not physiological.
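
      The effect of adding a constant to the declining rate can be sketched as follows. This is an illustration only: the response does not reproduce Equation 2, so the rate form a·exp(−t/τ) + b, the parameter names, and the values are all assumptions. The sampler uses thinning (the Lewis-Shedler method), a standard exact technique for time-varying Poisson rates, swapped in here for brevity rather than the authors' Gillespie implementation.

```python
import math
import random


def decaying_rate(t, a, tau, b):
    """Assumed rate: exponential decline from a + b toward a constant floor b."""
    return a * math.exp(-t / tau) + b


def sample_reorientations(a, tau, b, t_end, rng):
    """Sample event times on [0, t_end) by thinning.

    Candidate events are proposed at the maximal rate (a + b, reached at t = 0)
    and each is accepted with probability rate(t) / (a + b).
    """
    lam_max = a + b
    if lam_max <= 0:
        return []
    times = []
    t = 0.0
    while True:
        t += rng.expovariate(lam_max)
        if t >= t_end:
            return times
        if rng.random() * lam_max < decaying_rate(t, a, tau, b):
            times.append(t)


if __name__ == "__main__":
    rng = random.Random(42)
    with_floor = sample_reorientations(1.0, 100.0, 0.1, 1000.0, rng)
    no_floor = sample_reorientations(1.0, 100.0, 0.0, 1000.0, rng)
    # Late events (t > 500) persist only when the floor b is non-zero.
    print(sum(t > 500 for t in with_floor), sum(t > 500 for t in no_floor))
```

      With b = 0 the rate decays toward zero and late reorientations almost vanish, which corresponds to the unphysiological infinite run length mentioned in the response; a non-zero b keeps reorientations occurring at long times.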

      (4) Evidence that the model fits the data (Figure 2d) is unconvincing. I would like to have seen the proportion of runs in which the model generated one as opposed to multiple or no transitions in reorientation rate; in the real data, the proportion is 50% (Lopez). It is claimed that the "model demonstrated a continuum of switching to non-switching behavior" as seen in the experimental data but no evidence is provided.

      We should clarify that the 50% proportion cited by López-Cruz was based on an arbitrary difference in slopes, and on visual assessment of the data (López-Cruz, Figure S2). We added a comment in the text to clarify this (Lines 76 – 78). We sought to avoid this subjective assessment by plotting the distribution of slopes and transition times produced by the method used in López-Cruz. We should also clarify what we meant by “a continuum of switching and non-switching” behavior. Both the transition time distributions and the slope-difference distributions do not appear to be the result of two distributions (the distributions in Figure 1 are not bimodal). This is unlike roaming and dwelling on food, where two distinct distributions of behavioral metrics can be identified based on speed and angular speed (Flavell et al., 2009, Fig S2a).

      Based on the advice of Reviewer #3, we have also modeled the data using different starting amounts of M (M₀). By definition, an initial value of M₀ = 1 is a two-state switching strategy; the worm either uses a reorientation rate of a (when M = 1) or b (when M = 0). As expected, this does produce a bimodal distribution of slope differences (Figure 3b), which is significantly different from the experimental distribution (Figure 3c). We have added a new section to explain this in more detail (Lines 253 – 297).
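
      The contrast between M₀ = 1 and larger M₀ can be sketched with a two-reaction Gillespie simulation. This is a hedged illustration, not the authors' code: it assumes that each unit of M decays independently at a hypothetical rate k, and that the reorientation rate interpolates linearly between a (all of M present) and b (none left). Only the two endpoints, rate a at M = 1 and rate b at M = 0, are stated in the response; everything else is an assumption.

```python
import random


def simulate_worm(m0, a, b, k, t_end, rng):
    """Gillespie direct method with two reaction channels (assumed model).

    Channel 1: decay of one unit of M, propensity k * m.
    Channel 2: reorientation, propensity b + (a - b) * m / m0.
    Returns (reorientation_times, decay_times).
    """
    t, m = 0.0, m0
    reorientations, decays = [], []
    while True:
        r_decay = k * m
        r_reorient = b + (a - b) * m / m0
        total = r_decay + r_reorient
        if total <= 0:
            break
        # Time to the next event of either kind is exponential in the total rate.
        t += rng.expovariate(total)
        if t >= t_end:
            break
        # Choose which channel fired, in proportion to its propensity.
        if rng.random() * total < r_decay:
            m -= 1
            decays.append(t)
        else:
            reorientations.append(t)
    return reorientations, decays


if __name__ == "__main__":
    rng = random.Random(1)
    _, d1 = simulate_worm(1, 1.0, 0.1, 0.05, 1000.0, rng)
    _, d50 = simulate_worm(50, 1.0, 0.1, 0.05, 1000.0, rng)
    print(len(d1), len(d50))
```

      Under these assumptions, M₀ = 1 yields a single abrupt switch from rate a to rate b at one random time, while a large M₀ steps the rate down through many small decrements, approximating a gradual decline.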

      (5) The explanation for the poor fit between the model and data (lines 166-174) is unclear. Why would externally triggered collisions cause a shift in the transition distribution?

      Thank you, we rewrote the text to clarify this better (Lines 227-233). There were no externally triggered collisions; 10 animals were used per experiment. They would occasionally collide during the experiment, but these collisions were excluded from the data that were provided. However, worms are also known to increase reorientations when they encounter a pheromone trail, and it is unknown (from this dataset) which reorientations may have been a result of this phenomenon.

      (6) The discussion of Levy walks and the accompanying figure are off-topic and should be deleted.

      Thank you, we agree that this topic is tangential, and we removed it.

      Reviewer #2 (Public review):

      Summary:

      In this study, the authors build a statistical model that stochastically samples from a time-interval distribution of reorientation rates. The form of the distribution is extracted from a large array of behavioral data, and is then used to describe not only the dynamics of individual worms (including the inter-individual variability in behavior), but also the aggregate population behavior. The authors note that the model does not require assumptions about behavioral state transitions, or evidence accumulation, as has been done previously, but rather that the stochastic nature of behavior is "simply the product of stochastic sampling from an exponential function".

      Strengths:

      This model provides a strong juxtaposition to other foraging models in the worm. Rather than evoking a behavioral transition function (that might arise from a change in internal state or the activity of a cell type in the network), or evidence accumulation (which again maps onto a cell type, or the activity of a network) - this model explains behavior via the stochastic sampling of a function of an exponential decay. The underlying model and the dynamics being simulated, as well as the process of stochastic sampling, are well described and the model fits the exponential function (Equation 1) to data on a large array of worms exhibiting diverse behaviors (1600+ worms from Lopez-Cruz et al). The work of this study is able to explain or describe the inter-individual diversity of worm behavior across a large population. The model is also able to capture two aspects of the reorientations, including the dynamics (to switch or not to switch) and the kinetics (slow vs fast reorientations). The authors also work to compare their model to a few others including the Levy walk (whose construction arises from a Markov process) to a simple exponential distribution, all of which have been used to study foraging and search behaviors.

      Weaknesses:

      This manuscript has two weaknesses that dampen the enthusiasm for the results. First, in all of the examples the authors cite where a Gillespie algorithm is used to sample from a distribution, be it the kinetics associated with chemical dynamics, or a Lotka-Volterra Competition Model, there are underlying processes that govern the evolution of the dynamics, and thus the sampling from distributions. In one of their references, for instance, the stochasticity arises from the birth and death rates, thereby influencing the genetic drift in the model. In these examples, the process governing the dynamics (and thus generating the distributions from which one samples) is distinct from the behavior being studied. In this manuscript, the distribution being sampled is the exponential decay function of the reorientation rate (lines 100-102). This appears to be tautological - a decay function fitted to the reorientation data is then sampled to generate the distributions of the reorientation data. That the model performs well and matches the data is commendable, but it is unclear how that could not be the case if the underlying function generating the distribution was fit to the data.

      Thank you, we apologize that this was not clearer. In the Lotka-Volterra model, the densities of predators and prey are being modeled, with the underlying assumption that rates of birth and death are inherently stochastic. In our model, the number of reorientations is being modeled, with the assumption (based on the experiments) that the occurrence of reorientations is stochastic, just like the occurrence (birth) of a prey animal is stochastic. However, the decay in M is phenomenological, and we speculate about the nature of M later in the manuscript.

      You are absolutely right that the decay function for M was fit to the population average of reorientations and then sampled to generate the distributions of the reorientation data. This was intentional to show that the parameters chosen to match the population average would produce individual trajectories with comparable stochastic “switching” as the experimental data. All we’re trying to show really is that observed sudden changes in reorientation that appear persistent can be produced by a stochastic process without resorting to binary state assignments. In Calhoun, et al 2014 it is reported all animals produced switch-like behavior, but in Klein et al, 2017 it is reported that no animals showed abrupt transitions. López-Cruz et al seem to show a mix of these results, which can easily be explained by an underlying stochastic process.

      The second weakness is somewhat related to the first, in that absent an underlying mechanism or framework, one is left wondering what insight the model provides.

      Stochastic sampling a function generated by fitting the data to produce stochastic behavior is where one ends up in this framework, and the authors indeed point this out: "simple stochastic models should be sufficient to explain observably stochastic behaviors." (Line 233-234). But if that is the case, what do we learn about how the foraging is happening? The authors suggest that the decay parameter M can be considered a memory timescale; which offers some suggestion, but then go on to say that the "physical basis of M can come from multiple sources". Here is where one is left for want: The mechanisms suggested, including loss of sensory stimuli, alternations in motor integration, ionotropic glutamate signaling, dopamine, and neuropeptides are all suggested: these are basically all of the possible biological sources that can govern behavior, and one is left not knowing what insight the model provides. The array of biological processes listed is so variable in dynamics and meaning, that their explanation of what governs M is at best unsatisfying. Molecular dynamics models that generate distributions can point to certain properties of the model, such as the binding kinetics (on and off rates, etc.) as explanations for the mechanisms generating the distributions, and therefore point to how a change in the biology affects the stochasticity of the process. It is unclear how this model provides such a connection, especially taken in aggregate with the previous weakness.

      Providing a roadmap of how to think about the processes generating M, the meaning of those processes in search, and potential frameworks that are more constrained and with more precise biological underpinning (beyond the array of possibilities described) would go a long way to assuaging the weaknesses.

      Thank you, these are all excellent points. We should clarify that in López-Cruz et al, they claim that only 50% of the animals fit a local/global search paradigm. We are simply proposing there is no need for designating local and global searches if the data don’t really support it. The underlying behavior is stochastic, so the sudden switches sometimes observed can be explained by a stochastic process where the underlying rate is slowing down, thus producing the persistently slow reorientation rate when an apparent “switch” occurs. What we hope to convey is that foraging doesn’t appear to follow a decision paradigm, but instead a gradual change in reorientations which for individual worms, can occasionally produce reorientation trajectories that appear switch-like.

      As for M, you are correct, we should be more explicit, and we have added text (Lines 319-359) to expand upon its possible biological origin.

      Reviewer #3 (Public review):

      Summary:

      This intriguing paper addresses a special case of a fundamental statistical question: how to distinguish between stochastic point processes that derive from a single "state" (or single process) and more than one state/process. In the language of the paper, a "state" (perhaps more intuitively called a strategy/process) refers to a set of rules that determine the temporal statistics of the system. The rules give rise to probability distributions (here, the probability for turning events). The difficulty arises when the sampling time is finite, and hence, the empirical data is finite, and affected by the sampling of the underlying distribution(s). The specific problem being tackled is the foraging behavior of C. elegans nematodes, removed from food. Such foraging has been studied for decades, and described by a transition over time from 'local'/'area-restricted search' (roughly in the initial 10-30 minutes of the experiments, in which animals execute frequent turns) to 'dispersion', or 'global search' (characterized by a low frequency of turns). The authors propose an alternative to this two-state description - a potentially more parsimonious single 'state' with time-changing parameters, which they claim can account for the full time course of these observations.

      Figure 1a shows the mean rate of turning events as a function of time (averaged across the population). Here, we see a rapid transient, followed by a gradual 4-5 fold decay in the rate, after which the rate levels off. This picture seems consistent with the two-state description. However, the authors demonstrate that individual animals exhibit different "transition" statistics (Figure 1e) and wish to explain this. They do so by fitting this mean with a single function (Equations 1-3).

      Strengths:

      As a qualitative exercise, the paper might have some merit. It demonstrates that apparently discrete states can sometimes be artifacts of sampling from smoothly time-changing dynamics. However, as a generic point, this is not novel, and so without the grounding in C. elegans data, is less interesting.

      Weaknesses:

      (1) The authors claim that only about half the animals tested exhibit discontinuity in turning rates. Can they automatically separate the empirical and model population into these two subpopulations (with the same method), and compare the results?

      Thank you, we should clarify that the observation that about half the animals exhibit discontinuity was not made by us, but by López-Cruz et al. The observed fraction of 50% was based on a visual assessment of the dual regression method we described. We added text (Lines 76-79) to clarify this. To make the process more objective, we decided to simply plot the distributions of the metrics they used for this assessment to see if two distinct populations could be observed. However, the distributions of slope differences and transition times do not produce two distinct populations. Our stochastic approach, which does not assume abrupt state-transitions, also produces comparable distributions. To quantify this, we have added a section varying M<sub>0</sub>, including setting M<sub>0</sub> to 1, so that the model by definition is a switch model. This model performs the worst (Lines 253-296, Figure 3).

      (2) The equations consider an exponentially decaying rate of turning events. If so, Figure 2b should be shown on a semi-logarithmic scale.

      We chose not to do this because this average is based on the number of discrete reorientation events observed within a 2-minute window. The number of events ranges from 0 to 6 (hence a rate of 0.5-3 min<sup>-1</sup>), which does not span one order of magnitude. Instead, we included a heat map (Figure 1a, Figure 2b bottom panel) which shows the density that the average is based on. We hope this provides some clarity to the reader.

      (3) The variables in Equations 1-3 and the methods for simulating them are not well defined, making the method difficult to follow. Assuming my reading is correct, Omega should be defined as the cumulative number of turning events over time (Omega(t)), not as a "turn" or "reorientation", which has no derivative. The relevant entity in Figure 1a is apparently <Omega (t)>, i.e. the mean number of events across a population which can be modelled by an expectation value. The time derivative would then give the expected rate of turning events as a function of time.

      Thank you, you are correct. Please see response to Reviewer #1.

      (4) Equations 1-3 are cryptic. The authors need to spell out up front that they are using a pair of coupled stochastic processes, sampling a hidden state M (to model the dynamic turning rate) and the actual turn events, Omega(t), separately, as described in Figure 2a. In this case, the model no longer appears more parsimonious than the original 2-state model. What then is its benefit or explanatory power (especially since the process involving M is not observable experimentally)?

      Thank you, yes we see how as written this was confusing. In our response to Reviewer #1, and in the text, we added an important detail:

      While reorientations are modeled as discrete events, which is observationally true, the amount of M at time t=0 is chosen to be large (M<sub>0</sub> = 1000), so that over the timescale of 40 minutes, the decay in M is practically continuous. This ensures that sudden changes in reorientations are not due to sudden changes in M, but due to the inherent stochasticity of reorientations.

      However, you are correct that if M were chosen to have a binary value of 0 or 1, then this would indeed be the two-state model. We added a new section to address this (Lines 253-287, Figure 3). Unlike the experiments, the two-state model produces bimodal distributions in slope and transition times, and these distributions are significantly different from the experimental data (Figure 3).
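
      To make the coupled processes described in this response concrete, the following is a minimal sketch and not the manuscript's implementation. It assumes a Gillespie-style simulation with two reactions: M decays one unit at a time at rate gamma * M, and reorientations occur at rate alpha * M / m0 + beta. The function name `simulate_worm`, the exact rate expressions, and all parameter values are illustrative assumptions; the manuscript's Equations 1-3 remain the authoritative definition.

```python
import random


def simulate_worm(m0, alpha, beta, gamma, t_end, rng):
    """Return the reorientation times of one simulated animal.

    Assumed reactions (illustrative, not the manuscript's exact equations):
      M -> M - 1            at rate gamma * M
      reorientation event   at rate alpha * M / m0 + beta
    """
    t, m = 0.0, m0
    events = []
    while True:
        decay_rate = gamma * m
        turn_rate = alpha * m / m0 + beta
        total = decay_rate + turn_rate
        if total <= 0.0:
            break  # nothing left that can happen
        t += rng.expovariate(total)  # waiting time to the next reaction
        if t >= t_end:
            break
        # Choose which reaction fired, in proportion to its rate.
        if rng.random() * total < decay_rate:
            m -= 1
        else:
            events.append(t)
    return events


# Hypothetical parameter values, chosen only to exercise the sketch.
rng = random.Random(0)
continuous = simulate_worm(1000, 1.5, 0.1, 0.1, 40.0, rng)
switch_like = simulate_worm(1, 1.5, 0.1, 0.1, 40.0, rng)
print(len(continuous), len(switch_like))
```

      Under these assumptions, m0 = 1000 makes the decay of M close to continuous, whereas m0 = 1 reduces the sketch to a single switch from rate alpha + beta to rate beta, which is the two-state case discussed above.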

      (5) Further, as currently stated in the paper, Equations 1-3 are only for the mean rate of events. However, the expectation value is not a complete description of a stochastic system. Instead, the authors need to formulate the equations for the probability of events, from which they can extract any moment (they write something in Figure 2a, but the notation there is unclear, and this needs to be incorporated here).

      Thank you, yes please see our response to Reviewer #1. We have clarified the text in Lines 105-190.

      (6) Equations 1-3 have three constants (alpha and gamma which were fit to the data, and M0 which was presumably set to 1000). How does the choice of M0 affect the results?

      Thank you, this is a good question. We address this in lines 253-296. Briefly, the choice of M<sub>0</sub> does not have a strong effect on the results, unless we set M<sub>0</sub> = 1, which, by definition, creates a two-state model. This model was significantly different from the experimental data, relative to the other models (Figure 3c).

      (7) M decays to near 0 over 40 minutes, abolishing omega turns by the end of the simulations. Are omega turns entirely abolished in worms after 30-40 minutes off food? How do the authors reconcile this decay with the leveling of the turning rate in Figure 1a?

      Yes, Reviewer #1 recommended adding a baseline reorientation rate which we did for all models (Equation 2). However, we should also note that in Klein et al they observed a continuous decay over 50 minutes. Though realistically, it is likely not plausible that worms will produce infinitely long runs at long time points.

      (8) The fit given in Figure 2b does not look convincing. No statistical test was used to compare the two functions (empirical and fit). No error bars were given (to either). These should be added. In the discussion, the authors explain the discrepancy away as experimental limitations. This is not unreasonable, but on the flip side, makes the argument inconclusive. If the authors could model and simulate these limitations, and show that they account for the discrepancies with the data, the model would be much more compelling.

      To do this, I would imagine that the authors would need to take the output of their model (lists of turning times) and convert them into simulated trajectories over time. These trajectories could be used to detect boundary events (for a given size of arena), collisions between individuals, etc. in their simulations and to see their effects on the turn statistics.

      Thank you, we have added dashed lines to indicate standard deviation to Figures 2b and 3a. After running the models several times, we found that some of the small discrepancies noted (like s<sub>1</sub>-s<sub>2</sub> < 0 for experiments but not the model), were spurious due to these data points being <1% of the data, so we cut this from the text. To compare how similar the continuous (M<sub>0</sub> > 1) and discrete (M<sub>0</sub> = 1) models were to the experimental data, we calculated a Jensen-Shannon distance for the models, and found that the discrete model was significantly more dissimilar to the experimental data than the continuous models (Lines 289-296, Figure 3c).
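
      For readers unfamiliar with the metric named in this response, a minimal sketch of the Jensen-Shannon distance between two binned distributions follows. It is an assumption-laden illustration: the function name `js_distance`, the base-2 logarithm, and the example counts are hypothetical, and the sketch does not reproduce how the authors binned their data or assessed significance.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions.

    p and q are non-negative weights over the same bins; they are
    normalized here, so raw histogram counts are acceptable.
    """
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence; empty bins of a contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))


# Made-up histogram counts: a unimodal shape and a bimodal shape.
unimodal = [1, 4, 9, 12, 9, 4, 1]
bimodal = [8, 10, 2, 0, 2, 10, 8]
print(js_distance(unimodal, bimodal))
```

      The distance is the square root of the Jensen-Shannon divergence; with base-2 logarithms it is bounded between 0 (identical distributions) and 1 (distributions with no overlapping bins).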

      (9) The other figures similarly lack any statistical tests and by eye, they do not look convincing. The exception is the 6 anecdotal examples in Figure 2e. Those anecdotal examples match remarkably closely, almost suspiciously so. I'm not sure I understood this though - the caption refers to "different" models of M decay (and at least one of the 6 examples clearly shows a much shallower exponential). If different M models are allowed for each animal, this is no longer parsimonious. Are the results in Figure 2d for a single M model? Can Figure 2e explain the data with a single (stochastic) M model?

      We certainly don’t want the panels in Figure 2e to be suspicious! These comparisons were drawn from calculating the correlations between all model traces and all experimental traces, and then choosing the top hits. Every time we run the simulation, we arrive at a different set of examples. Since it was recommended we add a baseline rate, these examples will be a completely different set when we run the simulation again.

      We apologize for the confusion regarding M. Since the worms do not all start out with identical reorientation rates, we drew the initial M value from a distribution centered on M<sub>0</sub> to match the initial distribution of observed experimental rates (Lines 206-214). However, the decay in M (γ), as well as α and β, are the same for all in silico animals.

      (10) The left axes of Figure 2e should be reverted to cumulative counts (without the normalization).

      Thank you, we made this change.

      (11) The authors give an alternative model of a Levy flight, but do not give the obvious alternative models:

      a) the 1-state model in which P(t) = alpha exp (-gamma t) dt (i.e. a single stochastic process, without a hidden M, collapsing equations 1-3 into a single equation).

      b) the originally proposed 2-state model (with 3 parameters, a high turn rate, a low turn rate, and the local-to-global search transition time, which can be taken from the data, or sampled from the empirical probability distributions). Why not? The former seems necessary to justify the more complicated 2-process model, and the latter seems necessary since it's the model they are trying to replace. Including these two controls would allow them to compare the number of free parameters as well as the model results. I am also surprised by the Levy model since Levy is a family of models. How were the parameters of the Levy walk chosen?
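
      As an illustration of alternative (a), the sketch below samples event times from a single Poisson process whose rate is alpha * exp(-gamma * t), with no hidden M. It uses time rescaling: unit-rate arrivals are mapped through the inverse of the cumulative rate (alpha / gamma) * (1 - exp(-gamma * t)). The function names and parameter values are hypothetical, and this is one possible reading of the reviewer's formula rather than a model taken from the manuscript.

```python
import math
import random


def expected_count(alpha, gamma, t_end):
    """Integral of alpha * exp(-gamma * t) from 0 to t_end."""
    return alpha / gamma * (1.0 - math.exp(-gamma * t_end))


def sample_one_state(alpha, gamma, t_end, rng):
    """Event times of a Poisson process with rate alpha * exp(-gamma * t)."""
    limit = expected_count(alpha, gamma, t_end)
    events = []
    s = rng.expovariate(1.0)  # arrivals of a unit-rate Poisson process
    while s < limit:
        # Invert the cumulative rate to map s back to real time.
        events.append(-math.log(1.0 - gamma * s / alpha) / gamma)
        s += rng.expovariate(1.0)
    return events


# Hypothetical values: initial rate 3 per minute, decay 0.1 per minute.
rng = random.Random(0)
print(len(sample_one_state(3.0, 0.1, 40.0, rng)), expected_count(3.0, 0.1, 40.0))
```

      Because the mean number of events has the closed form used in `expected_count`, the first moment of this one-state model can be checked against simulation directly.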

      Thank you, we removed this section completely, as it is tangential to the main point of the paper.

      (12) One point that is entirely missing in the discussion is the individuality of worms. It is by now well known that individual animals have individual behaviors. Some are slow/fast, and similarly, their turn rates vary. This makes this problem even harder. Combined with the tiny number of events concerned (typically 20-40 per experiment), it seems daunting to determine the underlying model from behavioral statistics alone.

      Thank you, yes we should have been more explicit in the reasoning behind drawing the initial M from a distribution (response to comment #9). We assume that not every worm starts out with the same reorientation rate, but that some start out fast (high M) and some start out slow (low M). However, we do assume M decays with the same kinetics, which seems sufficient to produce the observed phenomena. Multiple decay rates are not needed to replicate the experimental data.

      (13) That said, it's well-known which neurons underpin the suppression of turning events (starting already with Gray et al 2005, which, strangely, was not cited here). Some discussion of the neuronal predictions for each of the two (or more) models would be appropriate.

      Thank you, yes we will add Gray et al, but also the more detailed response to Reviewer #2 (Lines 319-359 of manuscript).

      (14) An additional point is the reliance entirely on simulations. A rigorous formulation (of the probability distribution rather than just the mean) should be analytically tractable (at least for the first moment, and possibly higher moments). If higher moments are not obtainable analytically, then the equations should be numerically integrable. It seems strange not to do this.

      Thank you for suggesting this. For the Levy section (which we cut) this would have been an improvement. However, since the distributions of slope differences and transition times are based on a recursive algorithm, rather than an analytical formulation, we decided to use the Jensen-Shannon divergence to compare distributions (Lines 272-296, Figure 3c) since this is a parameter-free approach.

      In summary, while sample simulations do nicely match the examples in the data (of discontinuous vs continuous turning rates), this is not sufficient to demonstrate that the transition from ARS to dispersion in C. elegans is, in fact, likely to be a single 'state', or this (eq 1-3) single state. Of course, the model can be made more complicated to better match the data, but the approach of the authors, seeking an elegant and parsimonious model, is in principle valid, i.e. avoiding a many-parameter model-fitting exercise.

      As a qualitative exercise, the paper might have some merit. It demonstrates that apparently discrete states can sometimes be artifacts of sampling from smoothly time-changing dynamics. However, as a generic point, this is not novel, and so without the grounding in C. elegans data, is less interesting.

      Thank you, we agree that this is a generic phenomenon, which is partly why we did this. The data from López-Cruz seem to agree in part with Calhoun et al, that claim abrupt transitions occur, and Klein et al, which claim they do not occur. Since the underlying phenomenon is stochastic, we propose the mixed observations of sudden and gradual changes in search strategy are simply the result of a stochastic process, which can produce both phenomena for individual observations. We hope this work can help clarify why sudden changes in search strategy are not consistently observed. We propose a simple hypothesis that there is no change in search strategy. The reorientation rate decays in time, and due to the stochastic nature of this behavior, what appears as a sudden change for individual observations is not due to an underlying decision, but rather the result of a stochastic process.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      1. General Statements

      • This manuscript represents a full revision incorporating all reviewer recommendations; the additional follow-up experiments and expanded analyses will be presented in dedicated subsequent manuscripts.
      • Congenital dyserythropoietic anemia type I (CDA-I) is a rare hereditary disease characterized by ineffective erythropoiesis and mutations in Codanin1 and CDIN1.
      • Our study reveals the structural and functional dynamics of the CDIN1-Codanin1 complex, shedding light on the molecular mechanisms of protein-protein interactions implicated in CDA-I pathology.
      • The main goal of our study was to examine the interaction between CDIN1 and the C‑terminal binding domain of Codanin1 using complementary biophysical approaches.
      • We quantified binding and identified interacting regions of Codanin1 and CDIN1.
      • We found that CDA-I-associated mutations in interacting regions disturb the CDIN1‑Codanin1 complex.
      • We proposed a hypothetical molecular model of CDIN1-Codanin1 role in CDA-I hallmarks development.
      • Our initial studies on BioRxiv (2023) have been cited by leading publications in the field (Jeong, Frater et al. 2025, Sedor and Shao 2025, Nature Communications) and prompted further research on this topic.

      2. Point-by-point description of the revisions

      *Here we provide a point-by-point reply describing the revisions already carried out and included in the transferred manuscript.*

      Reviewer #1 – Evidence, reproducibility and clarity

      This is a rigorous biophysical characterization of a protein-protein interaction relevant to CDA-I disease. The two proteins were purified in an E. coli host but CD and DLS were performed to ensure that the purified protein is well folded. An impressive native protein EMSA was used to show a 1:1 complex. While common for protein-nucleic acid complexes, EMSAs are much more challenging with protein complexes. A higher-running complex, likely a heterotetramer, was implied at higher protein concentrations. These results were supported with SEC-MALS analysis and analytical ultracentrifugation analysis. Thermophoresis and ITC were used to report a nanomolar affinity of the proteins for each other. SEC-SAXS supported the conclusions about stoichiometry and composition inferred from the earlier methods and suggested that the dimerization interface comes from CDIN1. Next HDX-MS was used to identify putative interface residues, which were then mutated in each of the proteins and assessed for binding using coimmunoprecipitation. This study uses at least 10 orthogonal biophysical and/or biochemical methodologies to characterize an important protein-protein interaction and the analysis is clear and so is the writing. I couldn't (reading it once) find any grammatical or other errors in the text or figures. This manuscript is top-quality and suitable for publication.

      Reviewer #1 – Significance

      Such detailed structural and mechanistic studies are greatly lacking in many clinical conditions for which mutations are known (unless they cause cancer, neurodegenerative disease, and so on). We need more such studies on disease topics! This study will be of interest to the hematologic diseases community.

      1. Response – Significance

      We thank Reviewer #1 for the thoughtful and encouraging evaluation of our work. We are particularly grateful for recognizing the significance of studying protein-protein interaction in the context of CDA-I disease, as well as the rigor and clarity of our biophysical and biochemical characterization.

      We appreciate the reviewer's acknowledgment of the challenges associated with native protein EMSAs. We are pleased that our use of multiple orthogonal techniques was recognized as a strength of the study. We are gratified that the comprehensiveness and coherence of our data and the manuscript's clarity were well received.

      We thank the reviewer for noting the broader impact of our findings on the hematologic disease community. As highlighted, there is a pressing need for a mechanistic understanding of non-oncologic, non-neurodegenerative diseases, and our studies address this gap.

      We are honored by the reviewer's endorsement of our manuscript as "top-quality and suitable for publication". We value the reviewer's highly supportive and motivating feedback.

      Reviewer #2 – 1. Evidence, reproducibility and clarity

      This manuscript presents structural and biochemical characterization of the interaction between CDIN1 and the C-terminal domain of Codanin1, shedding light on a complex implicated in Congenital Dyserythropoietic Anemia Type I (CDA-I). While the authors provide valuable structural insights and identify disease-associated mutations that impair CDIN1-Codanin1 binding, I think several important concerns should be addressed to strengthen both the mechanistic claims and their functional relevance.

      Contradiction Between Stoichiometry Models:

      The authors propose that CDIN1 and Codanin1Cterm primarily form a heterodimer in vitro. However, this appears to contradict previous reports indicating a tetra-heteromeric arrangement. Additionally, the homodimerization of CDIN1 seems confusing to me: do the authors suggest it is stable without Codanin1? This seems contrary to findings that CDIN1 is unstable in the absence of Codanin1 (Sedor, S.F., Shao, S. nature comm 2025, Swickley, G., Bloch, Y., Malka, L. et al 2020 BMC Mol and Cell Biol). These inconsistencies raise concerns about whether the observed stoichiometries are physiologically relevant or artifacts of in vitro reconstitution, especially since full-length Codanin1 was not studied.

      2.1 Response – Consistent stoichiometry of Codanin1Cterm

      We thank Reviewer #2 for raising critical points regarding the stoichiometry and physiological relevance of the CDIN1-Codanin1 interaction. The following response clarifies the rationale and interpretation in relation to previous findings.

      Stoichiometry of CDIN1-Codanin1Cterm complex:

      Recent cryo-EM studies of full-length Codanin1 (Jeong, Frater et al. 2025, Sedor and Shao 2025) suggest independent internal dimerization domains (amino acid residues 452-798 and 841-1000) driving homodimer formation, with each Codanin1 monomer binding one CDIN1 via the C-terminal region (amino acid residues 1005-1227), resulting in a tetra-heteromeric complex. Therefore, the complete assembly appears as a dimer of heterodimers in the full-length context.

      In our study, Codanin1 was truncated to retain only the CDIN1-binding C-terminus (1005-1227 amino acid residues), eliminating the homodimerization ability of Codanin1. Hence, in the case of truncated Codanin1Cterm, the minimal complex we observe is a 1:1 heterodimer of CDIN1-Codanin1Cterm, which is fully consistent with the equimolar stoichiometry of CDIN1-Codanin1 complex seen in the full-length structure.

      Stability and oligomeric state of CDIN1 in the absence of Codanin1:

      We concur with the reviewer that Sedor et al. (2025) and Swickley et al. (2020) reported decreased CDIN1 levels in cells lacking Codanin1, implying in vivo dependence of CDIN1 on its Codanin1 partner for stability (Swickley, Bloch et al. 2020, Sedor and Shao 2025). The purified CDIN1 is monodisperse (Supplementary Figure 2D), exhibits thermal stability with a melting temperature of 48 °C (Supplementary Figure 2E), and displays proper folding as indicated by CD measurements (Supplementary Figure 2B). Additionally, SAXS profiles of CDIN1 correspond to AlphaFold predictions (Fig. 2B). Together, our findings indicate that the recombinant CDIN1 forms a stable conformation in vitro without Codanin1. To the best of our knowledge, no previous research has directly identified the endogenous oligomeric states of CDIN1 within a cellular context.

      We fully acknowledge that future analysis of the full-length Codanin1-CDIN1 assembly in a cellular context will be necessary for understanding physiological stoichiometries. As outlined in the General statements, our study focuses on the C-terminus of Codanin1 to describe the binding interface and complex biophysical properties of the CDIN1-Codanin1Cterm complex.

      Reviewer #2 – 2. Unvalidated Functional Claims:

      The manuscript identifies several CDA-I-associated mutations that disrupt CDIN1-Codanin1 interaction. However, the authors do not test how these mutations affect the biological function of the complex, particularly its role in ASF1 sequestration or histone trafficking. Given the central importance of this axis in their disease model, functional validation (e.g., ASF1 localization, histone deposition assays) is necessary to support these mechanistic conclusions.

      2.2 Response – Hypothetical model as discussion merit

      We thank the reviewer for the comment regarding the functional implications of CDA-I-associated mutations and their potential impact on ASF1 sequestration and histone trafficking hypothesized within the Discussion. We fully agree that understanding the downstream biological consequences of disrupted CDIN1-Codanin1 interaction is critical for elucidating the full molecular basis of CDA-I pathogenesis.

      In the Future research directions of the Discussion, we have acknowledged and emphasized the need for follow-up studies using erythroblast cell lines to determine whether specific disease-associated mutations disrupt CDIN1-Codanin1 binding, leading to functional defects relevant to erythropoiesis and nuclear architecture typical for CDA-I disease.

      However, as we respectfully note in General Statements, the main aim of the present study was to provide a rigorous biophysical characterization of the CDIN1-Codanin1Cterm interaction. Proposed cellular experiments, though relevant, are beyond the conceptual scope of the presented studies.

      Reviewer #2 – 3. Speculative and Potentially Contradictory Model:

      The proposed model suggests that CDIN1 competes with ASF1 for Codanin1 binding, thereby indirectly promoting histone delivery to the nucleus. However, emerging data indicate that Codanin1, CDIN1, and ASF1 can form a stable ternary complex, calling into question this competitive binding hypothesis (Sedor, S.F., Shao, S. Nature Comm. 2025). The authors do not acknowledge or discuss these findings, and the model in its current form may therefore be oversimplified or inaccurate.

      2.3 Response – Hypothetical model fully aligned with current knowledge

      We fully acknowledged and discussed in the current manuscript the recent findings demonstrating that Codanin1, CDIN1, and ASF1 can form a ternary complex (Sedor, S.F., Shao, S. Nature Comm. 2025; Jeong, T. K. et al. Nature Comm. 2025). Our revised model was updated accordingly to reflect the collaborative binding of Codanin1, CDIN1, and ASF1, and is presented in alignment with published data.

      While earlier versions of our work published on the BioRxiv server (May 26, 2023) proposed a competitive hypothesis, the current manuscript incorporates recent literature and prior reviewer feedback to offer a refined model. We believe that the updated hypothesis suggests a plausible mechanism for how CDIN1 modulates Codanin1 function, which will be further tested in future cellular studies.

      Reviewer #2 – 4. Significance:

      Overall, the study adds to our structural understanding of CDIN1 and Codanin1 interactions, but the functional interpretations are currently speculative, and in some cases in conflict with existing literature. The manuscript would benefit significantly from addressing these discrepancies, incorporating relevant data on ASF1, and clarifying whether the observed assemblies reflect physiological complexes.

      2.4 Response – Significance

      We thank Reviewer #2 for the constructive feedback. As noted in General Statements, our current manuscript is primarily dedicated to defining the molecular architecture and interactions of the CDIN1–Codanin1Cterm core interface. We agree that follow-up ASF1‑dependent functional assays will be critical to fully validate observed assemblies, but these experiments lie outside the scope of the present study and are ongoing in our laboratory.

      To address the reviewer's concern about possible speculative interpretation, we have:

      • Used cautious language in Results and Discussion to prevent overstatement (e.g., page 31, line 754, “leads” replaced with “may contribute” in the legend of Fig. 4).
      • Described in the Discussion how our results enhance and add understanding to the body of published structural data of CDIN1–Codanin1Cterm.
      • Updated our hypothetical model in Fig. 4 to be fully in line with published data.
      • Clearly stated that the working hypothesis is connected with a subset of CDA-I mutations (p. 31, l. 758-759, “The proposed model represents a working hypothesis relating to a subset of CDA-I mutations and is not currently substantiated by experimental evidence at the cellular level.”)
      • Stated in Future research directions of Discussion that functional validation, including ASF1, will motivate future critical studies, p. 32, l. 771-773: “The ability of Codanin1 to interact with both CDIN1 and ASF1 motivates further investigation of how CDIN1 and ASF1 affect the function of full-length Codanin1, which even recent cryo-EM data has not addressed yet.”
      • Highlighted the necessity of complementary in vivo studies in erythroblast cell lines to determine if CDA-I-related mutations in CDIN1-Codanin1 interaction region cause typical CDA-I phenotypes, aiming to clarify the molecular mechanisms of inherited CDA-I anemia. We state in Future research directions in Discussion, p. 32, l. 774-780: “…follow-up research utilizing erythroblast model cell lines must be conducted to determine if specific mutations that disrupt CDIN1-Codanin1 binding also affect ASF1 localization and cause a phenotype typical of CDA-I. In future work, additional Codanin1 mutations, including those outside the C-terminal region, should be evaluated to determine how the mutations affect ASF1’s nuclear concentration and subcellular localization. The proposed research directions will provide additional deeper insights into the underlying mechanisms of the molecular origin of inherited anemia CDA-I.”

      We believe that the revisions objectively clarify the significance and the limits of the current work and set the stage for the detailed functional studies to follow.

      Reviewer #3 – Evidence, reproducibility and clarity:

      Congenital Dyserythropoietic Anemia Type I (CDA I) is an autosomal recessive disorder characterized by ineffective erythropoiesis and distinctive nuclear morphology ("Swiss cheese" heterochromatin) in erythroblasts. CDA I is caused by mutations in CDAN1 and CDIN1. Codanin1, encoded by CDAN1, is part of the cytosolic ASF1-H3.1-H4-Importin-4 complex, which regulates histone trafficking to the nucleus. CDIN1 has been shown to bind the C-terminal domain of Codanin-1, but until now, pathogenic mutations had not been directly linked to the disruption of this interaction.

      In this study, the authors used biophysical techniques to characterize the interaction between Codanin-1's C-terminal region (residues 1005-1227) and CDIN1, demonstrating high-affinity, equimolar binding. HDX-MS identified interaction hotspots, and disease-associated mutations in these regions disrupted complex formation. The authors propose that such disruption prevents ASF1 sequestration in the cytoplasm, thereby reducing nuclear histone levels and contributing to the chromatin abnormalities seen in CDA I.

      Major Comments:

      1. Use of Codanin-1 Fragment:

      Most experiments were conducted using only the C-terminal 223 amino acids of Codanin-1. While this region is known to bind CDIN1, it is unclear whether its conformation is maintained in the context of the full-length protein. This could affect binding properties and structural interpretations. The authors should discuss how structural differences between the isolated C-terminus and the full-length Codanin-1 may influence the conclusions.

      Response of authors #3

      3.1 Response: Use of Codanin-1 Fragment as binding part to CDIN1

      We thank the reviewer for the important observation regarding the use of the C-terminal fragment of Codanin1. As noted in the manuscript (e.g., p. 30, line 721 and p. 32, line 761), we fully acknowledge that the truncation of Codanin1 may influence its conformational dynamics or contextual folding relative to the full-length protein.

      However, several lines of evidence suggest that the C-terminal 223 amino acid residues—responsible for CDIN1 binding—are structurally autonomous and have minimal intramolecular contacts with upstream regions. Published cryo-EM and biochemical data (Jeong, Frater et al. 2025, Sedor and Shao 2025), in conjunction with AlphaFold structural predictions (Fig. 2D) and our co-immunoprecipitation assays (Fig. 3F), consistently support a model wherein the CDIN1-binding region is flexible and spatially isolated from the core structural domains of Codanin1. Additionally, results from our co-immunoprecipitation assay (Fig. 3F) indicate that full-length Codanin1 and truncated Codanin1Cterm interact with CDIN1 similarly, further supporting the isolated manner of the C-terminal fragment. The available data together imply that the C-terminal fragment used in our study retains its native conformation and binding properties when expressed independently.

      While our findings are confined to the interaction domain and do not reflect full-length Codanin1’s architecture, we believe the use of the C-terminal minimal fragment of Codanin1 enables precise dissection of the CDIN1-binding interface and yields mechanistic insights without introducing significant structural artifacts.

      We agree with the reviewer that future work incorporating full-length Codanin1, especially in a cellular context, will be instrumental to fully characterize higher-order assembly and regulatory functions.

      Reviewer #3 – 2. Graphical Abstract and Domain Independence:

      The graphical abstract presents the Codanin-1 C-terminus as an independent domain, but no direct evidence is provided to support its structural autonomy in vivo.

      The authors should clarify whether the C-terminal region functions as a distinct domain in the context of the full-length protein.

      3.2 Response – Independent C-terminal domain

      We thank the reviewer for bringing up the question of the independence of the C-terminal domain. Although direct in vivo proof of C-terminal autonomy is not yet available, published cryo-EM structures of full-length Codanin1, our biophysical characterization, and AlphaFold models all consistently indicate that the C-terminal 223 amino acid residues of Codanin1 form a structurally independent binding module. In the graphical abstract, we illustrated the C‑terminal domain as a loosely connected part of Codanin1 to highlight its independence and to emphasize the specific focus of our studies.

      To articulate limitations of our studies focused on the C-terminal part of Codanin1, we stated in the Functional implications of CDA-I-related mutations in the Discussion, p. 30, l. 721-724: “However, our measurements do not exclude the possible role of the disordered regions in full-length Codanin1. For example, CDIN1 could potentially stabilize full-length Codanin1 by rearranging the disordered regions into a more condensed structure, thereby augmenting the structural stability of Codanin1.”

      Reviewer #3 – 3. Pathogenic Mutations Beyond the Binding Site:

      The study highlights a triplet mutation that impairs CDIN1 binding. However, most CDA I‑associated mutations in CDAN1 are dispersed across the entire protein and may not affect CDIN1 interaction directly.

      The authors should discuss alternative mechanisms by which mutations in other regions of Codanin-1 might cause disease.

      3.3 Response – Pathogenic mutations outside the binding site – alternative mechanisms

      We appreciate the reviewer noting that most CDA-I-associated CDAN1 mutations are outside the CDIN1-Codanin1 binding site and suggesting alternative mechanisms. In the revised Discussion, we added a paragraph on alternative pathogenic models, p. 29, l. 702-713:

      "Our study centers on the CDIN1-binding C-terminus, however, most CDA-I-associated CDAN1 mutations lie elsewhere and probably act through alternative mechanisms. Mutations such as P672L and F868I in the LOBE2 (452-798 amino acid residue) and F868I in the coiled-coil (841-1000 amino acid residue) domains may disturb Codanin1 homodimerization and higher-order complex assembly, directly affecting ASF1 sequestration (Jeong, T. K. et al. Nature Comm. 2025). Other mutant variants may also interfere with ASF1 sequestration, nuclear targeting, or chromatin-remodeling functions, while destabilizing mutations may induce misfolding and proteasomal degradation. Moreover, CDA-I-associated mutations, such as R714W and R1042W, might compromise the interaction between Codanin1 and ASF1 (Ask, Jasencakova et al. 2012). Collectively, the complementary alternative pathogenic mechanisms associated with Codanin1 mutations in distal regions and mutations in CDIN1‑binding C-terminus of Codanin1 may contribute to erythroid dysfunction in CDA-I."

      Reviewer #3 – 4. Contradictory Functional Models:

      Ask et al. (EMBO J, 2012) reported that Codanin-1 depletion increases nuclear ASF1 and accelerates DNA replication. This contrasts with the current hypothesis that disruption of the Codanin-1/CDIN1 complex reduces nuclear ASF1.

      The authors should attempt to reconcile this apparent contradiction, possibly by proposing a context-specific or dual-function model for Codanin-1 in histone trafficking.

      3.4 Response – Clarified explanation of hypothetical functional model

      We thank the reviewer for raising this point, which improved the clarity of our work. There is no real discrepancy between Ask et al. and our findings; both agree that Codanin1 restrains ASF1 in the cytoplasm. Ask et al. examined the complete loss of Codanin1, which abolishes cytoplasmic ASF1 sequestration and thus leads to maximal nuclear accumulation. We suggest the CDA-I-associated mutations selectively disrupt the CDIN1-Codanin1 interface, releasing ASF1 from the cytoplasm into the nucleus.

      To enhance clarity, we now state in the legend of Figure 4 describing the hypothesis (p. 31, l. 752-753): "…CDA-I-associated mutations prevent CDIN1-Codanin1 complex formation, thus prevent ASF1 sequestration to cytoplasm; ASF1 remains accumulated in nucleus."

      Reviewer #3 – 5. Conclusions and Claims:

      The proposed model of CDA I pathogenesis (Fig. 4) is plausible but not yet fully supported by the available data. The authors suggest that disruption of the Codanin-1/CDIN1 interaction leads to nuclear histone depletion, but this has not been experimentally confirmed.

      Claims about the general pathogenesis of CDA I should be clearly qualified as hypothetical and applicable to a subset of mutations. The presence and localization of ASF1 in the nucleus following disruption of the Codanin-1/CDIN1 complex should be tested experimentally.

      3.5 Response – Tempered conclusions and claims:

      We thank the reviewer for underscoring the need to temper our conclusions and to distinguish hypotheses from available results. We fully agree that our Fig. 4 model—linking disruption of the Codanin1-CDIN1 interface to nuclear histone imbalance—remains a working hypothesis, currently supported by indirect biochemical and structural data.

      Accordingly, we have:

      • Revised the text to explicitly state that this model is hypothetical and pertains to a subset of CDA-I-associated CDAN1 mutations. Specifically, we:

      • Added to the last paragraph of the section Functional implications of CDA-I-related mutations in Discussion (p. 31, l. 744-749): “In considering functional implications of our findings within available data, it is essential to qualify that mechanistic claims regarding the general pathogenesis of CDA-I remain hypothetical and are restricted to a specific subset of mutations. Furthermore, direct experimental validation, such as immunolocalization or live-cell imaging, to assess ASF1’s nuclear presence and distribution following disruption of the CDIN1-Codanin1 complex is required to substantiate the proposed model.”

      • Included in the legend of Fig. 4: “The proposed model represents a working hypothesis relating to a subset of CDA-I mutations and is not currently substantiated by experimental evidence at the cellular level.”
      • Replaced any associated definitive language (e.g., “leads to”) with qualified phrasing (e.g., “may contribute to”) in the legend of Fig. 4.
      • Clarified in the Discussion that direct measurement of nuclear ASF1 redistribution and histone levels following interface disruption has not yet been performed. Specifically, we added to the section Functional implications of CDA-I-related mutations in Discussion (p. 30, l. 734-735): “It should be noted, however, that direct quantification of nuclear ASF1 redistribution and histone levels after CDIN1-Codanin1 disruption has not yet been conducted.”

      Although experimental verification of nuclear ASF1 localization upon CDIN1-Codanin1 complex disruption falls beyond the current manuscript’s scope, we acknowledge its importance and have emphasized the need for such studies in future work within the Future research directions of the Discussion. Specifically, we concluded by stating (p. 32, l. 774-776): “Finally, follow‑up research utilizing erythroblast model cell lines must be conducted to determine if specific mutations that disrupt CDIN1-Codanin1 binding, also affect ASF1 localization and cause a phenotype typical of CDA-I.”

      Reviewer #3 – 6. Broader Mutation Analysis and ASF1 Localization:

      To strengthen the link between Codanin-1/CDIN1 disruption and disease pathogenesis, it would be important to test the effects of additional CDAN1 mutations, including those outside the C-terminal region. Similarly, the impact on ASF1 nuclear concentration and localization should be directly assessed. These experiments would significantly bolster the central hypothesis. If feasible, they should be pursued or at least acknowledged as important future directions.

      3.6 Response – Broader mutation analysis and ASF1 localization in future directions

      We thank Reviewer #3 for emphasizing the value of a broader mutation survey and direct ASF1 localization studies. As noted above, our current manuscript is centered on delineating the molecular architecture of the CDIN1-Codanin1Cterm core interface; comprehensive mutational analyses outside the C-terminal binding region and ASF1-dependent functional assays will be critical to extend these findings but fall beyond the scope of the present work and will be the objective of our follow-up studies. To address the reviewer’s concern, we have:

      • Expanded the Future Directions section to specify that additional CDA-I-linked CDAN1 variants, including non-C-terminal mutations, and quantitative assessments of ASF1 nuclear localization will be the subject of ongoing and planned investigations. Specifically, we added (p. 32, l. 776-778): “In future work, additional Codanin1 mutations, including those outside the C-terminal region, should be evaluated to determine how the mutations affect ASF1’s nuclear concentration and subcellular localization.”

      • Emphasized the need for complementary in vivo validation in erythroblast models to confirm whether the disturbance of CDIN1-Codanin1 binding recapitulates CDA-I phenotypes. We acknowledged the need for cell-line studies in future work within the Future research directions of Discussion (p. 32, l. 774-776): “Finally, follow-up research utilizing erythroblast model cell lines must be conducted to determine if specific mutations that disrupt CDIN1-Codanin1 binding, also affect ASF1 localization and cause a phenotype typical of CDA-I.”

      We believe these changes more precisely delimit the scope and significance of the current study while laying out a clear roadmap for the essential follow-up experiments.

      Reviewer #3 – 7. Rigor and Presentation and Cross-commenting

      Minor Comments:

      • Methods and Reproducibility:

      The experimental methods are well described, and the results appear reproducible.

      • Presentation:

      The text and figures are clear and well organized.

      Referee Cross-commenting

      I agree with reviewer 1 that the paper presents a detailed structural study of the Codanin-1 and CDIN1 proteins. However, as reviewer 2 notes, functional studies are missing, and therefore the hypothesis regarding the pathogenesis of CDA I is speculative, especially with no studies regarding ASF1.

      3.7 Response – Rigor and Presentation and Cross-commenting:

      We thank the reviewers for their positive appraisal of our results' reproducibility, presentation, and method descriptions. We also appreciate the cross-comment that, while our structural analysis of the CDIN1-Codanin1 complex is thorough, functional validation, particularly regarding ASF1, remains to be addressed.

      As outlined above, we have revised the manuscript to:

      • Emphasize that pathogenic hypotheses drawn from structural data are provisional (refer to Responses 2.2, 2.3, and 3.5).
      • Include follow-up studies for ASF1 localization assays and broader mutation profiling in our Future Directions (refer to Responses 2.4, 3.5, 3.6).
      • Integrate cautious language throughout to clearly delineate verified findings from model-based speculation (refer to Responses 2.4, 3.5, 3.6).

      The implemented adjustments ensure that the current work is positioned as a detailed structural and interaction foundation, upon which the essential functional studies will build. We believe that all extensions and clarifications fully satisfy the reviewers’ collective recommendations.

      Reviewer #3 – Significance:

      Nature and Significance of the Advance:

      This study extends prior work (e.g., Swickley et al., BMC Mol Cell Biol 2020; Shroff et al., Biochem J 2020) on Codanin-1/CDIN1 interaction by applying high-resolution biophysical techniques to identify mutations that disrupt this complex. It provides a plausible cellular mechanism by which specific mutations may lead to CDA I through impaired histone trafficking.

      Nevertheless, a key question remains: How do mutations outside the Codanin-1 C-terminus contribute to the pathology?

      3.8 Response – Significance:

      • We thank Reviewer #3 for this important point. Although our work specifically dissects the C-terminal CDIN1-binding domain of Codanin1, we fully acknowledge that CDA-I-associated mutations throughout Codanin1 may operate via additional mechanisms. To address the additional mechanisms, we have added a new paragraph describing other possible pathogenic models to the Discussion (please refer to Response 3.3).
      • We also fully acknowledged the need for systematic functional assays of non-C-terminal mutations and their impact on ASF1 localization (please refer to Response 3.6).
      • We revised the text to clarify how mutations beyond the C-terminus may contribute to CDA-I pathogenesis and present the significance of our current structural analyses, biophysical characterizations, and molecular insights as a foundation for future research (please refer to Response 3.6).

      Audience:

      • Molecular and cellular biologists investigating nuclear-cytoplasmic trafficking mechanisms

      • Hematologists and geneticists studying rare red cell disorders
      • Clinicians managing CDA I patients and researchers exploring targeted therapies

      Reviewer Expertise:

      Pediatric hematologist with over 20 years of research experience in CDA I, including the initial identification of CDAN1 and the elucidation of Codanin-1's role in embryonic erythropoiesis. Not a specialist in the biophysical techniques used in this study.

      References

      Ask, K., Z. Jasencakova, P. Menard, Y. Feng, G. Almouzni and A. Groth (2012). "Codanin-1, mutated in the anaemic disease CDAI, regulates Asf1 function in S-phase histone supply." The EMBO Journal 31(8): 2013–2023.

      Jeong, T.-K., R. C. M. Frater, J. Yoon, A. Groth and J.-J. Song (2025). "CODANIN-1 sequesters ASF1 by using a histone H3 mimic helix to regulate the histone supply." Nature Communications 16(1): 2181.

      Sedor, S. F. and S. Shao (2025). "Mechanism of ASF1 engagement by CDAN1." Nature Communications 16(1): 2599.

      Swickley, G., Y. Bloch, L. Malka, A. Meiri, S. Noy-Lotan, A. Yanai, H. Tamary and B. Motro (2020). "Characterization of the interactions between Codanin-1 and C15Orf41, two proteins implicated in congenital dyserythropoietic anemia type I disease." BMC Molecular and Cell Biology 21(1).

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      I would like to thank the reviewers for their comments and interest in the manuscript and the study.

      Reviewer #1

      1. I would assume that there are RNA-seq and/or ChIP-seq data out there produced after knockdown of one or more of these DBPs that show directional positioning.

      The directional positioning of CTCF-binding sites at chromatin interaction sites was analyzed by a CRISPR experiment (Guo Y et al. Cell 2015). Our machine learning and statistical analysis showed the same directional bias of the CTCF-binding and RAD21-binding motif sequences at chromatin interaction sites as the experimental analysis of Guo Y et al. (lines 229-253, Figure 3b, c, d and Table 1). Since CTCF is involved in different biological functions (Braccioli L et al. Essays Biochem. 2019 ResearchGate webpage), the directional bias may be weaker when all binding sites are considered rather than only those at chromatin interaction sites (lines 68-73). In our study, we investigated the DNA-binding sites of proteins using the ChIP-seq data of DNA-binding proteins and DNase-seq data. The computational analysis also confirmed that the DNA-binding sites of SMC3 and RAD21, which tend to be found in chromatin loops with CTCF, showed the same directional bias as CTCF.

      2. Figure 6 should be expanded to incorporate analysis of DBPs not overlapping CTCF/cohesin in chromatin interaction data that is important and potentially more interesting than the simple DBPs enrichment reported in the present form of the figure.

      Following the reviewer's advice, I performed the same analysis with the DNA-binding sites that do not overlap with the DNA-binding sites of CTCF and cohesin (RAD21 and SMC3) (Fig. 6 and Supplementary Fig. 4). The result showed the same tendency in the distribution of DNA-binding sites. For some DNA-binding proteins, the peak height on the graph became lower after removing the DNA-binding sites that overlapped with those of CTCF and cohesin. I have added the following sentence on lines 435 and 829: For the insulator-associated DBPs other than CTCF, RAD21, and SMC3, the DNA-binding sites that do not overlap with those of CTCF, RAD21, and SMC3 were used to examine their distribution around interaction sites.
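
      The filtering step described in this response can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's pipeline: the function names, the half-open coordinate convention, and the toy coordinates are all hypothetical and introduced here only to show the idea of discarding sites that overlap CTCF or cohesin sites.

```python
from typing import List, Tuple

# Hypothetical site representation: (chromosome, start, end), half-open [start, end)
Site = Tuple[str, int, int]

def overlaps(a: Site, b: Site) -> bool:
    """Return True if two sites share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def remove_overlapping(sites: List[Site], excluded: List[Site]) -> List[Site]:
    """Keep only the sites that overlap none of the excluded sites."""
    return [s for s in sites if not any(overlaps(s, e) for e in excluded)]

# Toy data (assumed for illustration; not taken from the study)
dbp_sites = [("chr1", 100, 120), ("chr1", 500, 520), ("chr2", 100, 120)]
ctcf_cohesin_sites = [("chr1", 110, 130), ("chr2", 300, 320)]

filtered = remove_overlapping(dbp_sites, ctcf_cohesin_sites)
print(filtered)
```

      The quadratic scan is acceptable for a toy example; for genome-scale data a sorted sweep or an interval tree would likely be preferable.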

      3. Critically, I would like to see use of Micro-C/Hi-C data and ChIP-seq from these factors, where insulation scores around their directionally-bound sites show some sort of an effect like that presumed by the authors - and many such datasets are publicly-available and can be put to good use here.

      As suggested by the reviewer, I have added the insulation scores and boundary sites from the 4D Nucleome data portal as tracks in the UCSC Genome Browser. The insulation scores seem to correspond to some extent to the H3K27me3 histone marks from ChIP-seq (Fig. 4a and Supplementary Fig. 3). We found that the DNA-binding sites of the insulator-associated DBPs were statistically more overrepresented in the 5-kb boundary sites than those of other DBPs (Fig. 4d). The direction of DNA-binding sites on the genome can be shown with different colors (e.g. red and green), but the directionality of insulator-associated DNA-binding sites is an overall tendency, and it may be difficult to notice at individual binding sites because it may be weaker than that of CTCF, RAD21, and SMC3, as shown in Table 1 and Supplementary Table 2. We also observed the directional biases of CTCF, RAD21, and SMC3 using Micro-C chromatin interaction data, as expected, but the differences among the four orientations (FR, RF, FF, and RR) were more apparent with CTCF-mediated ChIA-PET chromatin interaction data (lines 287 and 288).

      I found that the CTCF binding sites examined by a wet experiment in the previous study (Guo Y et al. Cell 2015) may not always overlap with the boundary sites of chromatin interactions from the Micro-C assay. The chromatin interaction data do not include all interactions due to the high sequencing cost of the assay, and include fewer long-range interactions due to distance bias. The number of boundary sites may be smaller than that of CTCF binding sites acting as insulators, and/or some of the CTCF binding sites may not be located in the boundary sites. It may be difficult for the boundary location algorithm to identify a short boundary location. Due to these limitations of the chromatin interaction data, I planned to search for insulator-associated DNA-binding proteins without using chromatin interaction data in this study.
      
       I discussed other causes in lines 614-622: Another reason for the difference may be that boundary sites are more closely associated with topologically associated domains (TADs) of chromosome than are insulator sites. Boundary sites are regions identified based on the separation of numerous chromatin interactions. On the other hand, we found that the multiple DNA-binding sites of insulator-associated DNA-binding proteins were located close to each other at insulator sites and were associated with distinct nested and focal chromatin interactions, as reported by Micro-C assay. These interactions may be transient and relatively weak, such as tissue/cell type, conditional or lineage-specific interactions.
      
      Furthermore, I have added the statistical summary of the analysis in lines 372-395 as follows: Overall, among 20,837 DNA-binding sites of the 97 insulator-associated proteins found at insulator sites identified by H3K27me3 histone modification marks (type 1 insulator sites), 1,315 (6%) overlapped with 264 of 17,126 5-kb long boundary sites, and 6,137 (29%) overlapped with 784 of 17,126 25-kb long boundary sites in HFF cells. Among 5,205 DNA-binding sites of the 97 insulator-associated DNA-binding proteins found at insulator sites identified by H3K27me3 histone modification marks and transcribed regions (type 2 insulator sites), 383 (7%) overlapped with 74 of 17,126 5-kb long boundary sites, 1,901 (37%) overlapped with 306 of 17,126 25-kb long boundary sites. Although CTCF-binding sites separate active and repressive domains, the limited number of DNA-binding sites of insulator-associated proteins found at type 1 and 2 insulator sites overlapped boundary sites identified by chromatin interaction data. Furthermore, by analyzing the regulatory regions of genes, the DNA-binding sites of the 97 insulator-associated DNA-binding proteins were found (1) at the type 1 insulator sites (based on H3K27me3 marks) in the regulatory regions of 3,170 genes, (2) at the type 2 insulator sites (based on H3K27me3 marks and gene expression levels) in the regulatory regions of 1,044 genes, and (3) at insulator sites as boundary sites identified by chromatin interaction data in the regulatory regions of 6,275 genes. The boundary sites showed the highest number of overlaps with the DNA-binding sites. Comparing the insulator sites identified by (1) and (3), 1,212 (38%) genes have both types of insulator sites. Comparing the insulator sites between (2) and (3), 389 (37%) genes have both types of insulator sites. From the comparison of insulator and boundary sites, we found that (1) or (2) types of insulator sites overlapped or were close to boundary sites identified by chromatin interaction data.
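
      The percentages quoted in this summary can be re-derived from the stated counts. The short check below is a sketch only: the choice of denominators (20,837 and 5,205 binding sites for the site overlaps, 3,170 and 1,044 genes for the gene comparisons) is an assumption about how the reported figures were computed, and the labels are introduced here for readability.

```python
# (label, overlapping count, assumed total, percentage reported in the text)
reported = [
    ("type 1 sites in 5-kb boundaries", 1315, 20837, 6),
    ("type 1 sites in 25-kb boundaries", 6137, 20837, 29),
    ("type 2 sites in 5-kb boundaries", 383, 5205, 7),
    ("type 2 sites in 25-kb boundaries", 1901, 5205, 37),
    ("genes with type 1 and boundary sites", 1212, 3170, 38),
    ("genes with type 2 and boundary sites", 389, 1044, 37),
]

def percent(part: int, total: int) -> int:
    """Percentage of `part` in `total`, rounded to the nearest integer."""
    return round(100 * part / total)

# Recompute each percentage and compare it with the reported value
computed = [(label, percent(part, total)) for label, part, total, _ in reported]
mismatches = [
    label
    for (label, value), (_, _, _, stated) in zip(computed, reported)
    if value != stated
]

for label, value in computed:
    print(f"{label}: {value}%")
print("mismatches:", mismatches)
```

      Under these assumed denominators, every recomputed percentage agrees with the value stated in the text.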
      

      4. The suggested alternative transcripts function, also highlighted in the manuscript's abstract, is only supported by visual inspection of a few cases for several putative DBPs. I believe this is insufficient to support what looks like one of the major claims of the paper when reading the abstract, and a more quantitative and genome-wide analysis must be adopted, although the authors mention it as just an 'observation'.

      According to the reviewer's comment, I performed a genome-wide analysis of alternative transcripts where the DNA-binding sites of insulator-associated proteins are located near splice sites. The DNA-binding sites of insulator-associated DNA-binding proteins were found within 200 bp centered on splice sites significantly more often than those of the other DNA-binding proteins (Fig. 4e and Table 2). I have added the following sentences on lines 405 - 412: We performed the statistical test to estimate the enrichment of insulator-associated DNA-binding sites compared to the other DNA-binding proteins, and found that the insulator-associated DNA-binding sites were significantly more abundant at splice sites than the DNA-binding sites of the other proteins (Fig 4e and Table 2; Mann‒Whitney U test, p value

      5. Figure 1 serves no purpose in my opinion and can be removed, while figures can generally be improved (e.g., the browser screenshots in Figs 4 and 5) for interpretability from readers outside the immediate research field.

      I believe that the Figure 1 would help researchers in other fields who are not familiar with biological phenomena and functions to understand the study. More explanation has been included in the Figures and legends of Figs. 4 and 5 to help readers outside the immediate research field understand the figures.

      6. Similarly, the text is rather convoluted at places and should be re-approached with more clarity for less specialized readers in mind.

      Reviewer #2's comments would be related to this comment. I have introduced a more detailed explanation of the method in the Results section, as shown in the responses to Reviewer #2's comments.

      Reviewer #2

      1. Introduction, line 95: CTCF appears two times, it seems redundant.

      On lines 91-93, I deleted the latter CTCF from the sentence "We examine the directional bias of DNA-binding sites of CTCF and insulator-associated DBPs, including those of known DBPs such as RAD21 and SMC3".

      2. Introduction, lines 99-103: Please stress better the novelty of the work. What is the main focus? The newly identified DBPs or their binding sites? What are the "novel structural and functional roles of DBPs" mentioned?

      Although CTCF is known to be the main insulator protein in vertebrates, we found that 97 DNA-binding proteins including CTCF and cohesin are associated with insulator sites by modifying and developing a machine learning method to search for insulator-associated DNA-binding proteins. Most of the insulator-associated DNA-binding proteins showed the directional bias of DNA-binding motifs, suggesting that the directional bias is associated with the insulator.

       I have added the sentence in lines 96-99 as follows: Furthermore, statistical testing of the contribution scores between the directional and non-directional DNA-binding sites of insulator-associated DBPs revealed that the directional sites contributed more significantly to the prediction of gene expression levels than the non-directional sites. I have revised the statement in lines 101-110 as follows: To validate these findings, we demonstrate that the DNA-binding sites of the identified insulator-associated DBPs are located within potential insulator sites, and some of the DNA-binding sites in the insulator site are found without the nearby DNA-binding sites of CTCF and cohesin. Homologous and heterologous insulator-insulator pairing interactions are orientation-dependent, as suggested by the insulator-pairing model based on experimental analysis in flies. Our method and analyses contribute to the identification of insulator- and chromatin-associated DNA-binding sites that influence EPIs and reveal novel functional roles and molecular mechanisms of DBPs associated with transcriptional condensation, phase separation and transcriptional regulation.
      

      3. Results, line 111: How do the SNPs come into the procedure? From the figures it seems the input is ChIP-seq peaks of DNBPs around the TSS.

      On lines 121-124, to explain the procedure for the SNP of an eQTL, I have added the sentence in the Methods: "If a DNA-binding site was located within a 100-bp region around a single-nucleotide polymorphism (SNP) of an eQTL, we assumed that the DNA-binding proteins regulated the expression of the transcript corresponding to the eQTL".
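      The association rule quoted above can be sketched as a simple interval test. This is an illustration only: the function name `site_near_snp` is hypothetical, and it assumes that the "100-bp region around" a SNP means a window centered on the SNP (50 bp on each side) and that coordinates are closed intervals. The manuscript may define the window differently.

```python
# Sketch of the eQTL association rule under stated assumptions:
# a binding site is linked to an eQTL transcript if it overlaps a
# window of `window` bp centered on the eQTL SNP position.

def site_near_snp(site_start: int, site_end: int, snp_pos: int,
                  window: int = 100) -> bool:
    """Return True if [site_start, site_end] overlaps the SNP window."""
    half = window // 2
    window_start = snp_pos - half
    window_end = snp_pos + half
    # Two closed intervals overlap when each starts before the other ends.
    return site_start <= window_end and site_end >= window_start

# A site ending 40 bp upstream of the SNP falls inside the window;
# a site ending 80 bp upstream does not.
print(site_near_snp(1000, 1020, 1060))
print(site_near_snp(1000, 1020, 1100))
```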

      4. Again, are those SNPs coming from the different cell lines? Or are they from individuals w.r.t some reference genome? I suggest a general restructuring of this part to let the reader understand more easily. One option could be simplifying the details here or alternatively including all the necessary details.

      On line 119, I have included the explanation of the eQTL dataset of GTEx v8 as follows: " The eQTL data were derived from the GTEx v8 dataset, after quality control, consisting of 838 donors and 17,382 samples from 52 tissues and two cell lines". On lines 681 and 865, I have added the filename of the eQTL data "(GTEx_Analysis_v8_eQTL.tar)".

      5. Figure 1: panel a and b are misleading. Is the matrix in panel a equivalent to the matrix in panel b? If not please clarify why. Maybe in b it is included the info about the SNPs? And if yes, again, what is then difference with a.

      The reviewer presumably refers to Figure 2, not Figure 1. If so, the matrices in panels a and b in Figure 2 are equivalent. I have shown this in the figure: the matrix in panel b is the matrix in panel a rotated 90 degrees to the right. The green boxes in the matrix show the regions with the ChIP-seq peak of a DNA-binding protein overlapping with a SNP of an eQTL. I used eQTL data to associate a gene with a ChIP-seq peak that was more than 2 kb upstream and 1 kb downstream of a transcriptional start site of a gene. For each gene, the matrix was produced and the gene expression levels in cells were learned and predicted using the deep learning method. I have added the following sentences to explain the method in lines 133 - 139: Through the training, the tool learned to select the binding sites of DNA-binding proteins from ChIP-seq assays that were suitable for predicting gene expression levels in the cell types. The binding sites of a DNA-binding protein tend to be observed in common across multiple cell and tissue types. Therefore, ChIP-seq data and eQTL data in different cell and tissue types were used as input data for learning, and then the tool selected the data suitable for predicting gene expression levels in the cell types, even if the data were not obtained from the same cell types.

      6. Line 386-388: could the author investigate in more detail this observation? Does it mean that loops driven by other DBPs independent of the known CTCF/Cohesin? Could the author provide examples of chromatin structural data e.g. MicroC?

      As suggested by the reviewer, to help readers understand the observation, I have added Supplementary Fig. S4c to show the distribution of DNA-binding sites of "CTCF, RAD21, and SMC3" and "BACH2, FOS, ATF3, NFE2, and MAFK" around chromatin interaction sites. I have modified the following sentence to indicate the figure on line 501: Although a DNA-binding-site distribution pattern around chromatin interaction sites similar to those of CTCF, RAD21, and SMC3 was observed for DBPs such as BACH2, FOS, ATF3, NFE2, and MAFK, less than 1% of the DNA-binding sites of the latter set of DBPs colocalized with CTCF, RAD21, or SMC3 in a single bin (Fig. S4c).

       In Aljahani A et al. *Nature Communications* 2022, we find that depletion of cohesin causes a subtle reduction in longer-range enhancer-promoter interactions and that CTCF depletion can cause rewiring of regulatory contacts. Together, our data show that loop extrusion is not essential for enhancer-promoter interactions, but contributes to their robustness and specificity and to precise regulation of gene expression. Goel VY et al. *Nature Genetics* 2023 mentioned in the abstract: Microcompartments frequently connect enhancers and promoters and though loss of loop extrusion and inhibition of transcription disrupts some microcompartments, most are largely unaffected. These results suggested that chromatin loops can be driven by other DBPs independent of the known CTCF/Cohesin.
      
      I added the following sentences on lines 569-577: The depletion of cohesin causes a subtle reduction in longer-range enhancer-promoter interactions, and CTCF depletion can cause rewiring of regulatory contacts. Another group reported that enhancer-promoter interactions and transcription are largely maintained upon depletion of CTCF, cohesin, WAPL or YY1. Instead, cohesin depletion decreased transcription factor binding to chromatin. Thus, cohesin may allow transcription factors to find and bind their targets more efficiently. Furthermore, loop extrusion is not essential for enhancer-promoter interactions, but contributes to their robustness and specificity and to precise regulation of gene expression.
      
       The FOXA1 pioneer factor functions as an initial chromatin-binding and chromatin-remodeling factor and has been reported to form biomolecular condensates (Ji D et al. *Molecular Cell* 2024). CTCF has also been found to form transcriptional condensates and to undergo phase separation (Lee R et al. *Nucleic Acids Research* 2022). FOS was found to be an insulator-associated DNA-binding protein in this study and is potentially involved in chromatin remodeling, transcription condensation, and phase separation with the other factors such as BACH2, ATF3, NFE2 and MAFK. I have added the following sentence on line 556: FOXA1 pioneer factor functions as an initial chromatin-binding and chromatin-remodeling factor and has been reported to form biomolecular condensates.
      

      7. In general, how the presented results are related to some models of chromatin architecture, e.g. loop extrusion, in which it is integrated convergent CTCF binding sites?

      Goel VY et al. Nature Genetics 2023 identified highly nested and focal interactions through region capture Micro-C, which resemble fine-scale compartmental interactions and are termed microcompartments. In the section titled "Most microcompartments are robust to loss of loop extrusion," the researchers noted that a small proportion of interactions between CTCF and cohesin-bound sites exhibited significant reductions in strength when cohesin was depleted. In contrast, the majority of microcompartmental interactions remained largely unchanged under cohesin depletion. Our findings indicate that most P-P and E-P interactions, aside from a few CTCF and cohesin-bound enhancers and promoters, are likely facilitated by a compartmentalization mechanism that differs from loop extrusion. We suggest that nested, multiway, and focal microcompartments correspond to small, discrete A-compartments that arise through a compartmentalization process, potentially influenced by factors upstream of RNA Pol II initiation, such as transcription factors, co-factors, or active chromatin states. It follows that if active chromatin regions at microcompartment anchors exhibit selective "stickiness" with one another, they will tend to co-segregate, leading to the development of nested, focal interactions. This microphase separation, driven by preferential interactions among active loci within a block copolymer, may account for the striking interaction patterns we observe.

       The authors of the paper proposed several mechanisms potentially involved in microcompartments. These mechanisms may be involved in looping with insulator function. Another group reported that enhancer-promoter interactions and transcription are largely maintained upon depletion of CTCF, cohesin, WAPL or YY1. Instead, cohesin depletion decreased transcription factor binding to chromatin. Thus, cohesin may allow transcription factors to find and bind their targets more efficiently (Hsieh TS et al. *Nature Genetics* 2022). Among the identified insulator-associated DNA-binding proteins, Maz and MyoD1 form loops without CTCF (Xiao T et al. *Proc Natl Acad Sci USA* 2021 ; Ortabozkoyun H et al. *Nature genetics* 2022 ; Wang R et al. *Nature communications* 2022). I have added the following sentences on lines 571-575: Another group reported that enhancer-promoter interactions and transcription are largely maintained upon depletion of CTCF, cohesin, WAPL or YY1. Instead, cohesin depletion decreased transcription factor binding to chromatin. Thus, cohesin may allow transcription factors to find and bind their targets more efficiently. I have included the following explanation on lines 582-584: Maz and MyoD1 among the identified insulator-associated DNA-binding proteins form loops without CTCF.
      
       As for the directionality of CTCF, if chromatin loop anchors have some structural conformation, as shown in the paper entitled "The structural basis for cohesin-CTCF-anchored loops" (Li Y et al. *Nature* 2020), directional DNA binding would occur similarly to CTCF binding sites. Moreover, cohesin complexes that interact with convergent CTCF sites, that is, the N-terminus of CTCF, might be protected from WAPL, but those that interact with divergent CTCF sites, that is, the C-terminus of CTCF, might not be protected from WAPL, which could release cohesin from chromatin and thus disrupt cohesin-mediated chromatin loops (Davidson IF et al. *Nature Reviews Molecular Cell Biology* 2021). Regarding loop extrusion, the 'loop extrusion' hypothesis is motivated by in vitro observations. An experiment in yeast using cohesin variants that are unable to extrude DNA loops but retain the ability to topologically entrap DNA suggested that in vivo chromatin loops are formed independently of loop extrusion. Instead, transcription promotes loop formation and acts as an extrinsic motor that extends these loops and defines their final positions (Guerin TM et al. *EMBO Journal* 2024). I have added the following sentences on lines 543-547: Cohesin complexes that interact with convergent CTCF sites, that is, the N-terminus of CTCF, might be protected from WAPL, but those that interact with divergent CTCF sites, that is, the C-terminus of CTCF, might not be protected from WAPL, which could release cohesin from chromatin and thus disrupt cohesin-mediated chromatin loops. I have included the following sentences on lines 577-582: The 'loop extrusion' hypothesis is motivated by in vitro observations. An experiment in yeast using cohesin variants that are unable to extrude DNA loops but retain the ability to topologically entrap DNA suggested that in vivo chromatin loops are formed independently of loop extrusion. Instead, transcription promotes loop formation and acts as an extrinsic motor that extends these loops and defines their final positions.
      
       Another model for the regulation of gene expression by insulators is the boundary-pairing (insulator-pairing) model (Bing X et al. *Elife* 2024) (Ke W et al. *Elife* 2024) (Fujioka M et al. *PLoS Genetics* 2016). Molecules bound to insulators physically pair with their partners, either head-to-head or head-to-tail, with different degrees of specificity at the termini of TADs in flies. Although the experiments do not reveal how partners find each other, the mechanism is unlikely to require loop extrusion. Homologous and heterologous insulator-insulator pairing interactions are central to the architectural functions of insulators. The manner of insulator-insulator interactions is orientation-dependent. I have summarized the model on lines 559-567: Other types of chromatin regulation are also expected to be related to the structural interactions of molecules. In the boundary-pairing (insulator-pairing) model, molecules bound to insulators physically pair with their partners, either head-to-head or head-to-tail, with different degrees of specificity at the termini of TADs in flies (Fig. 7). Although the experiments do not reveal how partners find each other, the mechanism is unlikely to require loop extrusion. Homologous and heterologous insulator-insulator pairing interactions are central to the architectural functions of insulators. The manner of insulator-insulator interactions is orientation-dependent.
      

      8. Do the authors think that the identified DBPs could work in that way as well?

      The boundary-pairing (insulator-pairing) model would be applied to the insulator-associated DNA-binding proteins other than CTCF and cohesin that are involved in the loop extrusion mechanism (Bing X et al. Elife 2024) (Ke W et al. Elife 2024) (Fujioka M et al. PLoS Genetics 2016).

       Liquid-liquid phase separation was shown to occur through CTCF-mediated chromatin loops and to act as an insulator (Lee, R et al. *Nucleic Acids Research* 2022). Among the identified insulator-associated DNA-binding proteins, CEBPA has been found to form hubs that colocalize with transcriptional co-activators in a native cell context, which is associated with transcriptional condensate and phase separation (Christou-Kent M et al. *Cell Reports* 2023). The proposed microcompartment mechanisms are also associated with phase separation. Thus, the same or similar mechanisms are potentially associated with the insulator function of the identified DNA-binding proteins. I have included the following information on line 554: CEBPA in the identified insulator-associated DNA-binding proteins was also reported to be involved in transcriptional condensates and phase separation.
      

      9. Also, can the authors comment about the mechanisms those newly identified DBPs mediate contacts by active processes or equilibrium processes?

      Snead WT et al. Molecular Cell 2019 mentioned that protein post-translational modifications (PTMs) facilitate the control of molecular valency and strength of protein-protein interactions. O-GlcNAcylation as a PTM inhibits CTCF binding to chromatin (Tang X et al. Nature Communications 2024). I found that the identified insulator-associated DNA-binding proteins tend to form a cluster at potential insulator sites (Supplementary Fig. 2d). These proteins may interact and actively regulate chromatin interactions, transcriptional condensation, and phase separation by PTMs. I have added the following explanation on lines 584-590: Furthermore, protein post-translational modifications (PTMs) facilitate control over the molecular valency and strength of protein-protein interactions. O-GlcNAcylation as a PTM inhibits CTCF binding to chromatin. We found that the identified insulator-associated DNA-binding proteins tend to form a cluster at potential insulator sites (Fig. 4f and Supplementary Fig. 3c). These proteins may interact and actively regulate chromatin interactions, transcriptional condensation, and phase separation through PTMs.

      10. Can the author provide some real examples along with published structural data (e.g. the mentioned micro-C data) to show the link between protein co-presence, directional bias and contact formation?

      Structural molecular model of cohesin-CTCF-anchored loops has been published by Li Y et al. Nature 2020. The structural conformation of CTCF and cohesin in the loops would be the cause of the directional bias of CTCF binding sites, which I mentioned in lines 539 - 543 as follows: These results suggest that the directional bias of DNA-binding sites of insulator-associated DBPs may be involved in insulator function and chromatin regulation through structural interactions among DBPs, other proteins, DNAs, and RNAs. For example, the N-terminal amino acids of CTCF have been shown to interact with RAD21 in chromatin loops.

       To investigate the principles underlying the architectural functions of insulator-insulator pairing interactions, two insulators, Homie and Nhomie, flanking the *Drosophila even skipped* locus were analyzed. Pairing interactions between the transgene Homie and the eve locus are directional. The head-to-head pairing between the transgene and endogenous Homie matches the pattern of activation (Fujioka M et al. *PLoS Genetics* 2016).
      

      Reviewer #3

      Major Comments:

      1. Some of these TFs do not have specific direct binding to DNA (P300, Cohesin). Since the authors are using binding motifs in their analysis workflow, I would remove those from the analysis.

      When a protein complex binds to DNA, one protein of the complex binds to the DNA directly, and the other proteins may not bind to DNA. However, the DNA motif sequence bound by the protein may be registered as the DNA-binding motif of all the proteins in the complex. The molecular structure of the complex of CTCF and Cohesin showed that both CTCF and Cohesin bind to DNA (Li Y et al. Nature 2020). I think there is a possibility that if the molecular structure of a protein complex becomes available, the previous recognition of the DNA-binding ability of a protein may be changed. Therefore, I searched the Pfam database for 99 insulator-associated DNA-binding proteins identified in this study. I found that 97 are registered as DNA-binding proteins and/or have a known DNA-binding domain, and EP300 and SIN3A do not directly bind to DNA, which was also checked by Google search. I have added the following explanation in line 257 to indicate direct and indirect DNA-binding proteins: Among 99 insulator-associated DBPs, EP300 and SIN3A do not directly interact with DNA, and thus 97 insulator-associated DBPs directly bind to DNA. I have updated the sentence in line 20 of the Abstract as follows: We discovered 97 directional and minor nondirectional motifs in human fibroblast cells that corresponded to 23 DBPs related to insulator function, CTCF, and/or other types of chromosomal transcriptional regulation reported in previous studies.

      2. I am not sure if I understood correctly, by why do the authors consider enhancers spanning 2Mb (200 bins of 10Kb around eSNPs)? This seems wrong. Enhancers are relatively small regions (100bp to 1Kb) and only a very small subset form super enhancers.

      As the reviewer mentioned, I recognize enhancers are relatively small regions. In the paper, I intended to examine further upstream and downstream of promoter regions where enhancers are found. Therefore, I have modified the sentence in lines 929 - 931 of the Fig. 2 legend as follows: Enhancer-gene regulatory interaction regions consist of 200 bins of 10 kbp between -1 Mbp and 1 Mbp region from TSS, not including promoter.
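      The binning described in the revised legend (200 bins of 10 kbp spanning -1 Mbp to +1 Mbp around the TSS) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name `bin_index` is hypothetical, and the promoter is assumed to be the region from 2 kb upstream to 1 kb downstream of the TSS, based on the distances mentioned earlier in this response.

```python
# Map a genomic position to one of 200 enhancer bins around a TSS.
# Assumptions (not confirmed by the manuscript): half-open bins,
# strand-aware offsets, promoter = [-2 kb, +1 kb] around the TSS.

BIN_SIZE = 10_000
FLANK = 1_000_000
N_BINS = 2 * FLANK // BIN_SIZE  # 200 bins in total
PROMOTER_UP = 2_000
PROMOTER_DOWN = 1_000

def bin_index(pos: int, tss: int, strand: str = "+"):
    """Return the bin (0..199) for `pos`, or None if outside or in promoter."""
    # Positive offsets are downstream of the TSS on the gene's strand.
    offset = pos - tss if strand == "+" else tss - pos
    if not -FLANK <= offset < FLANK:
        return None  # outside the 2-Mbp window
    if -PROMOTER_UP <= offset <= PROMOTER_DOWN:
        return None  # promoter region is excluded from enhancer bins
    return (offset + FLANK) // BIN_SIZE

tss = 5_000_000
print(N_BINS)
print(bin_index(tss - 1_000_000, tss))  # most upstream bin
print(bin_index(tss + 999_999, tss))    # most downstream bin
```

      Note that 10-kbp bins are much larger than individual enhancers, so under this scheme a bin marks a region that may contain enhancers and is not itself an enhancer.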

      3. I think the H3K27me3 analysis was very good, but I would have liked to see also constitutive heterochromatin as well, so maybe repeat the analysis for H3K9me3.

      Following the reviewer's advice, I have added the ChIP-seq data of H3K9me3 as a track of the UCSC Genome Browser. The distribution of H3K9me3 signal was different from that of H3K27me3 in some regions. I also found the insulator-associated DNA-binding sites close to the edges of H3K9me3 regions and took some screenshots of the UCSC Genome Browser of the regions around the sites in Supplementary Fig. 3b. I have modified the following sentence on lines 974 - 976 in the legend of Fig. 4: a Distribution of histone modification marks H3K27me3 (green color) and H3K9me3 (turquoise color) and transcript levels (pink color) in upstream and downstream regions of a potential insulator site (light orange color). I have also added the following result on lines 356 - 360: The same analysis was performed using H3K9me3 marks, instead of H3K27me3 (Fig. S3b). We found that the distribution of H3K9me3 signal was different from that of H3K27me3 in some regions, and discovered the insulator-associated DNA-binding sites close to the edges of H3K9me3 regions (Fig. S3b).

      4. I was not sure I understood the analysis in Figure 6. The binding site is with 500bp of the interaction site, but micro-C interactions are at best at 1Kb resolution. They say they chose the centre of the interaction site, but we don't know exactly where there is the actual interaction. Also, it is not clear what they measure. Is it the number of binding sites of a specific or multiple DBP insulator proteins at a specific distance from this midpoint that they recover in all chromatin loops? Maybe I am missing something. This analysis was not very clear.

      The resolution of the Micro-C assay is considered to be 100 bp and above, as the human nucleosome core particle contains 145 bp (and 193 bp with linker) of DNA. However, internucleosomal DNA is cleaved by endonuclease into fragments of multiples of 10 nucleotides (Pospelov VA et al. Nucleic Acids Research 1979). Highly nested focal interactions were observed (Goel VY et al. Nature Genetics 2023). Base pair resolution was reported using Micro Capture-C (Hua P et al. Nature 2021). Sub-kilobase (20 bp resolution) chromatin topology was reported using an MNase-based chromosome conformation capture (3C) approach (Aljahani A et al. Nature Communications 2022). On the other hand, Hi-C data was analyzed at 1 kb resolution (Gu H et al. bioRxiv 2021). If the resolution of Micro-C interactions is at best at 1 kb, the binding sites of a DNA-binding protein will not show a peak around the center of the genomic locations of interaction edges. Each panel shows the number of binding sites of a specific DNA-binding protein at a specific distance from the midpoint of all chromatin interaction edges. I have modified and added the following sentences in lines 593-597: High-resolution chromatin interaction data from a Micro-C assay indicated that most of the predicted insulator-associated DBPs showed DNA-binding-site distribution peaks around chromatin interaction sites, suggesting that these DBPs are involved in chromatin interactions and that the chromatin interaction data has a high degree of resolution. Base pair resolution was reported using Micro Capture-C.
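      The quantity plotted in each panel, as described above, is a count of binding sites by distance from the midpoints of interaction edges. A minimal sketch of such a tally is shown below; the function name `distance_profile`, the 50-bp histogram bins, and the toy coordinates are assumptions for illustration, and the 500-bp limit is taken from the reviewer's comment.

```python
# Tally binding sites of one DNA-binding protein by signed distance
# from the midpoint of each chromatin interaction edge.
from collections import Counter

def distance_profile(site_positions, interaction_midpoints,
                     max_dist=500, bin_width=50):
    """Count sites per distance bin; keys are the lower edge of each bin."""
    counts = Counter()
    for mid in interaction_midpoints:
        for pos in site_positions:
            d = pos - mid  # negative = upstream of the midpoint
            if -max_dist <= d < max_dist:
                counts[(d // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))

# Toy coordinates on a single chromosome (made up for illustration).
sites = [1000, 1010, 1490, 3000]
midpoints = [1000, 3000]
print(distance_profile(sites, midpoints))
```

      A peak in the bins nearest zero would indicate that binding sites concentrate at the interaction midpoints, which is the pattern the response cites as evidence of sub-kilobase resolution.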

      Minor Comments:

      1. PIQ does not consider TF concentration. Other methods do that and show that TF concentration improves predictions (e.g., https://www.biorxiv.org/content/10.1101/2023.07.15.549134v2 or https://pubmed.ncbi.nlm.nih.gov/37486787/). The authors should discuss how that would impact their results.

      The directional bias of CTCF binding sites was identified by ChIA-PET interactions of CTCF binding sites. The analysis of the contribution scores of DNA-binding sites of proteins considering the binding sites of CTCF as an insulator showed the same tendency of directional bias of CTCF binding sites. In the analysis, to remove false-positive predictions of DNA-binding sites, I used the binding sites that overlapped with a ChIP-seq peak of the DNA-binding protein. This result suggests that the DNA-binding sites of CTCF obtained by the current analysis have sufficient quality. Therefore, if the accuracy of prediction of DNA-binding sites is improved, although the number of DNA-binding sites may be different, the overall tendency of the directionality of DNA-binding sites will not change and the results of this study will not change significantly.

       As for the first reference in the reviewer's comment, chromatin interaction data from Micro-C assay does not include all chromatin interactions in a cell or tissue, because it is expensive to cover all interactions. Therefore, it would be difficult to predict all chromatin interactions based on machine learning. As for the second reference in the reviewer's comment, pioneer factors such as FOXA are known to bind to closed chromatin regions, but transcription factors and DNA-binding proteins involved in chromatin interactions and insulators generally bind to open chromatin regions. The search for the DNA-binding motifs is not required in closed chromatin regions.
      

      2. DeepLIFT is a good approach to interpret complex structures of CNN, but is not truly explainable AI. I think the authors should acknowledge this.

      In the DeepLIFT paper, the authors explain that DeepLIFT is a method for decomposing the output prediction of a neural network on a specific input by backpropagating the contributions of all neurons in the network to every feature of the input (Shrikumar A et al. ICML 2017). DeepLIFT compares the activation of each neuron to its 'reference activation' and assigns contribution scores according to the difference. DeepLIFT calculates a metric to measure the difference between an input and the reference of the input.

       Truly explainable AI would be able to find cause and reason, and to make choices and decisions like humans. DeepLIFT does not perform causal inferences. I did not use the term "Explainable AI" in our manuscript, but I briefly explained it in Discussion. I have added the following explanation in lines 623-628: AI (Artificial Intelligence) is considered as a black box, since the reason and cause of prediction are difficult to know. To solve this issue, tools and methods have been developed to know the reason and cause. These technologies are called Explainable AI. DeepLIFT is considered to be a tool for Explainable AI. However, DeepLIFT does not answer the reason and cause for a prediction. It calculates scores representing the contribution of the input data to the prediction.
      
       Furthermore, to improve the readability of the manuscript, I have included the following explanation in lines 159-165: We computed DeepLIFT scores of the input data (i.e., each binding site of the ChIP-seq data of DNA-binding proteins) in the deep learning analysis on gene expression levels. DeepLIFT compares the importance of each input for predicting gene expression levels to its 'reference or background level' and assigns contribution scores according to the difference. DeepLIFT calculates a metric to measure the difference between an input and the reference of the input.
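      The "difference from reference" idea can be illustrated in the simplest setting, a purely linear model, where each input's contribution is its weight times its difference from the reference. This is a toy sketch, not the DeepLIFT library or the authors' network: the function names and numbers are hypothetical, and real DeepLIFT applies additional rules to propagate contributions through nonlinear layers.

```python
# Difference-from-reference contributions for a linear model.
# For y = bias + sum(w_i * x_i), the contribution of input i is
# w_i * (x_i - ref_i), and the contributions sum to y(x) - y(ref).

def predict(weights, bias, x):
    """Linear prediction for input vector x."""
    return bias + sum(w * xi for w, xi in zip(weights, x))

def linear_contributions(weights, x, reference):
    """Per-input contribution scores relative to a reference input."""
    return [w * (xi - ri) for w, xi, ri in zip(weights, x, reference)]

weights = [2.0, -1.0, 0.5]   # hypothetical learned weights
bias = 0.5
x = [1.0, 0.0, 4.0]          # e.g. binding-site signals for one gene
reference = [0.0, 0.0, 0.0]  # assumed background: no binding signal

scores = linear_contributions(weights, x, reference)
delta = predict(weights, bias, x) - predict(weights, bias, reference)
print(scores, sum(scores), delta)
```

      The scores describe how much each input moves the prediction away from the reference output. They do not establish a causal reason for the prediction, which is the limitation acknowledged in the response above.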
      
    1. Author Response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public Review):

      (1) I think the article is a little too immature in its current form. I'd recommend that the authors work on their writing. For example, the objectives of the article are not completely clear to me after reading the manuscript, composed of parts where the authors seem to focus on SGCs, and others where they study "engram" neurons without differentiating the neuronal type (Figure 5). The next version of the manuscript should clearly establish the objectives and sub-aims.

      We now provide clarification for focusing on the labeling status versus the cell types in figure 5. Since figure 5 focuses on inputs to labeled pairs versus labeled-unlabeled pairs, the pairs include mixed groups with GCs and SGCs. Since the question pertains to inputs rather than cell types, we did not specifically distinguish the cell types. This is now explained in the text on page 15: “Note that since the intent was to determine the input correlation depending on labeling status of the cell pairs rather than based on cell type, we do not explicitly consider whether analyzed cell pairs included GCs or SGCs.”

      (2) In addition, some results are not entirely novel (e.g., the disproportionate recruitment as well as the distinctive physiological properties of SGCs), and/or based on correlations that do not fully support the conclusions of the article. In addition to re-writing, I believe that the article would benefit from being enriched with further analyses or even additional experiments before being resubmitted in a more definitive form.

      We now indicate that the data comparing labeled versus unlabeled SGCs are novel. Moreover, we also highlight that (1) recruitment of SGCs has not been previously examined in Barnes Maze or Enriched Environment, (2) our unbiased morphological analysis of SGC recruitment is more robust than subsampling of recorded neurons in prior studies, and (3) our data show that prior studies may have overestimated SGC recruitment to engrams. Thus, the data characterized as “not novel” are essential for appropriate analysis of behaviorally tagged neurons, which is the thrust of our study.

      Reviewer #2 (Public Review):

      (1) The authors conclude that SGCs are disproportionately recruited into cfos assemblies during the enriched environment and Barnes maze task given that their classifier identifies about 30% of labelled cells as SGCs in both cases and that another study using a different method (Save et al., 2019) identified less than 5% of an unbiased sample of granule cells as SGCs. To make matters worse, the classifier deployed here was itself established on a biased sample of GCs patched in the molecular layer and granule cell layer, respectively, at even numbers (Gupta et al., 2020). The first thing the authors would need to show to make the claim that SGCs are disproportionately recruited into memory ensembles is that the fraction of GCs identified as SGCs with their own classifier is significantly lower than 30% using their own method on a random sample of GCs (e.g. through sparse viral labelling). As the authors correctly state in their discussion, morphological samples from patch-clamp studies are problematic for this purpose because of inherent technical issues (i.e. easier access to scattered GCs in the molecular layer).

      We now clarify, on page 9, that a trained investigator classified cell types based on predefined morphological criteria.  No automated classifiers were used to assign cell types in the current study.

      (2) The authors claim that recurrent excitation from SGCs onto GCs or other SGCs is irrelevant because they did not find any connections in 32 simultaneous recordings (plus 63 in the next experiment). Without a demonstration that other connections from SGCs (e.g. onto mossy cells or interneurons) are preserved in their preparation and if so at what rates, it is unclear whether this experiment is indicative of the underlying biology or the quality of the preparation. The argument that spontaneous EPSCs are observed is not very convincing as these could equally well arise from severed axons (in fact we would expect that the vast majority of inputs are not from local excitatory cells). The argument on line 418 that SGCs have compact axons isn't particularly convincing either given that the morphologies from which they were derived were also obtained in slice preparations and would be subject to the same likelihood of severing the axon. Finally, even in paired slice recordings from CA3 pyramidal cells the experimentally detected connectivity rates are only around 1% (Guzman et al., 2016). The authors would need to record from a lot more than 32 pairs (and show convincing positive controls regarding other connections) to make the claim that connectivity is too low to be relevant.

      We have conducted additional control experiments (detailed in response to Editorial comment #3), in which we replicated the results of Stefanelli et al (2016) identifying that optogenetic activation of a focal cohort of ChR2 expressing granule cells leads to robust feedback inhibition of adjacent granule cells. These control experiments demonstrate that the slice system supports the feedback inhibitory circuit which requires GC/SGC to hilar neuron synapses.

      (3) Another troubling sign is the fact that optogenetic GC stimulation rarely ever evokes feedback inhibition onto other cells, which contrasts with both other in vitro (e.g. Braganza et al., 2020) and in vivo (Stefanelli et al., 2016) studies. Without a convincing demonstration that monosynaptic connections between SGCs/GCs and interneurons in both directions are preserved at least at the rates previously described in other slice studies (e.g. Geiger et al., 1997, Neuron, Hainmueller et al., 2014, PNAS, Savanthrapadian et al., 2014, J. Neurosci), the notion that this setting could be closer to naturalistic memory processing than the in vivo experiments in Stefanelli et al. (e.g. lines 443-444) strikes me as odd. In any case, the discussion should clearly state that compromised connectivity in the slice preparation is likely a significant confound when comparing these results.

      We have conducted additional control experiments (detailed in response to Editorial comment #3), in which we replicated the results of Stefanelli et al identifying that optogenetic activation of a focal cohort of ChR2 expressing granule cells leads to robust feedback inhibition of adjacent granule cells. These control experiments demonstrate that the slice system in our studies supports the feedback inhibitory circuit detailed in prior studies. We also clarify that the Stefanelli study labeled random neurons and did not examine natural behavioral engrams, and discuss (on page 20) the correspondence/consistency of our results with that of Braganza et al 2020.

      (4) Probably the most convincing finding in this study is the higher zero-time lag correlation of spontaneous EPSCs in labelled vs. unlabeled pairs. Unfortunately, the fact that the authors use spontaneous EPSCs to begin with, which likely represent a mixture of spontaneous release from severed axons, minis, and coordinated discharge from intact axon segments or entire neurons, makes it very hard to determine the meaning and relevance of this finding. At the bare minimum, the authors need to show if and how strongly differences in baseline spontaneous EPSC rates between different cells and slices are contributing to this phenomenon. I would encourage the authors to use low-intensity extracellular stimulation at multiple foci to determine whether labelled pairs really share higher numbers of input from common presynaptic axons or cells compared to unlabeled pairs as they claim. I would also suggest the authors use conventional Cross correlograms (CCG; see e.g. English et al., 2017, Neuron; Senzai and Buzsaki, 2017, Neuron) instead of their somewhat convoluted interval-selective correlation analysis to illustrate codependencies between the event time series. The references above also illustrate a more robust approach to determining whether peaks in the CCGs exceed chance levels.

      We have included data on sEPSC frequency in the recorded cell pairs (Supplemental Fig 4) and have also conducted additional experiments and present data demonstrating that labeled cells show higher sEPSC frequency and amplitude than corresponding unlabeled cells in both cell types (new Fig 5). We also include data from new experiments to show that over 50% of the sEPSCs represent action potential driven events (Supplemental fig 3).

      We thank the reviewer for the suggestion to explore alternative methods of analyses including CCGs to further strengthen our findings. We have now computed CCGs on the same data set and report that “The dynamics of the cross-correlograms generated from our data sets using previously established methods to evaluate monosynaptic connectivity (Bartho et al., 2004; Senzai and Buzsaki, 2017) parallelled that of the CCP plots (Supplemental Fig. 6) illustrating that the methods similarly capture co-dependencies between event time series. We note, here, that while the CCG and CCP are qualitatively similar, the magnitudes of the peaks were different, due to the sparseness of synaptic events.”
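
      For readers unfamiliar with cross-correlograms, a minimal sketch of the basic computation is given here: a histogram of the time lags between every event in one series and every event in another, restricted to a window around zero lag. This is an illustration only, not the authors' analysis code; the event times (in ms), bin width, and lag window are hypothetical, and the chance-level estimation used in the cited studies is not included.

```python
def cross_correlogram(times_a, times_b, bin_width, max_lag):
    """Count lags (b - a) falling in [-max_lag, max_lag), binned by bin_width.

    Times, bin_width and max_lag share one unit (ms here). Bin i covers
    lags from -max_lag + i * bin_width up to, but not including,
    -max_lag + (i + 1) * bin_width.
    """
    n_bins = int(round(2 * max_lag / bin_width))
    counts = [0] * n_bins
    for ta in times_a:
        for tb in times_b:
            lag = tb - ta
            if -max_lag <= lag < max_lag:
                index = min(int((lag + max_lag) // bin_width), n_bins - 1)
                counts[index] += 1
    return counts


# Hypothetical sEPSC onset times (ms) in two simultaneously recorded cells.
cell_a = [100, 250, 400, 555, 700]
cell_b = [101, 251, 399, 620, 701]

ccg = cross_correlogram(cell_a, cell_b, bin_width=2, max_lag=10)
print(ccg)  # counts cluster in the two bins on either side of zero lag
```

      Sparse event trains yield low counts per bin, which is one reason peak magnitudes can differ between correlation measures even when their time courses agree.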

      (5) Finally, one of the biggest caveats of the study is that the ensemble is labelled a full week before the slice experiment and thereby represents a latent state of a memory rather than encoding consolidation, or recall processes. The authors acknowledge that in the discussion but they should also be mindful of this when discussing other (especially in vivo) studies and comparing their results to these. For instance, Pignatelli et al 2018 show drastic changes in GC engram activity and features driven by behavioral memory recall, so the results of the current study may be very different if slices were cut immediately after memory acquisition (if that was possible with a different labelling strategy), or if animals were re-exposed to the enriched environment right before sacrificing the animal.

      As noted by the reviewer, we fully acknowledge and are cognizant of the concern that slices prepared a week after labeling may not reflect ongoing encoding. Although our data show that labeled cells are reactivated in higher proportion during recall, we have discussed this caveat and will include alternative experimental strategies in the discussion.

      Reviewer #3 (Public Review):

      (1) Engram cells are (i) activated by a learning experience, (ii) physically or chemically modified by the learning experience, and (iii) reactivated by subsequent presentation of the stimuli present at the learning experience (or some portion thereof), resulting in memory retrieval. The authors show that exposure to Barnes Maze and the enriched environment-activated semilunar granule cells and granule cells preferentially in the superior blade of the dentate gyrus, and a significant fraction were reactivated on re-exposure. However, physical or chemical modification by experience was not tested. Experience modifies engram cells, and a common modification is the Hebbian, i.e., potentiation of excitatory synapses. The authors recorded EPSCs from labeled and unlabeled GCs and SGCs. Was there a difference in the amplitude or frequency of EPSCs recorded from labeled and unlabeled cells?

      We have included data on sEPSC frequency in the recorded cell pairs (Supplemental Fig 4) and have also conducted additional experiments and present data demonstrating that labeled cells show higher sEPSC frequency and amplitude than corresponding unlabeled cells in both cell types (new Fig 5). We also include data from new experiments to show that over 50% of the sEPSCs represent action potential driven events (Supplemental fig 3).

      (2) The authors studied five sequential sections, each 250 μm apart across the septotemporal axis, which were immunostained for c-Fos and analyzed for quantification. Is this an adequate sample? Also, it would help to report the dorso-ventral gradient since more engram cells are in the dorsal hippocampus. Slices shown in the figures appear to be from the dorsal hippocampus. 

      We thank the reviewer for the comment. We analyzed sections along the dorsoventral gradient. As explained in the methods, there is considerable animal-to-animal variability in the number of labeled cells, which is why we had to use matched littermate pairs in our experiments. This variability could render it difficult to tease apart dorsoventral differences.

      (3) The authors investigated the role of surround inhibition in establishing memory engram SGCs and GCs. Surprisingly, they found no evidence of lateral inhibition in the slice preparation. Interneurons, e.g., PV interneurons, have large axonal arbors that may be cut during slicing.

      Similarly, the authors point out that some excitatory connections may be lost in slices. This is a limitation of slice electrophysiology.

      We have conducted additional control experiments (detailed in response to Editorial comment #3), in which we replicated the results of Stefanelli et al identifying that optogenetic activation of a focal cohort of ChR2 expressing granule cells leads to robust feedback inhibition of adjacent granule cells. These control experiments demonstrate that the slice system supports the feedback inhibitory circuit detailed in prior studies. 

      We now discuss (page 21) that “the possibility that slice recordings lead to underestimation of feedback dendritic inhibition cannot be ruled out.”

      Reviewer #1 (Recommendations for the authors):

      (1) I struggle to understand the added value of the Barnes Maze data (Figures 1 and S1), since the authors then focus on the EE for practical reasons. In particular, the analysis of mouse performance (presented in supplemental Figure 1) does not seem traditional to me. For example, instead of the 3 classical exploration strategies (i.e., random, serial, direct), the authors describe 6, and assign each of these strategies a score based on vague criteria (why are "long corrected" and "focused research" both assigned a score of 0.5?). Unless I'm mistaken, no other classic parameters are described (e.g., success rate, latency, number of errors). If the authors decide to keep the BM results, I recommend better justifying its existence and adding more details, including in the method section. Otherwise, perhaps they should consider withdrawing it. Even if we had to use two different behavioral contexts, wouldn't it have made sense to use, in addition to the EE, the fear conditioning test, which is widely used in the study of engrams? Under these conditions (Stefanelli et al., 2016), the number of cells recruited after fear conditioning seems sufficient to reproduce the analyses presented in Figures 2-5 and determine whether or not lateral inhibition is dependent on the type of context (Stefanelli and colleagues suggest significant strong lateral inhibition during fear conditioning, whereas the data from Dovek and colleagues suggest quite the opposite after exposure to EE).

      The Barnes Maze data was included to evaluate the DG ensemble activation during a dentate-dependent, non-fear-based behavioral task. This is now introduced and explained in the results. We have now included plots of the primary latency and number of errors in finding the escape hole to confirm the improvement over time (Supplemental Fig. 1). We specifically used the BUNS analysis to evaluate the use of spatial strategy and show that by day 6, the day of tamoxifen induction, the mice are using a spatial strategy for navigation. Our approach to evaluate exploration strategy is based on criteria published in Illouz et al 2016. This is now detailed in the methods on page 25. We hope that the inclusion of the supplemental data and revisions to methods and results address the concerns regarding Barnes Maze experiments.

      Regarding Stefanelli et al., 2016, please note that the study adopted random labeling of neurons using a CaMKII promoter-driven reporter expression which they activated during spatial exploration of fear conditioning behaviors. As such, labeled neurons in the Stefanelli study were NOT behaviorally driven; rather, they were optically activated. This is now clarified in the text. The main drive for our study was to evaluate behaviorally tagged neurons, which is novel, distinct from the Stefanelli study, and, we would argue, more behaviorally realistic and relevant.

      Additionally, the lateral inhibition observed in Stefanelli et al. was in response to activation of GCs labeled by virally mediated CaMKII-driven ChR2 expression. Using a similar labeling approach, new control data presented in Supplemental fig. 3 show that we are fully able to replicate the lateral inhibition observed by Stefanelli et al. These control experiments further suggest that the sparse and distributed GC/SGC ensembles activated during non-aversive behavioral tasks may not be sufficient to elicit robust lateral inhibition, as has been observed when a random population of adjacent neurons is activated. Our findings are also consistent with observations by Braganza et al., 2020. This is now discussed on page 21.

      (2) The authors recorded sEPSCs received by recruited and non-recruited GCs and SGCs after EE exposure. However, it appears that they studied them very little, apart from a temporal correlation analysis (Figure 5). Yet it would be interesting to determine whether or not the four neuronal populations possess different synaptic properties.

      What is the frequency and amplitude of sEPSCs in GCs and SGCs recruited or not after EE exposure? Similarly, can the author record the sIPSCs received by dentate gyrus engram and non-engram GCs and SGCs? If so, what is their frequency and amplitude?

      As suggested by the editorial comment #2, we now include data on the frequency and amplitude of the sEPSCs in GCs and SGCs used in our analysis of figure 5. Given the low numbers of unlabeled SGCs and labeled GCs in our paired recordings (Supplemental Fig. 5), we choose not to use this data set for analysis of cell-type and labeling based differences in EPSC parameters. However, we have previously reported that sIPSC frequency is higher in SGCs than in GCs. Additionally, we have identified that sEPSC frequency in SGCs is higher than in GCs (Dovek et al, in preprint, DOI: 10.1101/2025.03.14.643192).

      To specifically address reviewer concerns, we have recorded EPSCs in a new cohort of labeled and unlabeled GCs and SGCs and present data demonstrating that labeled cells show higher sEPSC frequency and amplitude than corresponding unlabeled cells in both cell types (new Fig 5). These experiments were conducted in TRAP2-tdT labeled cells, which were not stable in cesium based recordings. As such, we deferred the IPSC analysis for later and restricted analysis to sEPSCs for this study.

      (3) Previous data showed that dentate gyrus neurons that are recruited or not in a given context could exhibit distinct morphological characteristics (Pléau et al. 2021) and biochemical content (Penk expression, Erwin et al., 2020). In order to enrich the electrophysiological data presented in Figure 2, could the authors take advantage of the biocytin filling to perform a morphological and biochemical comparison of the different neuronal types (i.e., GCs and SGCs recruited or not after EE)?

      Thank you for this suggestion. Unfortunately, detailed morphometry and biochemical analysis on labeled and unlabeled neurons was not conducted as part of this study as our focus was on circuit differences. In our experience, unless the sections are imaged soon after staining, the sections are suboptimal for detailed morphological reconstruction and analysis. Our ongoing studies suggest that PENK is an activity marker and not a selective marker for SGCs and we are undertaking transcriptomic analysis to identify molecular differences between GCs and SGCs. We respectfully submit that these experiments are outside the scope of this study.

      (4) Figures 3 and 4 show only schematic diagrams and representative data. No quantification is shown. Instead of pie charts showing the identity of each pair (which I find unnecessary), I'll use pie charts representing the % of each pair in which an excitatory or inhibitory drive was recorded (with the corresponding n).

      Please note that we did not observe evoked synaptic potentials in any except one pair precluding the possibility of quantification. However, we submit that it is important for the readers to have information on the number of pairs and the types of pre-post synaptic pairs in which the connections were tested.

      (5) Figure 3: Given that GCs form very few recurrences in non-pathological conditions, it hardly surprises me that they form few or no local glutamatergic connections. In contrast, this result surprises me more for SGCs, whose axons form collaterals in the dentate gyrus granular and molecular layers (Williams et al., 2007; Save et al., 2019). To control the reliability of their conditions, could the authors check whether SGCs do indeed form connections with hilar mossy cells, as has been reported in the past? To test whether this lack of interconnectivity is specific to neurons belonging to the same engram (or not), could the authors test whether or not the stimulation of labeled GCs/SGCs (via membrane depolarization or even optogenetics) generates EPSCs in unlabeled GCs?

      As suggested by the reviewer, we have examined whether widefield optical activation of all labeled neurons including GCs and SGCs leads to EPSCs in unlabeled GCs (63 cells tested). However, we did not observe eEPSCs. This data is presented on page 13 (Fig 4F) in the results and discussed on page 20. Since the wide field stimulation should activate terminals and lead to release even if the axon is severed, our data suggest the glutamatergic drive from SGC to GC may be limited.

      As noted above, we have demonstrated the presence of lateral inhibition consistent with data in Stefanelli et al in our new supplementary figure 3. We have also shown that sustained SGC firing upon perforant path stimulations is associated with sustained firing in hilar interneurons (Afrasiabi et al., 2022), indicating presence of the SGC to hilar connectivity in our slice preparation. Therefore, we choose not to undertake challenging 2P guided paired recordings of SGCs and mossy cells adjacent to SGC axon terminals reported in Williams et al 2007 to replicate the 9% SGC to MC synaptic connections. These 2P guided slice physiology studies are outside the technical scope of our study.

      (6) Figure 4: The results are relatively in contradiction with the strong lateral inhibition reported in the past (Stefanelli et al., 2016), but the experimental conditions are different in the two studies. Stimulation of a single labeled GC or SGC may not be sufficient to activate an inhibitory neuron, and for the latter to inhibit an unlabeled GC or SGC. Is it possible to measure the sIPSCs received by unlabelled neurons during optogenetic stimulation of all labelled neurons? Could the authors verify whether under their experimental conditions GCs and SGCs do indeed form connections with interneurons, as reported before? Finally, Stefanelli and colleagues (2016) suggest that lateral inhibition is provided by dendrite-targeting somatostatin interneurons. If the authors are recording in the soma, could they underestimate more distal inhibitory inputs? If so, could they record the dendrites of unlabeled neurons?

      Our new control data (Supplementary Fig. 3), using an AAV mediated CaMKII promoter-driven random expression of ChR2 on GCs similar to Stefanelli et al (2016), demonstrate our ability to replicate the lateral inhibition observed by Stefanelli et al. (2016). Thus, our findings more accurately represent lateral inhibition supported by a sparse behaviorally labeled cohort than findings of Stefanelli et al based on randomly labeled neurons. This is now discussed on page 22-23. We respectfully submit that dendritic recordings are outside the scope of the current study.

      We also discuss the possibility that somatic recordings may undersample dendritic inhibitory inputs on page 23: “the possibility that slice recordings lead to underestimation of feedback dendritic inhibition cannot be ruled out.”

      (7) Figure 5: For ease of reading, I would substantially simplify the Results section related to Figure 5, keeping only the main general points of the analysis and the results themselves. The details of the analysis strategy, and the justification for the choices made, are better placed in the Method section (I advise against "data not shown").

      We thank the reviewer for the suggestion to improve accessibility of the results and have moved text related to justification of strategy and controls to the methods. We have also removed references to data not shown.

      (8) Figure 5: why do the authors no longer discriminate between GCs and SGCs?

      Since figure 5 focuses on inputs to labeled pairs versus labeled-unlabeled pairs the pairs include mixed groups with GCs and SGCs. Since the question pertains to inputs rather than cell types, we did not specifically distinguish the cell types. This is now explained in the text on page 15.

      (9) Figure 5: I would like to know more about the temporally connected inputs and their implication in context-dependent recruitment of dentate gyrus neurons. What could be the origin of the shared input received by the neurons recruited after EE exposure? For example, do labeled neurons receive more (temporally correlated or not) inputs from the entorhinal cortex (or any other upstream brain region) than unlabeled neurons? Is there any way (e.g., PP stimulation or any kind of manipulation) to test the causal relationship between temporally correlated input and the context-dependent recruitment of a given neuron?

      We appreciate the reviewer’s comments on the need to examine the source and nature of the correlated inputs to behaviorally labeled neurons. However, the suggested experiments are nontrivial as artificial stimulation of afferent fibers is unlikely to be selective for labeled and unlabeled cells. Given the complexities in design, implementation and interpretation of these experiments we respectfully submit that these are outside the scope of the current study.

      Reviewer #2 (Recommendations for the authors):

      There are a few minor issues limiting the extent of interpretations of the data:

      (1) Only about 7% of the 'engram' cells are re-activated one week after exposure (line 147), it is unclear how meaningful this assembly is given the high number of cells that may either be labelled unrelated to the EE or no longer be part of the memory-related ensemble.

      We now discuss (page 22-23) that the % labeling is consistent with what has been observed in the DG 1 week after fear conditioning (DeNardo et al., 2019) and discuss the caveat that all labeled cells may not represent an engram.  

      (2) Line 215: The wording '32 pairwise connections examined' suggests that there actually were synaptic connections, would recommend altering the wording to 'simultaneously recorded cells examined' to avoid confusion.

      Revised as suggested.

    1. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      The reviewer expressed concern that FOOOF may not be sensitive to peaks located at the edges of the spectrum and suggested using rhythmicity as an alternative measure of oscillatory activity.

      To address this concern, we first conducted a simulation in which we generated power spectra with a single periodic component while varying its parameters. The results confirmed that FOOOF may indeed have reduced sensitivity to low-frequency periodic components. In such cases, periodic activity can be conflated with aperiodic activity, leading to inflated estimates of the aperiodic component. These simulation results are presented in detail at the end of the Supplement.
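
      The conflation described above can be illustrated with a minimal sketch: a log-power spectrum is built from an aperiodic term plus one Gaussian peak, and the aperiodic exponent is then estimated with a plain least-squares line in log-log space. This naive line fit is a stand-in chosen for brevity; it is not the FOOOF algorithm and not the authors' simulation, and the offset, exponent, and peak parameters are hypothetical.

```python
import math


def simulate_log_power(freqs, offset, exponent, peak=None):
    """log10 power = offset - exponent * log10(f), plus an optional
    Gaussian peak given as (center_hz, height, sd_hz)."""
    spectrum = []
    for f in freqs:
        value = offset - exponent * math.log10(f)
        if peak is not None:
            center, height, sd = peak
            value += height * math.exp(-((f - center) ** 2) / (2 * sd ** 2))
        spectrum.append(value)
    return spectrum


def naive_exponent(freqs, log_power):
    """Negated slope of a least-squares line fitted in log-log space."""
    xs = [math.log10(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(log_power) / len(log_power)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, log_power))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


freqs = [2 + 0.5 * i for i in range(77)]  # 2 to 40 Hz in 0.5 Hz steps
true_exponent = 1.0

no_peak = naive_exponent(freqs, simulate_log_power(freqs, 1.0, true_exponent))
low_peak = naive_exponent(
    freqs, simulate_log_power(freqs, 1.0, true_exponent, peak=(3.0, 0.8, 1.0))
)

# An unmodeled peak near the low edge steepens the fitted line,
# so the exponent estimate exceeds the true value.
print(no_peak, low_peak)
```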

      To further investigate whether the low-frequency activity in our datasets may be oscillatory, we employed the phase-autocorrelation function (pACF), a measure of rhythmicity developed by Myrov et al. (2024). We compared pACF and FOOOF-derived parameters using linear mixed models at each channel–frequency–time point (see Methods for details). Our analyses showed that pACF activity closely resembles periodic activity across all three datasets, and is dissimilar to aperiodic parameters (see Figures 5, S4, S5, S21, S22, S34, S35). This supports the interpretation that, in our data, aperiodic activity is not conflated with periodic activity.

      The reviewer was concerned that “there were no dedicated analyses in the paper to show that the aperiodic changes account for the theta changes.”

      To address this concern, we used linear mixed models to estimate the association between FOOOF parameters and baseline-corrected time-frequency activity. These models were fitted at each channel-frequency-time point. Our results indicate that aperiodic activity is correlated with low-frequency (theta) baseline-corrected activity, while periodic activity is correlated primarily with activity in the alpha/beta range, but not with theta (see Figures 4, S3, S20, S33). Additionally, the exponent parameter exhibited a negative correlation in the gamma frequency range.

      These findings support the reviewer's hypothesis: “I would also like to note that if the theta effect is only the aperiodic shift in disguise, we should see a concomitant increase in delta activity too – maybe even a decrease at high frequencies.” Overall, the results are consistent with our interpretation that low-frequency baseline-corrected activity reflects changes in aperiodic, rather than periodic, activity.

      “On page 7 it is noted that baseline correction might subtract a significant amount of ongoing periodic activity. I would replace the word "subtract" with "remove" as not all baseline correction procedures are subtractive. Furthermore, while this sentence makes it sound like a problem, this is, to my mind, a feature, not a bug - baseline correction is meant to take away whatever is ongoing, be it oscillatory or not, and emphasise changes compared to that, in response to some event.”

      We thank the reviewer for this helpful clarification. We have revised the sentence accordingly to read: “Our results show that classical baseline correction can remove continuous oscillatory activity that is present both during baseline and after stimulus onset, because it treats all baseline signals as 'background' to be removed without distinguishing between transient and continuous oscillations. While this is consistent with the intended purpose of baseline correction---to highlight changes relative to ongoing activity---it may also lead to unintended consequences, such as misinterpreting aperiodic activity as an increase in poststimulus theta oscillations.”

      In addition, we have made several broader revisions throughout the manuscript to improve clarity and accuracy in response to the reviewer’s feedback:

      (1) We have softened our interpretation of changes in the theta range. We no longer claim that these effects are solely due to aperiodic activity; rather, we now state that our findings suggest a potential contribution of aperiodic activity to signals typically interpreted as theta oscillations.

      (2) We have revised our language to avoid suggesting a direct “interplay” between periodic and aperiodic components. Instead, we emphasize the concurrent presence of both components, using more precise and cautious formulations.

      (3) We have clarified our discussion of baseline normalization approaches, explicitly noting that our findings hold regardless of whether a subtractive or divisive baseline correction was applied.

      (4) Finally, we have restructured the introduction to improve readability and address points of potential confusion. Specifically, we have clarified the definition and role of 1/f activity, refined the discussion linking baseline correction to aperiodic activity, and improved transitions between key concepts.
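
      The point about subtractive versus divisive baseline correction in item (3) can be sketched as follows. Both corrections flatten power that is equally present before and after stimulus onset, while a transient post-stimulus increase survives either one. The power values and baseline window are hypothetical, and the decibel conversion is one common divisive variant rather than necessarily the one used in the manuscript.

```python
import math


def subtractive_baseline(power, baseline_idx):
    """Subtract the mean baseline power from every time point."""
    base = sum(power[i] for i in baseline_idx) / len(baseline_idx)
    return [p - base for p in power]


def decibel_baseline(power, baseline_idx):
    """Divide by the mean baseline power and convert to decibels."""
    base = sum(power[i] for i in baseline_idx) / len(baseline_idx)
    return [10 * math.log10(p / base) for p in power]


baseline_idx = range(3)  # first three samples are pre-stimulus

# Hypothetical power at one frequency across six time points.
continuous = [4.0, 4.0, 4.0, 4.0, 4.0, 4.0]  # ongoing oscillation
transient = [1.0, 1.0, 1.0, 1.0, 3.0, 1.0]   # brief post-stimulus burst

print(subtractive_baseline(continuous, baseline_idx))
print(decibel_baseline(continuous, baseline_idx))
print(subtractive_baseline(transient, baseline_idx))
print(decibel_baseline(transient, baseline_idx))
```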

      The reviewer suggested that “it might be good to show that the findings were not driven by the cognitive-complaint subgroup (although the internal replications suggest they were not).”

      We agree that it is important to demonstrate that our findings are not driven solely by the cognitive-complaint subgroup. While we did not include additional figures in the manuscript due to their limited relevance to the primary research question, we have attached figures that explicitly show the comparison between the clinical and control groups here in the response to reviewers. These figures include non-significant effects.

      Author response image 1.

      Results of the linear mixed model analysis of periodic activity for comparison between conditions, including non-significant effects (see also Figure 7 in the paper)

      Author response image 2.

      Results of the linear mixed model analysis of aperiodic exponent for comparison between conditions, including non-significant effects (see also Figure 9 in the paper)

      Author response image 3.

      Results of the linear mixed model analysis of aperiodic offset for comparison between conditions, including non-significant effects (see also Figure S11 in the paper)

      “Were lure trials discarded completely, or were they included in the non-target group?”

      Thank you for the question. As described in the Methods section (EEG data preprocessing), lure trials were discarded entirely from further analysis and were not included in the non-target group.

      “Also, just as a side note, while this time-resolved approach is definitely new, it is not novel to this paper, at least two other groups have tried similar approaches, e.g., Wilson, da Silva Castanheira, & Baillet, 2022; Ameen, Jacobs, et al., 2024.”

      Thank you for drawing our attention to these relevant studies. We have now cited both Wilson et al. (2022) and Ameen et al. (2024) in our manuscript. While these papers did indeed use time-resolved approaches, to our knowledge our study is the first to use such an approach within a task-based paradigm.

      The reviewer noted that it was unclear how the periodic component was reconstructed: “I understand that a Gaussian was recreated based on these parameters, but were frequencies between and around the Gaussians just zeroed out? Or rather, given a value of 1, so that it would be 0 after taking its log10.”

      The periodic component was reconstructed by summing the Gaussians derived from the FOOOF model parameters. Since the Gaussians asymptotically approach, but never reach, zero, there were no explicit zeros between them. We have included this explanation in the manuscript.
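
The reconstruction described above can be sketched as follows. This is a minimal illustration, not the analysis code: it assumes each peak is given as a centre frequency, a height in log10 power, and a standard deviation (the Gaussian form used by FOOOF/specparam), and the function name and peak values are hypothetical.

```python
import math

def periodic_component(freqs, peaks):
    """Sum Gaussian peaks over a frequency axis.

    peaks: iterable of (center_hz, height, sd_hz) tuples (hypothetical values).
    Returns one value per frequency, in the same units as `height`.
    """
    return [
        sum(h * math.exp(-((f - c) ** 2) / (2.0 * sd ** 2)) for c, h, sd in peaks)
        for f in freqs
    ]

freqs = [3 + 0.5 * i for i in range(95)]        # 3 to 50 Hz in 0.5 Hz steps
peaks = [(10.0, 0.8, 1.5), (20.0, 0.4, 3.0)]    # illustrative alpha and beta peaks
periodic = periodic_component(freqs, peaks)
```

Because each Gaussian tail decays towards zero without reaching it, every frequency bin receives a small positive value, so no bins need to be zeroed explicitly.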

      “If my understanding is correct, the periodic and aperiodic analyses were not run on the single-trial level, but on trial-averaged TF representations. Is that correct? In that case, there was only a single observation per participant for each within-subject cell at each TF point. This means that model (4) on p. 15 just simplifies to a repeated-measures ANOVA, does it not? As hinted at later in this section, the model was run at each time point for aperiodic analyses, and at each TF point for periodic analyses, resulting in a series of p-values or a map of p-values, respectively, is that correct?”

      We thank the reviewer for this careful reading and helpful interpretation. The reviewer is correct that analyses were conducted on trial-averaged time-frequency representations. The model presented in equation 7 (as referred to in the current version of the manuscript) is indeed conceptually similar to a repeated-measures ANOVA in that it tests within-subject effects across conditions. However, due to some missing data (i.e., excluded conditions within subjects), we employed linear mixed-effects models (LMER), which can handle unbalanced data without resorting to listwise deletion. This provides more flexibility and preserves statistical power.

      The reviewer is also correct that the models were run at each channel-time point for the aperiodic analyses, and at each channel-time-frequency point for the periodic analyses, resulting in a series or map of p-values, respectively.

      The reviewer suggested marking the mean response time and contrasting scalp topographies of response-related ERPs with those of aperiodic components.

      We thank the reviewer for this helpful suggestion. In response, we have now marked the mean response time and associated confidence intervals on the relevant figures (Figures 8 and S8). Additionally, we have included a new figure (Figure S13) presenting both stimulus- and response-locked ERP scalp topographies for comparison with aperiodic activity.

      In the previous version of the manuscript, we assessed the relationship between ERPs and aperiodic parameters by computing correlations between their topographies at each time point. However, to maintain consistency with our other analyses and to provide a more fine-grained view, we revised this approach and now compute correlations at each channel–time point. This updated analysis is presented in Figure S14. The results confirm that the correlation between ERPs and aperiodic activity remains low, and we discuss these findings in the manuscript.
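
A channel–time correlation of this kind can be sketched as below. The sketch assumes the correlation is taken across participants at each channel and time point, which this response does not state explicitly; the array layout, function names, and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pointwise_correlation(erp, aperiodic):
    """Correlate across participants at every channel-time point.

    erp, aperiodic: nested lists indexed [participant][channel][time].
    Returns a [channel][time] map of correlation coefficients.
    """
    n_channels, n_times = len(erp[0]), len(erp[0][0])
    return [
        [
            pearson([p[c][t] for p in erp], [p[c][t] for p in aperiodic])
            for t in range(n_times)
        ]
        for c in range(n_channels)
    ]

# Toy data: 3 participants, 1 channel, 2 time points (made-up numbers).
erp = [[[1.0, 2.0]], [[2.0, 1.0]], [[3.0, 4.0]]]
exponent = [[[2.0, 1.0]], [[4.0, 5.0]], [[6.0, 3.0]]]
r_map = pointwise_correlation(erp, exponent)
```

The output has the same channel-by-time shape as the other statistical maps, which is what makes it directly comparable with them.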

      Regardless of the low correlation, we have added the following statement to the manuscript to clarify our conceptual stance: “While contrasting response-related ERPs with aperiodic components can help address potential confounds, we believe that ERPs are not inherently separate from aperiodic or periodic activity. Instead, ERPs may reflect underlying changes in aperiodic and periodic activity. Therefore, different approaches to studying EEG activity should be seen as providing complementary rather than competing perspectives.”

      “On page 3, it is noted that distinct theta peaks were only observed in 2 participants. Was this through visual inspection?”

      Yes, this observation was based on visual inspection of the individual power spectra. We have included this explanation in the text.

      The reviewer suggested improving the plots by reducing the number of conditions (e.g., averaging across conditions), increasing the size of the colorbars, and using different color scales for different frequency bands, given their differing value ranges. Additionally, the reviewer noted that the theta and alpha results appeared surprising and lacked their expected topographical patterns, possibly due to the color scale.

      We appreciate these thoughtful suggestions and have implemented all of them to improve the clarity and interpretability of the figures. Specifically, we reduced the number of conditions by averaging across them where appropriate, enlarged the colorbars for better readability, and applied separate color scales for different frequency bands to account for variability in dynamic range.

      In the process, we also identified and corrected an error in the code that had affected the topographies of periodic activity in the previous version of the manuscript. With this correction, the resulting topographical patterns are now more consistent with canonical findings and are easier to interpret. For example, activity in the beta range now shows a clear central distribution (see Figure 6B and Figure S5B), and frontal activity in the theta range is more apparent.

      This correction also directly addresses the reviewer’s concern that the “theta and alpha results (where visible) look surprising – the characteristic mid-frontal and posterior topographies, respectively, are not really present.” These unexpected patterns were primarily due to the aforementioned error.

      “Relatedly, why is the mu parameter used here for correlations? Why not simply the RT mean/median, or one of the other ex-Gaussian parameters? Was this an a priori decision?”

      We appreciate the reviewer's thoughtful question. While mean and median RTs are indeed commonly used as summary measures, we chose the mu parameter because it provides a more principled estimate of central tendency that explicitly accounts for the positive skew typically observed in RT distributions. Although we did not directly compare mu, mean and median in this dataset, our experience with similar datasets suggests that differences between them are typically small. We chose not to include other ex-Gaussian parameters (e.g., sigma, tau) to avoid unnecessary model complexity and potential overfitting, especially since our primary interest was not in modelling the full distribution of response variability. This decision was made a priori, although we note that the study was not pre-registered. We have now added a clarification in the manuscript to reflect this rationale.
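
To illustrate how mu relates to the mean RT, the sketch below estimates ex-Gaussian parameters by the method of moments. This estimator is a stand-in chosen for brevity; the fitting procedure actually used is not described in this response, and the simulated RTs and parameter values are hypothetical.

```python
import math
import random

def exgauss_moments(rts):
    """Method-of-moments estimates (mu, sigma, tau) of an ex-Gaussian.

    Uses: mean = mu + tau, variance = sigma^2 + tau^2,
    third central moment = 2 * tau^3.
    """
    n = len(rts)
    mean = sum(rts) / n
    var = sum((x - mean) ** 2 for x in rts) / n
    m3 = sum((x - mean) ** 3 for x in rts) / n
    tau = max(m3 / 2.0, 0.0) ** (1.0 / 3.0)      # guard against negative skew
    mu = mean - tau
    sigma = math.sqrt(max(var - tau ** 2, 0.0))  # guard against sampling noise
    return mu, sigma, tau

# Simulated RTs in seconds: Gaussian(0.40, 0.08) plus exponential with mean 0.15.
random.seed(0)
rts = [random.gauss(0.40, 0.08) + random.expovariate(1 / 0.15) for _ in range(20000)]
mu_hat, sigma_hat, tau_hat = exgauss_moments(rts)
mean_rt = sum(rts) / len(rts)
```

Under this parameterisation the mean RT equals mu plus tau, so mu sits below the mean by the size of the exponential tail; the two measures are close only when tau is small.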

      “Relatedly, were (some) analyses of the study preregistered?”

      The analyses were not preregistered. Our initial aim was to investigate differences in phase-amplitude coupling (PAC) between the clinical and control groups. However, we did not observe clear PAC in either group, an outcome consistent with recent concerns about the validity of PAC measures in scalp EEG data (see: https://doi.org/10.3390/a16120540). This unexpected finding prompted us to shift our focus toward examining the presence of theta activity and assessing its periodicity.

      The reviewer suggested examining whether there might be differences between trials preceded by a target versus trials preceded by a non-target, potentially reflecting a CNV-like mechanism.

      We appreciate the reviewer’s insightful suggestion. The idea of investigating differences between trials preceded by a target versus a non-target, possibly reflecting a CNV-like mechanism, is indeed compelling. However, this question falls outside the scope of the current study and was not addressed in our analyses. We agree that this represents an interesting direction for future research.

      Reviewer #2 (Public review):

      “For the spectral parameterization, it is recommended to report goodness-of-fit measures, to demonstrate that the models are well fit and the resulting parameters can be interpreted.”

      We thank the reviewer for this suggestion. We have added reports of goodness-of-fit measures in the supplementary material (Fig. S9, S25, S41). However, we would like to note that our simulation results suggest that high goodness-of-fit values are not always indicative of accurate parameter estimation. For example, in our simulations, the R² values remained high even when the periodic component was not detectable or when it was conflated with the aperiodic component (e.g., compare Fig. S48 with Fig. S47). We now mention this limitation in the revised manuscript to clarify the interpretation of the goodness-of-fit metrics.

      “Relatedly, it is typically recommended to set a maximum number of peaks for spectral parameterization (based on the expected number in the analyzed frequency range). Without doing so, the algorithm can potentially overfit an excessive number of peaks. What is the average number of peaks fit in the parameterized spectra? Does anything change significantly in setting a maximum number of peaks? This is worth evaluating and reporting.”

      We now report the average number of peaks, which was 1.9 to 2 (Figure S10). The results were virtually identical when the maximum number of peaks was set to 3.

      “In the main text, I think the analyses of 'periodic power' (e.g. section ‘Periodic activity...’ and Figures 4 & 5 could be a little clearer / more explicit on the measure being analyzed. ‘Periodic’ power could in theory refer to the total power across different frequency bands, the parameterized peaks in the spectral models, the aperiodic-removed power across frequencies, etc. Based on the methods, I believe it is either the aperiodic power or an estimate of the total power in the periodic-only model fit. The methods should be clearer on this point, and the results should specify the measure being used.”

      We thank the reviewer for highlighting this point. In our analyses, “periodic power” (or “periodic activity”) refers specifically to the periodic-only model fit. We have added clarifications under Figure 3 and in the Methods section to make this explicit in the revised manuscript.

      “‘The aperiodic component was further separated into the slope (exponent) and offset components’. These two parameters describe the aperiodic component but are not a further decomposition per se - could be rephrased.”

      We thank the reviewer for alerting us to this potential misunderstanding. We have now rephrased the sentence to read: “The aperiodic component was characterised by the aperiodic slope (the negative counterpart of the exponent parameter) and the offset, which together describe the underlying broadband spectral shape.”

      “In the figures (e.g. Figure 5), the channel positions do not appear to be aligned with the head layout (for example - there are channels that extend out in front of the eyes).”

      Corrected.

      “Page 2: aperiodic activity 'can be described by a linear slope when plotted in semi-logarithmic space'. This is incorrect. A 1/f distributed power spectrum has a linear slope in log-log space, not semi-log.”

      Corrected.

      “Page 7: ‘Our results clearly indicate that the classical baseline correction can subtract a significant amount of continuous periodic activity’. I am unclear on what this means - it could be rephrased.”

      We thank the reviewer for pointing out that the statement is not clear. We have now rephrased it to read: “Our results show that classical baseline correction can remove continuous oscillatory activity that is present both during baseline and after stimulus onset, because it treats all baseline signals as 'background' to be removed without distinguishing between transient and continuous oscillations.”
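
The effect described in the rephrased sentence can be shown with a toy example. The sketch below is illustrative only: the power values are made up, and the function is a hypothetical single-frequency baseline correction rather than the pipeline used in the study.

```python
def baseline_correct(power, n_baseline, mode="subtractive"):
    """Baseline-correct a power time course at a single frequency.

    The first `n_baseline` samples define the baseline mean.
    """
    base = sum(power[:n_baseline]) / n_baseline
    if mode == "subtractive":
        return [p - base for p in power]
    if mode == "divisive":
        return [p / base for p in power]
    raise ValueError("mode must be 'subtractive' or 'divisive'")

# Sustained oscillation: identical power before and after stimulus onset.
continuous = [5.0] * 10
# Transient burst: present only after onset (sample 5 onwards).
transient = [0.0] * 5 + [0.0, 2.0, 3.0, 2.0, 0.0]
total = [c + t for c, t in zip(continuous, transient)]

subtracted = baseline_correct(total, n_baseline=5, mode="subtractive")
divided = baseline_correct(total, n_baseline=5, mode="divisive")
```

After subtraction only the transient burst remains, so the sustained oscillation is invisible in the corrected data even though it was present throughout. The divisive variant likewise expresses the burst only relative to the sustained level.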

      “Page 14: 'the FOOOF algorithm estimates the frequency spectrum in a semi-log space'. This is not quite correct - the algorithm parameterizes the spectrum in semi-log but does not itself estimate the spectrum.”

      Again, we thank the reviewer for alerting us to this imprecise description. We have now changed the sentence to: “The FOOOF algorithm parameterises the frequency spectrum in a semi-logarithmic space”.

      We have made refinements to improve clarity, consistency, and flow of the main text. First, we streamlined the introduction by removing redundancies and ensuring a more concise presentation of key concepts. We also clarified our use of terminology, consistently referring to the ‘aperiodic slope’ throughout the manuscript, except where methodological descriptions necessitate the term ‘exponent.’ Additionally, we revised the final section of the introduction to better integrate the discussion of generalisability, ensuring that the inclusion of additional datasets feels more seamlessly connected to the study’s main objectives rather than appearing as an addendum. Finally, we carefully reviewed the entire manuscript to enhance coherence, particularly ensuring that discussions of periodic and aperiodic activity remain precise and do not imply an assumed interplay between the two components. We believe these revisions align with the reviewer’s suggestions and improve the overall readability and logical structure of the manuscript.

      Reviewer #3 (Public review):

      The reviewer raised concerns regarding the task's effectiveness in evoking theta power and the ability of our spectral parameterization method (specparam) to adequately quantify background activity around theta bursts.

      We thank Reviewer #3 for their constructive feedback. To address the concerns regarding the task’s effectiveness in evoking theta power and the adequacy of our spectral parameterization method, we have added additional visualizations using a log-y axis (Figures S1, S19, S32). These figures demonstrate that, in baseline-corrected data, low-frequency activity during working memory tasks appears as both theta and delta activity. Additionally, we have marked the borders between frequency ranges with dotted lines to facilitate clearer visual differentiation between these bands. We believe these additions help clarify the results and address the reviewer’s concerns.

      The reviewer noted that “aperiodic activity seems specifically ~1–2 Hz.”

      In our data, the baseline-corrected low-frequency post-stimulus increase in EEG activity spans approximately 3 to 7 Hz, with no prominent peak observed in the canonical theta band (4–7 Hz). While we did not analyze frequencies below 3 Hz, we agree with the reviewer that some of this activity could potentially fall within the delta range.

      Nonetheless, we would like to emphasize that similar patterns of activity have often been interpreted as theta in the literature, even in the absence of a distinct spectral peak (see: https://doi.org/10.1016/j.neulet.2012.03.076; https://doi.org/10.1016/j.brainres.2006.12.076; https://doi.org/10.1111/psyp.12500; https://doi.org/10.1038/s42003-023-05448-z; in the last of these, see particularly the interpretation of State 1 as a “theta prefrontal state”).

      To accommodate both interpretations, we have opted to use the more neutral term “low-frequency activity” where appropriate. However, we also clarify that such activity is frequently referred to as “theta” in prior studies, even in the absence of a clear oscillatory peak.

      “Figure 4 [now Figure 6]: there is no representation of periodic theta.”

      Yes, this is one of the main findings of our study: periodic theta is absent in the vast majority of participants. A similar result was reported in a recent preprint on a working memory task (https://doi.org/10.1101/2024.12.16.628786), which further supports our results.

      “Figure 5 [now Figure 7]: there is some theta here, but it isn't clear that this is different from baseline corrected status-quo activity.”

      This figure shows comparisons of periodic activity between conditions. Although there are differences between conditions in the theta band, this does not indicate the presence of theta oscillations. Instead, the differences between the conditions in the theta band are most likely due to alpha components extending into the theta band (see Figure S6). This is further supported by the large overlap of significant channels between theta and alpha in Figure 7.

      “Figure 8: On the item-recognition task, there appears to be a short-lived burst in the high delta / low theta band, for about 500 ms. This is a short phenomenon, and there is no evidence that specparam techniques can resolve such time-limited activity.”

      We thank the reviewer for their comment. As we noted in our preliminary response, specparam, in the form we used, does not incorporate temporal information; it can be applied to any power spectral density (PSD), regardless of how the PSD is derived. Therefore, the ability of specparam to resolve temporal activity depends on the time-frequency decomposition method used. In particular, the performance of specparam is limited by the underlying time-frequency decomposition method and the data available for it. In fact, Wilson et al. (2022, https://doi.org/10.7554/eLife.77348), who have developed an approach for time-resolved estimation of aperiodic parameters, actually compare two approaches that differ only in their underlying time-frequency estimation method, while the specparam algorithm is the same in both cases. For the time-frequency decomposition we used superlets (https://doi.org/10.1038/s41467-020-20539-9), which have been shown to resolve short bursts of activity more effectively than other methods. To our knowledge, superlets provide the highest resolution in both time and frequency compared to wavelets or STFT.

      To improve the stability of the estimates, we performed spectral parameterisation on trial-averaged power rather than on individual trials (unlike the approach in Wilson et al., 2022). In contrast, Gyurkovics et al. (2022), who also investigated task-related changes in aperiodic activity, estimated power spectra at the single-trial level but stabilised their estimates by averaging over 1-second time windows, which reduced their temporal resolution. We have now clarified this point in the manuscript.

      “The authors note in the introduction that ‘We hypothesised that the aperiodic slope would be modulated by the processing demands of the n-back task, and that this modulation would vary according to differences in load and stimulus type.’. This type of parametric variation would be a compelling test of the hypothesis, but these analyses only included alpha and beta power (Main text & Figure 4)”

      We appreciate the reviewer's comment, but would like to clarify that the comparison between conditions was performed separately for both periodic power and aperiodic parameters. The periodic power analyses included all frequencies from 3 to 50 Hz (or 35 Hz in the case of the second dataset). All factors were included in the linear model (see LMM formula in equation 7 - subsection Methods / Comparisons between experimental conditions), but the figures only include fixed effects that were statistically significant. For example, Figure 7 shows the periodic activity and Figure 9 shows the exponent, with further details provided in other supplementary figures.

      “Figure 5 does show some plots with some theta activity, but it is unclear how this representation of periodic activity has anything to do with the major hypothesis that aperiodic slope accounts for task-evoked theta.” /…/ In particular, specparam is a multi-step model fitting procedure and it isn't impressively reliable even in ideal conditions (PMID: 38100367, 36094163, 39017780). To achieve the aim stated in the title, abstract, and discussion, the authors would have to first demonstrate the robustness of this technique applied to these data.

      We acknowledge these concerns and have taken several steps to clarify the relationship between the aperiodic slope and low-frequency activity, and to assess the robustness of the specparam (FOOOF) approach in our data.

      First, we directly compared baseline-corrected activity with periodic and aperiodic components in all three data sets. These analyses showed that low-frequency increases in baseline-corrected signals consistently tracked aperiodic parameters - in particular the aperiodic exponent - rather than periodic theta activity (see Figs 4, S3, S20, S33). Periodic components, on the other hand, were primarily associated with baseline-corrected activity in the alpha and beta bands. The aperiodic exponent also showed negative correlations with high beta/gamma baseline-corrected activity, which is exactly what would be expected in the case of a shift in the aperiodic slope (rather than delta/theta oscillations). See also examples at https://doi.org/10.1038/s41593-020-00744-x (Figures 1c-iv) or https://doi.org/10.1111/ejn.15361 (Figures 3c,d).

      Next, because reviewer #1 was concerned that FOOOF might be insensitive to peaks at the edges of the spectrum, we ran a simulation that confirmed this concern. We then applied an alternative phase-based measure of oscillatory activity: the phase-autocorrelation function (pACF; Myrov et al., 2024). This method does not rely on spectral fitting and is sensitive to phase rather than amplitude. Across all datasets, pACF results were in close agreement with periodic estimates from FOOOF and were not correlated with aperiodic parameter estimates (Figs 5, S4, S5, S21, S22, S34, S35).

      Taken together, these complementary analyses suggest that the apparent low-frequency (delta, theta) activity observed in the baseline-corrected data is better explained by changes in the aperiodic slope than by true low-frequency oscillations. While we acknowledge the limitations of any single method, the convergence between the techniques increases our confidence in this interpretation.

      “How did the authors derive time-varying changes in aperiodic slope and exponent in Figure 6 [now Figure 8]?”

      We thank the reviewer for this question. As explained in the Methods section, we first performed a time-frequency decomposition, averaged across trials, and then applied a spectral decomposition to each time point.
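
The order of operations can be sketched as below. This is a simplified stand-in, not the specparam procedure: it fits a straight line in log-log space by ordinary least squares and ignores periodic peaks, which specparam models explicitly. All function names and input values are hypothetical.

```python
import math

def fit_aperiodic(freqs, power):
    """Least-squares line in log-log space: log10(P) = offset - exponent * log10(f)."""
    x = [math.log10(f) for f in freqs]
    y = [math.log10(p) for p in power]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    offset = my - slope * mx
    return offset, -slope  # the exponent is the negative of the log-log slope

def time_resolved_aperiodic(tf_trials, freqs):
    """tf_trials: power indexed [trial][time][frequency].

    Averages power across trials first, then fits each time point separately.
    Returns a list of (offset, exponent) tuples, one per time point.
    """
    n_trials, n_times = len(tf_trials), len(tf_trials[0])
    fits = []
    for t in range(n_times):
        mean_spectrum = [
            sum(trial[t][k] for trial in tf_trials) / n_trials
            for k in range(len(freqs))
        ]
        fits.append(fit_aperiodic(freqs, mean_spectrum))
    return fits

# Toy input: two trials, two time points, pure 1/f^exponent spectra (3-50 Hz).
freqs = [float(f) for f in range(3, 51)]
exponents = [1.0, 1.5]   # the spectrum steepens at the second time point
scales = [0.5, 1.5]      # per-trial gain; averages to 1
tf_trials = [[[s * 10.0 / f ** e for f in freqs] for e in exponents] for s in scales]
fits = time_resolved_aperiodic(tf_trials, freqs)
```

Averaging before fitting yields one spectrum per time point, which trades single-trial information for more stable parameter estimates.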

      “While these methodological details may seem trivial and surmountable, even if successfully addressed the findings would have to be very strong in order to support the rather profound conclusions that the authors made from these analyses, which I consider unsupported at this time:

      (a) ‘In particular, the similarities observed in the modulation of theta-like activity attributed to aperiodic shifts provide a crucial validation of our conclusions regarding the nature of theta activity and the aperiodic component.’

      (b) ‘where traditional baseline subtraction can obscure significant neural dynamics by misrepresenting aperiodic activity as theta band oscillatory activity’

      (d) ‘our findings suggest that theta dynamics, as measured with scalp EEG, are predominantly a result of aperiodic shifts.’

      (e)  ‘a considerable proportion of the theta activity commonly observed in scalp EEG may actually be due to shifts in the aperiodic slope’.

      (f) ‘It is therefore essential to independently verify whether the observed theta activity is genuinely oscillatory or primarily aperiodic’

      [this would be great, but first we need to know that specparam is capable of reliably doing this].”

      We believe that our claims are now supported by the aforementioned analyses, namely the associations between baseline-corrected time-frequency activity and FOOOF parameters, and the associations between FOOOF parameters and pACF.

      The reviewer found it unclear what low-frequency phase has to do with 1/f spectral changes: ‘Finally, our findings challenge the established methodologies and interpretations of EEG-measured cross-frequency coupling, particularly phase-amplitude coupling’

      We thank the reviewer for their comment. To address this concern, we have added further clarification in the Discussion section. Our results are particularly relevant for phase-amplitude coupling (PAC) based on theta, such as theta-gamma coupling. PAC relies on the assumption that there are distinct oscillations at both frequencies. However, if no clear oscillations are present at these frequencies— specifically, if theta oscillations are absent—then the computation of PAC becomes problematic.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Most studies in sensory neuroscience investigate how individual sensory stimuli are represented in the brain (e.g., the motion or color of a single object). This study starts tackling the more difficult question of how the brain represents multiple stimuli simultaneously and how these representations help to segregate objects from cluttered scenes with overlapping objects.

      Strengths

      The authors first document the ability of humans to segregate two motion patterns based on differences in speed. Then they show that a monkey's performance is largely similar; thus establishing the monkey as a good model to study the underlying neural representations.

      Careful quantification of the neural responses in the middle temporal area during the simultaneous presentation of fast and slow speeds leads to the surprising finding that, at low average speeds, many neurons respond as if the slowest speed is not present, while they show averaged responses at high speeds. This unexpected complexity of the integration of multiple stimuli is key to the model developed in this paper.

      One experiment in which attention is drawn away from the receptive field supports the claim that this is not due to the involuntary capture of attention by fast speeds.

      A classifier using the neuronal response and trained to distinguish single-speed from bi-speed stimuli shows a similar overall performance and dependence on the mean speed as the monkey. This supports the claim that these neurons may indeed underlie the animal's decision process.

      The authors expand the well-established divisive normalization model to capture the responses to bi-speed stimuli. The incremental modeling (eq 9 and 10) clarifies which aspects of the tuning curves are captured by the parameters.

      We thank the Reviewer for the thorough summary of the findings and supportive comments.

      Weaknesses

      While the comparison of the overall pattern of behavioral performance between monkeys and humans is important, some of the detailed comparisons are not well supported by the data. For instance, whether the monkey used the apparent coherence simply wasn't tested and a difference between 4 human subjects and a single monkey subject cannot be tested statistically in a meaningful manner. I recommend removing these observations from the manuscript and leaving it at "The difference between the monkey and human results may be due to species differences or individual variability" (and potentially add that there are differences in the task as well; the monkey received feedback on the correctness of their choice, while the humans did not.)

      Thanks for the suggestion. We agree and have modified the text accordingly. We now state on page 8, lines 189-191, "The difference between the monkey and human results may be due to species differences or individual variability. The differences in behavioral tasks may also play a role – the monkey received feedback on the correctness of the choice, whereas human subjects did not."

      A control experiment aims to show that the "fastest speed takes all" behavior is general by presenting two stimuli that move at fast/slow speeds in orthogonal directions. The claim that these responses also show the "fastest speed takes all" is not well supported by the data. In fact, for directions in which the slow speed leads to the largest response on its own, the population response to the bi-speed stimulus is the average of the response to the components (This is fine. One model can explain all direction tuning curve, which also explain averaging at the slower speed stronger directions). Only for the directions where the fast speed stimulus is the preferred direction is there a bias towards the faster speed (Figure 7A). The quantification of this effect in Figure 7B seems to suggest otherwise, but I suspect that this is driven by the larger amplitude of Rf in Figure 8, and the constraint that ws and wf are constant across directions. The interpretation of this experiment needs to be reconsidered.

      The Reviewer raised a good question. Our model with fixed weights for faster and slower components across stimulus directions provided a parsimonious explanation for the whole tuning curve, regardless of whether the faster component elicited a stronger response than the slower component. Because the model can be well constrained by the measured direction-tuning curves, we did not constrain ws and wf to sum to one, which makes the model more general. The linear weighted summation (LWS) model fits the neuronal responses to the bi-speed stimuli very well, accounting for an average of 91.8% (std = 7.2%) of the response variance across neurons. As suggested by the Reviewer, we now use the normalization model to fit the data with fixed weights across all motion directions. The normalization model also provides a good fit, accounting for an average of 90.5% (std = 7.1%) of the response variance across neurons.
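
A fixed-weight fit of this kind can be sketched as below. The sketch assumes the LWS model takes the form R_bi = ws * R_s + wf * R_f with no further terms, which this response does not spell out; the function names and tuning-curve values are hypothetical.

```python
def fit_lws(r_slow, r_fast, r_bi):
    """Least-squares weights (ws, wf) for r_bi ~ ws * r_slow + wf * r_fast.

    One pair of weights is shared across all directions; the weights are
    not constrained to sum to one.
    """
    a = sum(s * s for s in r_slow)
    b = sum(s * f for s, f in zip(r_slow, r_fast))
    d = sum(f * f for f in r_fast)
    p = sum(s * r for s, r in zip(r_slow, r_bi))
    q = sum(f * r for f, r in zip(r_fast, r_bi))
    det = a * d - b * b                      # determinant of the normal equations
    return (p * d - b * q) / det, (a * q - b * p) / det

def variance_explained(r_bi, predicted):
    """Fraction of response variance accounted for by the prediction."""
    mean = sum(r_bi) / len(r_bi)
    sse = sum((r - m) ** 2 for r, m in zip(r_bi, predicted))
    sst = sum((r - mean) ** 2 for r in r_bi)
    return 1.0 - sse / sst

# Toy direction-tuning curves (spikes/s) at six directions, made-up numbers.
r_slow = [10.0, 20.0, 40.0, 20.0, 10.0, 5.0]
r_fast = [5.0, 10.0, 20.0, 40.0, 20.0, 10.0]
r_bi = [0.2 * s + 0.7 * f for s, f in zip(r_slow, r_fast)]  # faster-speed bias

ws, wf = fit_lws(r_slow, r_fast, r_bi)
ve = variance_explained(r_bi, [ws * s + wf * f for s, f in zip(r_slow, r_fast)])
```

Because both component tuning curves are measured at every direction, two free weights are well determined by the data, so the sum-to-one constraint is not needed for identifiability.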

      Note that in the new Figure 8A, at the left side of the tuning curve (i.e., at negative vector average (VA) directions), where the slower component moves in a more preferred direction of the neurons than the faster component, the bi-speed response (red curve) is slightly lower than the average of the component responses (gray curve), indicating a bias toward the weaker faster component. Therefore, the faster-speed bias does not occur only when the faster component moves in the more preferred direction. This can also be seen in the direction-tuning curves of an example neuron that we added to the figure (new Fig. 8B). The peak responses to the slower and faster components were about the same, but the neuron still showed a faster-speed bias. At negative VA directions, the red curve is lower than the response average (gray curve) and is biased toward the weaker (faster) component.

      The faster-speed bias also occurs when the peak response to the slower component is stronger than that to the faster component. As a demonstration, Author response image 1 shows an example MT neuron that has a slow preferred speed (PS = 1.9 deg/s) and was stimulated by two speeds of 1.2 and 4.8 deg/s. The peak response to the faster component (blue) was weaker than that to the slower component (green). However, this neuron showed a strong bias toward the faster component. A normalization model fit with fixed weights for the faster and slower components (black curve) described the neuronal response to both speeds (red) well. This neuron was not included in the neuron population shown in Figure 8 because it was not tested with stimulus speeds of 2.5 and 10 deg/s.

      Author response image 1.

      An example MT neuron was tested with stimulus speeds of 1.2 and 4.8 deg/s. The preferred speed of this neuron was 1.9 deg/s. Fixed weights of 0.59 for the faster component and 0.12 for the slower component described the responses to the bi-speed stimuli well using a normalization model. The neuron showed a faster-speed bias although its peak response to the slower component was higher than that to the faster component.

      We modified the text to clarify these points:

      Page 19, lines 405 – 410, “The bi-speed response was biased toward the faster component regardless of whether the response to the faster component was stronger (in positive VA directions) or weaker (in negative VA directions) than that to the slower component (Fig. 8A). The result from an example neuron further demonstrated that, even when the peak firing rates of the faster and slower component responses were similar, the response elicited by the bi-speed stimuli was still biased toward the faster component (Fig. 8B).”

      Page 19, lines 421 – 427, “Because the model can be well constrained by the measured direction-tuning curves, it is not necessary to require 𝑤<sub>s</sub> and 𝑤<sub>f</sub> to sum to one, which is more general. An implicit assumption of the model is that, at a given pair of stimulus speeds, the response weights for the slower and faster components are fixed across motion directions. The model fitted MT responses very well, accounting for an average of 91.8% of the response variance (std = 7.2%, N = 21) (see Methods). The success of the model supports the assumption that the response weights are fixed across motion directions.”

      Reviewer #2 (Public Review):

      Summary:

      This is a paper about the segmentation of visual stimuli based on speed cues. The experimental stimuli are random dot fields in which each dot moves at one of two velocities. By varying the difference between the two speeds, as well as the mean of the two speeds, the authors estimate the capacity of observers (human and non-human primates) to segment overlapping motion stimuli. Consistent with previous work, perceptual segmentation ability depends on the mean of the two speeds. Recordings from area MT in monkeys show that the neuronal population response to compound stimuli often shows a bias towards the faster-speed stimuli. This bias can be accounted for with a computational model that modulates single-neuron firing rates by the speed preferences of the population. The authors also test the capacity of a linear classifier to produce the psychophysical results from the MT data.

      Strengths:

      Overall, this is a thorough treatment of the question of visual segmentation with speed cues. Previous work has mostly focused on other kinds of cues (direction, disparity, color), so the neurophysiological results are novel. The connection between MT activity and perceptual segmentation is potentially interesting, particularly as it relates to existing hypotheses about population coding.

      We thank the Reviewer for the summary and comments.

      Weaknesses:

      Page 10: The relationship between (R-Rs) and (Rf-Rs) is described as "remarkably linear". I don't actually find this surprising, as the same term (Rs) appears on both the x- and y-axes. The R^2 values are a bit misleading for this reason.

      The Reviewer is correct that subtracting a common term Rs from R and Rf would introduce correlation between (R-Rs) and (Rf-Rs). To address this concern, we conducted an additional analysis. We showed that, at most speed pairs, the R^2 values between (R-Rs) and (Rf-Rs) based on the data are significantly higher than the R^2 values between (R’-Rs) and (Rf-Rs), in which R’ was a random combination of Rs and Rf. Since the same Rs was commonly subtracted in calculating R^2 (data) and R^2 (simulation), the difference between R^2 (data) and R^2 (simulation) suggests that the response pattern of R contributes to the additional correlation.

      We now acknowledge this confounding factor and describe the new analysis results on page 14, lines 309 – 326. Please also see the response to Reviewer 3 about a similar concern.

      Figure 9: I'm confused about the linear classifier section of the paper. The idea makes sense - the goal is to relate the neuronal recordings to the psychophysical data. However the results generally provide a poor quantitative match to the psychophysical data. There is mention of a "different paper" (page 26) involving a separate decoding study, as well as a preprint by Huang et al. (2023) that has better decoding results. But the Huang et al. preprint appears to be identical to the current manuscript, in that neither has a Figure 12, 13, or 14. The text also says (page 26) that the current paper is not really a decoding study, but the linear classifier (Figure 9F) is a decoder, as noted on page 10. It sounds like something got mixed up in the production of two or more papers from the same dataset.

      We apologize for the confusion regarding the reference of Huang et al. (2023, bioRxiv). We referred to an earlier version of this bioRxiv manuscript (version 1), which included decoding analysis. In the bibliography, we provided two URLs for this pre-print. While the second link was correct, the first URL automatically links to the latest version (version 2), which did not have the abovementioned decoding analysis.

      The analysis in Figure 9 applies a classifier to discriminate two-speed from single-speed stimuli, which is a decoding analysis, as the Reviewer pointed out. We revised the Results section about the classifier to make clear what the classifier can and cannot explain (pages 22-23, lines 516-534). We also included a sentence at the end of this section that leads to additional decoding analysis to extract motion speed(s) from MT population responses (page 23, lines 541-543): “To directly evaluate whether the population neural responses elicited by the bi-speed stimulus carry information about two speeds, it is important to conduct a decoding analysis to extract speed(s) from MT population responses.”

      In any case, I think that some kind of decoding analysis would really strengthen the current paper by linking the physiology to the psychophysics, but given the limitations of the linear classifier, a more sophisticated approach might be necessary -- see for example Zemel, Dayan, and Pouget, 1998. The authors might also want to check out closely related work by Treue et al. (Nature Neuroscience 2000) and Watamaniuk and Duchon (1992).

      We thank the Reviewer for the suggestion and agree that it is useful to incorporate additional decoding analysis that can better link physiology results to psychophysics. The decoding analysis we conducted was motivated by the framework proposed by Zemel, Dayan, and Pouget (1998), and also similar to the idea briefly mentioned in the Discussion of Treue et al. (2000). We have added the decoding analysis to this paper on pages 25-32.  

      What do we learn from the normalization model? Its formulation is mostly a restatement of the results - that the faster and slower speeds differentially affect the combined response. This hypothesis is stated quantitatively in equation 8, which seems to provide a perfectly adequate account of the data. The normalization model in equation 10 is effectively the same hypothesis, with the mean population response interposed - it's not clear how much the actual tuning curve in Figure 10A even matters, since the main effect of the model is to flatten it out by averaging the functions in Figure 10B. Although the fit to the data is reasonable, the model uses 4 parameters to fit 5 data points and is likely underconstrained; the parameters other than alpha should at least be reported, as it would seem that sigma is actually the most important one. And I think it would help to examine how robust the statistical results are to different assumptions about the normalization pool.

      In the linear weighted summation (LWS) model (Eq. 8), the weights Ws and Wf are free parameters. We think the value of the normalization model (Eq. 9) is that it provides an explanation of what determines the response weights. We agree with the Reviewer that using the normalization model (Eq. 9) with 4 parameters to fit 5 data points of the tuning curves to bi-speed stimuli of individual neurons is under-constrained. We, therefore, removed the section using the normalization model to fit overlapping stimuli moving in the same direction at different speeds.

      A better way to constrain the normalization model is to use the full direction-tuning curves of MT neurons in response to two stimulus components moving in different directions at different speeds, as shown in Figure 8. We now use the normalization model (Eq. 9) to fit this data set (also suggested by Reviewer 1), in addition to the LWS model. We now report the median values of the model parameters of the normalization model, including the exponent n, sigma, alpha, and the constant c. We also compared the normalization model fit with the linear summation (LWS) model. We discuss the limitations of our data set and what needs to be done in future studies. The revisions are on page 20, lines 434-467 in the Results, and pages 34-35, lines 818-829 in Discussion.

      Reviewer #3 (Public Review):

      Summary:

      This study concerns how macaque visual cortical area MT represents stimuli composed of more than one speed of motion.

      Strengths:

      The study is valuable because little is known about how the visual pathway segments and preserves information about multiple stimuli. The study presents compelling evidence that (on average) MT neurons represent the average of the two speeds, with a bias that accentuates the faster of the two speeds. An additional strength of the study is the inclusion of perceptual reports from both humans and one monkey participant performing a task in which they judged whether the stimuli involved one vs two different speeds. Ultimately, this study raises intriguing questions about how exactly the response patterns in visual cortical area MT might preserve information about each speed, since such information could potentially be lost in an average response as described here, depending on assumptions about how MT activity is evaluated by other visual areas.

      Weaknesses:

      My main concern is that the authors are missing an opportunity to make clear that the divisive normalization, while commonly used to describe neural response patterns in visual areas (and which fits the data here), fails on the theoretical front as an explanation for how information about multiple stimuli can be preserved. Thus, there is a bit of a disconnect between the goal of the paper - how does MT represent multiple stimuli? - and the results: mostly averaging responses which, while consistent with divisive normalization, would seem to correspond to the perception of a single intermediate speed. This is in contrast to the psychophysical results which show that subjects can at least distinguish one from two speeds. The paper would be strengthened by grappling with this conundrum in a head-on manner.

      We thank the Reviewer for the constructive comments. We agree with the Reviewer that it is important to connect the encoding of multiple speeds with the perception. The Reviewer also raised an important question regarding whether multiple speeds can be extracted from population neural responses, given the encoding rules characterized in this study.

      It is a hard problem to extract multiple stimulus values from the population neural response. Inspired by the theoretical framework proposed by Zemel et al. (1998), we conducted a detailed decoding study to extract motion speed(s) from MT population responses. We used the decoded speed(s) to perform a discrimination task similar to our psychophysics task and compared the decoder's performance with perception. We found that, at ×4 speed difference, we could decode two speeds based on MT response, and the decoder's performance was similar to that of perception. However, at ×2 speed difference, except at the slowest speeds of 1.25 and 2.5 deg/s, the decoder cannot extract two speeds and cannot differentiate between a bi-speed stimulus and a single log-mean speed stimulus. We have added the decoding analysis to this paper on pages 25-32. We also discuss the implications and limitations of these results (pages 35-36, lines 852-884).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Classifier:

      One question I have is how the classifier's performance scales with the number of neurons used in the analysis. Here that number is set to the number that was recorded, but it is a free parameter in this analysis. Why does the arbitrary choice of 100 neurons match the animals' performance?

      We apologize for the lack of clarity on this point. The decoding using the classifier was based on the neural responses of 100 recorded MT neurons in our data set. The number of 100 neurons was not a free parameter. We need to reconstruct the population neural response based on the responses of the recorded neurons and their preferred speeds (red and black dots in Figure 9A-E).

      We spline-fitted the reconstructed population neural responses (red and black curves in Figure 9A-E). One way to change the number of neurons used for the decoding is to resample N points along the spline-fitted population responses, using N as a free parameter. However, we think it is better to conduct decoding based on the responses from the recorded neurons rather than based on interpolated responses. We now clarify on page 22, lines 520-522, that the classification (decoding) was based on the responses of the 100 recorded neurons in our dataset.

      Normalization Model:

      Although the model is phenomenological, a schematic circuit diagram could help the reader understand how this could work (I think this is worthwhile even though the data cannot distinguish among different implementations of divisive normalization).

      Thanks for this suggestion. We agree that a circuit diagram would help the readers understand how the model works. However, as the Reviewer pointed out, our data cannot distinguish between different implementations of the model. For example, divisive normalization can occur on the inputs to MT neurons or on MT neurons themselves. The circuit mechanism of weighting the component responses is not clear either. A schematic circuit diagram then mainly serves to recapitulate the normalization model in Equation 9. We, therefore, choose not to add a schematic circuit diagram at this time. We are interested in developing a circuit model to account for how visual neurons represent multiple stimuli in future studies.

      Another suggestion is that the time courses could be used to constrain the model; the fact that it takes a while after the onset of the slow-speed response for averaging to reveal itself suggests the presence of inertia/hysteresis in the circuit).

      We agree that the time course of MT responses could be used to constrain the model. This is also why we think it is important to document the time course in this paper. We now state in the Results, page 17, lines 354-357:

      “At slow speeds, the very early faster-speed bias suggests a likely role of feedforward inputs to MT on the faster-speed bias. The slightly delayed reduction (normalization) in the bi-speed response relative to the stronger component response also helps constrain the circuit model for divisive normalization.”

      Two-Direction Experiment:

      Applying the normalization model to this dataset could help determine its generality.

      This is a good point. We now apply the normalization model (Eq. 9) to fit this data set with the full direction tuning curves in response to two stimuli moving in different directions at different speeds. Please also see the response to Reviewer 2 about the normalization model fit.

      The results of the normalization model fit are now described on page 20 and Figure 8A, B, D.

      Reviewer #2 (Recommendations For The Authors):

      In terms of impact, I would say that the presentation is geared largely toward people who go to VSS. To broaden the appeal, the authors might consider a more general formulation of the four hypotheses stated at the bottom of page 3. These are prominent ideas in systems neuroscience - population encoding, Bayesian inference, etc.

      We thank the Reviewer for the suggestion. We have revised the Introduction accordingly on pages 3-4, lines 43-69. Please also see the response to Reviewer 3 about the Introduction.

      Figure 5: It might be helpful to show the predictions for different hypotheses. If the response to the transparent stimulus is equal to that of the faster stimulus, you will have a line with slope 1. If it is equal to the response to the slow stimulus, all points will lie on the x-axis. In between you get lines with slopes less than 1.

      In Figures 5F1 and 5F2, we show dotted lines indicating faster-all (i.e., faster-component-take-all), response averaging, and slower-all (i.e., slower-component-take-all) on the X-axis. We show those labels in between Figs. 5F1 and F2.

      Figure 6: The analysis is not motivated by any particular question, and the results are presented without any quantitation. This section could be better motivated or else removed.

      We now better motivate the section about the response time course on page 16, lines 336 – 339: “The temporal dynamics of the response bias toward the faster component may provide a useful constraint on the neural model that accounts for this phenomenon. We therefore examined the timecourse of MT response to the bi-speed stimuli. We asked whether the faster-speed bias occurred early in the neuronal response or developed gradually.”

      On page 17, lines 354-357, we also state that “At slow speeds, the very early faster-speed bias suggests a likely role of feedforward inputs to MT on the faster-speed bias. The slightly delayed reduction (normalization) in the bi-speed response relative to the stronger component response also helps constrain the circuit model for divisive normalization.”

      Equation (9): There appears to be an "S" missing in the denominator.

      We double-checked and did not see a missing "S" in Equation 9, on page 20.  

      Reviewer #3 (Recommendations For The Authors):

      This is an impressive study, with the chief strengths being the computational/theoretical motivation and analyses and the inclusion of psychophysics together with primate neurophysiology. The manuscript is well-written and the figures are clear and convincing (with a couple of suggestions detailed below).

      We thank the Reviewer for the comments.

      Specific suggestions:

      (1) Intro para 3

      "It is conceivable that the responses of MT neurons elicited by two motion speeds may follow one of the following rules: (1) averaging the responses elicited by the individual speed components; (2) bias toward the speed component that elicits a stronger response, i.e. "soft-max operation" (Riesenhuber and Poggio, 1999); (3) bias toward the slower speed component, which may better represent the more probable slower speeds in nature scenes (Weiss et al., 2002); (4) bias toward the faster speed component, which may benefit the segmentation of a faster-moving stimulus from a slower background."

      This would be a good place to point out which of these options is likely to preserve vs. lose information and how.

      It seems to me that only #2 is clearly information-preserving, assuming that there are neurons with a variety of different speed preferences such that different neurons will exhibit different "winners". #1 would predict subjects would perceive only an intermediate speed, whereas #3 would predict perceiving only/primarily the slower speed and #4 would predict only/primarily perceiving the faster speed.

      The difference between "only" and "primarily" would depend on whether the biases are complete or only partial. I acknowledge that the behavioral task in the study is not a "report all perceived speeds" task, but rather a 1 vs 2 speeds task, so the behavioral assay is not a direct assessment of the question I'm raising here, but I think it should still be possible to write about the perceptual implications of these different possibilities for encoding in an informative way.

      Thanks for the suggestions. We have revised this paragraph in the Introduction on pages 3 – 4, lines 43 – 69.

      (2) Analysis clarifications

      The section "Relationship between the responses to bi-speed stimuli and constituent stimulus components" could use some clarification/rearrangement/polish. I had to read it several times. Possibly, rearrangement, simplification/explanation of nomenclature, and building up from a simpler to a more complex case would help. If I understand correctly, the outcome of the analysis is to obtain a weight value for every combination of slow and fast speeds used. The R's in equation 5 are measured responses, observed on the single stimulus and combined stimulus trials. It was not clear to me if the R's reflect average responses or individual trial responses; this should be clarified. Ws = 1- wf so in essence only 1 weight is computed for each combination. Then, in the subsequent sections of the manuscript, the authors explore whether the weight computed for each stimulus combination is the same or does it vary across conditions. If I have this right, then walking through these steps will aid the reader.

      The Reviewer is correct. We now walk through these steps and better state the rationale for this approach. The R's in Equation 5 are trial-averaged responses, not trial-by-trial responses.

      We have clarified these points on page 13.

      To take a particular example, the sentence "Using this approach to estimate the response weights for individual neurons can be inaccurate because, at each speed pair, the weights are determined only by three data points" struck me as a rather backdoor way to get at the question. Is the estimate noisy? Or does the weighting vary systematically across speeds? I think the authors are arguing the latter; if so, it would be valuable to say so.

      We wanted to estimate the weighting for each speed pair and determine whether the weights change with the stimulus speeds. Indeed, we found that the weights change systematically across speed pairs. The issue was not that the estimate was noisy (see below, in our response to the second paragraph of point 3).

      We have clarified this point in the text, on page 13, lines 273 – 280: “Our goal was to estimate the weights for each speed pair and determine whether the weights change with the stimulus speeds. In our main data set, the two speed components moved in the same direction. To determine the weights of w<sub>s</sub> and w<sub>f</sub> for each neuron at each speed pair, we have three data points R, R<sub>s</sub>, and R<sub>f</sub>, which are trial-averaged responses. Since it is not possible to solve for both variables, w<sub>s</sub> and w<sub>f</sub>, from a single equation (Eq. 5) with three data values, we introduced an additional constraint: w<sub>s</sub> + w<sub>f</sub> = 1. While this constraint may not yield the exact weights that would be obtained with a fully determined system, it nevertheless allows us to characterize how the relative weights vary with stimulus speed.”

      (3) Figure 5

      Related to the previous point, Figures 5A-E are subject to a possible confound. When plotting x vs y values, it is critical that the x and y not depend trivially on the same value. Here, the plots are R-Rs and Rf-Rs. Rs, therefore, is contained in both the x and y values. Assume, for the sake of argument, that R and Rf are constants, whereas Rs is drawn from a distribution of random noise. When Rs, by chance, has an extreme negative value, R-Rs and Rf-Rs will be large positive values. The solution to this artificial confound is to split the trials that generate Rs into two halves and subtract one half from R and the other half from Rf. Then, the same noisy draw will not be contributing to both x and y. The above is what is needed if the authors feel strongly about including this analysis.

      The Reviewer is correct that subtracting a common term (Rs) would introduce a correlation between (R-Rs) and (Rf-Rs) (Reviewer 2 also raised this point). R's in Equations 5, 6, 7 (and Figure 5A-E) are trial-averaged responses. So, we cannot address the issue by dividing R’s into two halves. Our results showed that the regression slope (W<sub>f</sub>) changed from near 1 to about 0.5 as the stimulus speeds increased, and the correlation coefficient between (R – Rs) and (R<sub>f</sub> – Rs) was high at slow stimulus speeds. To determine whether these results can be explained by the confounding factor of subtracting a common term Rs, rather than by the pattern of R in representing two speeds, we did an additional analysis. We acknowledged the issue and described the new analysis on page 13, lines 303 – 326:

      “Our results showed that the bi-speed response showed a strong bias toward the faster component when the speeds were slow and changed progressively from a scheme of ‘faster-component-take-all’ to ‘response-averaging’ as the speeds of the two stimulus components increased (Fig. 5F1). We found similar results when the speed separation between the stimulus components was small (×2), although the bias toward the faster component at low stimulus speeds was not as strong as at ×4 speed separation (Fig. 5A2-F2 and Table 1).

      In the regression between (𝑅 – 𝑅<sub>s</sub>) and (𝑅<sub>f</sub> – 𝑅<sub>s</sub>), 𝑅<sub>s</sub> was a common term and therefore could artificially introduce correlations. We wanted to determine whether our estimates of the regression slope (𝑤<sub>f</sub>) and the coefficient of determination (𝑅<sup>2</sup>) can be explained by this confounding factor. At each speed pair and for each neuron from the data sample of the 100 neurons shown in Figure 5, we simulated the response to the bi-speed stimuli (𝑅<sub>e</sub>) as a randomly weighted sum of 𝑅<sub>f</sub> and 𝑅<sub>s</sub> of the same neuron.

      𝑅<sub>e</sub> = 𝑎𝑅<sub>f</sub> + (1 − 𝑎)𝑅<sub>s</sub>,

      in which 𝑎 was a randomly generated weight (between 0 and 1) for 𝑅<sub>f</sub>, and the weights for 𝑅<sub>f</sub> and 𝑅<sub>s</sub> summed to one. We then calculated the regression slope and the correlation coefficient between the simulated 𝑅<sub>e</sub> - 𝑅<sub>s</sub> and 𝑅<sub>f</sub> - 𝑅<sub>s</sub> across the 100 neurons. We repeated the process 1000 times and obtained the mean and 95% confidence interval (CI) of the regression slope and the 𝑅<sup>2</sup>. The mean slope based on the simulated responses was 0.5 across all speed pairs. The estimated slope (𝑤<sub>f</sub>) based on the data was significantly greater than the simulated slope at slow speeds of 1.25/5, 2.5/10 (Fig. 5F1), and 1.25/2.5, 2.5/5, and 5/10 degrees/s (Fig. 5F2) (bootstrap test, see p values in Table 1). The estimated 𝑅<sup>2</sup> based on the data was also significantly higher than the simulated 𝑅<sup>2</sup> for most of the speed pairs (Table 1). These results suggest that the faster-speed bias at the slow stimulus speeds and the consistent response weights across the neuron population at each speed pair are not analysis artifacts.”
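
      The simulation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the component responses are synthetic placeholders drawn from a uniform distribution, the regression includes an intercept, and the function name `simulate_common_term_baseline` is hypothetical. It shows why the simulated slope averages 0.5 when the weight 𝑎 is drawn uniformly between 0 and 1.

```python
import numpy as np

def simulate_common_term_baseline(r_s, r_f, n_repeats=1000, seed=0):
    """Baseline slope and R^2 when the bi-speed response is a randomly
    weighted sum of the two component responses (R_e = a*R_f + (1-a)*R_s)."""
    rng = np.random.default_rng(seed)
    x = r_f - r_s                                   # regressor: R_f - R_s
    slopes = np.empty(n_repeats)
    r2s = np.empty(n_repeats)
    for i in range(n_repeats):
        a = rng.uniform(0.0, 1.0, size=r_s.shape)   # one random weight per neuron
        r_e = a * r_f + (1.0 - a) * r_s             # simulated bi-speed response
        y = r_e - r_s                               # shares the common term R_s with x
        slope, intercept = np.polyfit(x, y, 1)      # assumption: regression with intercept
        resid = y - (slope * x + intercept)
        slopes[i] = slope
        r2s[i] = 1.0 - resid.var() / y.var()
    return slopes, r2s

# Hypothetical trial-averaged responses (spikes/s) for 100 neurons; not real data.
rng = np.random.default_rng(1)
r_s = rng.uniform(5.0, 60.0, size=100)
r_f = rng.uniform(5.0, 60.0, size=100)

slopes, r2s = simulate_common_term_baseline(r_s, r_f)
slope_ci = np.percentile(slopes, [2.5, 97.5])       # 95% interval of simulated slopes
r2_ci = np.percentile(r2s, [2.5, 97.5])             # 95% interval of simulated R^2
print(f"simulated slope: mean={slopes.mean():.2f}, 95% CI={slope_ci}")
print(f"simulated R^2:   mean={r2s.mean():.2f}, 95% CI={r2_ci}")
```

      A slope or R<sup>2</sup> estimated from the recorded responses can then be compared against these simulated intervals, which is the logic of the comparison reported in Table 1.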

      However, I don't see why the analysis is needed at all. Can't Figure 5F be computed on its own? Rather than computing weights from the slopes in 5A-E, just compute the weights from each combination of stimulus conditions for each neuron, subject to the constraint ws=1-wf. I think this would be simpler to follow, not subject to the noise confound described in the previous point, and likely would make writing about the analysis easier.

      We initially tried the suggested approach to determine the weights of the individual neurons. The weights from each speed combination for each neuron are calculated by: 𝑤<sub>s</sub> = (𝑅<sub>f</sub> − 𝑅) / (𝑅<sub>f</sub> − 𝑅<sub>s</sub>), 𝑤<sub>f</sub> = (𝑅 − 𝑅<sub>s</sub>) / (𝑅<sub>f</sub> − 𝑅<sub>s</sub>), and 𝑤<sub>s</sub> and 𝑤<sub>f</sub> sum to 1. 𝑅, 𝑅<sub>f</sub>, and 𝑅<sub>s</sub> are the responses to the same motion direction. Using this approach to estimate response weights for individual neurons can be unreliable, particularly when 𝑅<sub>f</sub> and 𝑅<sub>s</sub> are similar. This situation often arises when the two speeds fall on opposite sides of the neuron's preferred speed, resulting in a small denominator (𝑅<sub>f</sub> − 𝑅<sub>s</sub>) and, consequently, an artificially inflated weight estimate. We therefore used an alternative approach. We estimated the response weights for the neuronal population at each speed pair using linear regression of (𝑅 − 𝑅<sub>s</sub>) against (𝑅<sub>f</sub> − 𝑅<sub>s</sub>). The slope is the weight for the faster component for the population. This approach overcame the difficulty of determining the response weights for single neurons.
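
      The contrast between the two estimators can be illustrated with a short sketch. It assumes the sum-to-one constraint, under which Eq. 5 reduces to 𝑤<sub>f</sub> = (𝑅 − 𝑅<sub>s</sub>) / (𝑅<sub>f</sub> − 𝑅<sub>s</sub>) for a single neuron. All responses, the noise level, and the "true" weight of 0.7 are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
import numpy as np

def single_neuron_weight(r, r_s, r_f):
    """Per-neuron w_f from R = w_s*R_s + w_f*R_f with w_s + w_f = 1."""
    return (r - r_s) / (r_f - r_s)

def population_weight(r, r_s, r_f):
    """Population w_f: slope of (R - R_s) regressed on (R_f - R_s) across neurons."""
    slope, _intercept = np.polyfit(r_f - r_s, r - r_s, 1)
    return slope

rng = np.random.default_rng(0)
n_neurons = 100
true_w_f = 0.7                                      # hypothetical weight, for illustration

r_s = rng.uniform(5.0, 60.0, size=n_neurons)        # synthetic slower-component responses
r_f = rng.uniform(5.0, 60.0, size=n_neurons)        # synthetic faster-component responses
# Make ten neurons respond almost equally to the two components (small denominator).
r_f[:10] = r_s[:10] + rng.uniform(-1.0, 1.0, size=10)

noise = rng.normal(0.0, 2.0, size=n_neurons)        # hypothetical measurement noise
r = true_w_f * r_f + (1.0 - true_w_f) * r_s + noise

w_single = single_neuron_weight(r, r_s, r_f)
w_pop = population_weight(r, r_s, r_f)
print("std of single-neuron estimates:", w_single.std())
print("population (regression) estimate:", w_pop)
```

      In this toy setting the per-neuron ratios scatter widely because a few near-zero denominators dominate, while the regression slope stays close to the weight used to generate the data.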

      Nevertheless, if the data provide better constraints, it is possible to estimate the response weights for each speed pair for individual neurons. For example, we can calculate the weights for single neurons by using stimuli that move in different directions at two speeds. By characterizing the full direction tuning curves for R, R<sub>f</sub>, and R<sub>s</sub>, we have sufficient data to constrain the response weights for single neurons, as we did for the speed pair of 2.5 and 10°/s in Figure 8. In future studies, we can use this approach to measure the response weights for single neurons at different speed pairs and average the weights across the neuron population.

      We explain these considerations in the Results (pages 13–14, lines 265-326) and Discussion (pages 34-35, lines 818-829).

      (4) Figure 7

      Bidirectional analysis. It would be helpful to have a bit more explanation for why this analysis is not subject to the ws=1-wf constraint. In Figure 7B, a line could be added to show what ws + wf = 1 would look like (i.e., a line with slope -1 going from (0,1) to (1,0)); it looks like these weights are a little outside that line, but there is still a negative trend suggesting competition.

      For the data set in which visual stimuli moved in the same direction at different speeds, we included a constraint that W<sub>s</sub> and W<sub>f</sub> sum to 1. This is because one cannot solve for two independent variables (W<sub>s</sub> and W<sub>f</sub>) using one equation, R = W<sub>s</sub> · R<sub>s</sub> + W<sub>f</sub> · R<sub>f</sub>, with three data values (R, R<sub>s</sub>, R<sub>f</sub>).

      In the dataset using bi-directional stimuli (now Fig. 8), we can use the full direction tuning curves to constrain the linear weighted summation (LWS) model and the normalization model. So, we did not need to impose the additional constraint that Ws and Wf sum to one, which is more general. We now clarify this in the text, on page 19, lines 421-423.
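
      As a rough illustration of why full direction-tuning curves remove the need for the sum-to-one constraint, the sketch below fits the two LWS weights by ordinary least squares across directions and reports the fraction of response variance accounted for. The von Mises tuning curves, the 60° separation between component directions, the noise level, and the generating weights (0.3 and 0.7) are all hypothetical; the paper's actual fitting procedure is described in its Methods and may differ.

```python
import numpy as np

def fit_lws(r_bi, r_s, r_f):
    """Least-squares fit of R(theta) = w_s*R_s(theta) + w_f*R_f(theta).
    The two weights are free and are not constrained to sum to one."""
    design = np.column_stack([r_s, r_f])            # one row per motion direction
    weights, *_ = np.linalg.lstsq(design, r_bi, rcond=None)
    pred = design @ weights
    ss_res = np.sum((r_bi - pred) ** 2)
    ss_tot = np.sum((r_bi - r_bi.mean()) ** 2)
    return weights, 1.0 - ss_res / ss_tot           # weights, fraction of variance explained

def von_mises_tuning(theta_deg, pref_deg, peak, kappa=2.0, base=2.0):
    """Hypothetical direction-tuning curve (spikes/s), used only as toy data."""
    d = np.deg2rad(theta_deg - pref_deg)
    return base + peak * np.exp(kappa * (np.cos(d) - 1.0))

theta = np.arange(0, 360, 30)                       # 12 directions give 12 equations
r_s = von_mises_tuning(theta, pref_deg=150, peak=40.0)   # slower-component curve
r_f = von_mises_tuning(theta, pref_deg=210, peak=30.0)   # faster-component curve

rng = np.random.default_rng(0)
r_bi = 0.3 * r_s + 0.7 * r_f + rng.normal(0.0, 1.0, size=theta.size)

(w_s, w_f), pv = fit_lws(r_bi, r_s, r_f)
print(f"w_s={w_s:.2f}, w_f={w_f:.2f}, w_s+w_f={w_s + w_f:.2f}, variance explained={pv:.3f}")
```

      With twelve directions there are twelve equations for two unknowns, so both weights are identifiable. In the same-direction data set, a single equation per speed pair leaves one weight undetermined unless the constraint is added.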

      As suggested, we added a line showing Ws + Wf = 1 for the LWS model fit (Fig. 8C) and the normalization model fit (Fig. 8D) (also see page 21, lines 482-484). Although 𝑤<sub>s</sub> and 𝑤<sub>f</sub> are not constrained to sum to one in the model fits, the fitted weights are roughly aligned with the dashed lines of Ws + Wf = 1.

      (5) Attention task

      General wording suggestions - a caution against using "attention" as a causal/mechanistic explanation as opposed to a hypothesized cognitive state. For example, "We asked whether the faster-speed bias was due to bottom-attention being drawn toward the faster stimulus component". This could be worded more conservatively as whether the bias is "still present if attention is directed elsewhere" - i.e. a description of the experimental manipulation.

      We intended to test the hypothesis of whether the faster-speed bias can be explained by attention automatically drawn to the faster component and therefore enhance the contribution of the faster component to the bi-speed response. We now state it as a possible explanation to be tested. We changed the subtitle of this section to be more conservative: “Faster-speed bias still present when attention was directed away from the RFs”, on page 18, line 363.

      We also modified the text on page 18, lines 364-367: “One possible explanation for the faster-speed bias may be that bottom-up attention is drawn toward the faster stimulus component, enhancing the response to the faster component. To address this question, we asked whether the faster-speed bias was still present if attention was directed away from the RFs.”

      Relatedly, in the Discussion, the section on "Neural mechanisms", the sentence "The faster-speed bias was not due to an attentional modulation" should be rephrased as something like 'the bias survived or was still present despite an attentional modulation requiring the monkey to attend elsewhere'.

      Our motivation for doing the attention-away experiment was to determine whether a bottom-up attentional modulation can explain the faster-speed bias. We now describe the results as suggested by the Reviewer. But we’d also like to interpret the implications of the results. In Discussion, page 34, lines 789-790, we now state: “We found that the faster-speed bias was still present when attention was directed away from the RFs, suggesting that the faster-speed bias cannot be explained by an attentional modulation.”  

      (6) "A model that accounts for the neuronal responses to bi-speed stimuli". This section opens with: "We showed that the neuronal response in MT to a bi-speed stimulus can be described by a weighted sum of the neuron's responses to the individual speed components". "Weighted average" would be more appropriate here, given that ws = 1-wf.

      As mentioned above, the added constraint of Ws+Wf = 1 was only a practical solution for determining the weights for the data set using visual stimuli moving in the same direction. More generally, Ws and Wf do not need to sum to one. As such, we prefer the wording of weighted sum.

      (7) "As we have shown previously using visual stimuli moving transparently in different directions, a classifier's performance of discriminating a bi-directional stimulus from a single-direction stimulus is worse when the encoding rule is response-averaging than biased toward one of the stimulus components" - this is important! Can this be worked into the Introduction?

      Yes, we now also mention this point in the Introduction regarding response averaging on page 4, lines 54-57: “While decoding two stimuli from a unimodal response is theoretically possible (Zemel et al., 1998; Treue et al., 2000), response averaging may result in poorer segmentation compared to encoding schemes that emphasize individual components, as demonstrated in neural coding of overlapping motion directions (Xiao and Huang, 2015).” Also, please see the response to point 1 above.

      (8) Minor, but worth catching now - is the use of initials for human participants consistent with best practices approved at your institution?

      Thanks for checking. The letters are not the initials of the human subjects. They are coded characters. We have clarified it in the legend of Figure 1, on page 7, line 168.

    1. Reviewer #2 (Public review):

      In this valuable manuscript, Lin et al attempt to examine the role of long non coding RNAs (lncRNAs) in human evolution, through a set of population genetics and functional genomics analyses that leverage existing datasets and tools. Although the methods are incomplete and at times inadequate, the results nonetheless point towards a possible contribution of long non coding RNAs to shaping humans, and suggest clear directions for future, more rigorous study.

      Comments on revisions:

      I thank the authors for their revision and changes in response to previous rounds of comments. As it had been nearly two years since I last saw the manuscript, I reread the full text to familiarise myself again with the findings presented. While I appreciate the changes made and think they have strengthened the manuscript, I still find parts of it a bit too speculative or hyperbolic. In particular, I think claims of evolutionary acceleration and adaptation require more careful integration with existing human/chimpanzee genetics and functional genomics literature. For example:

      Line 155: "About 5% of genes have significant sequence differences in humans and chimpanzees," This statement needs a citation, and a definition of what is meant by 'significant', especially as multiple lines below instead mention how it's not clear how many differences matter, or which of them, etc.

      line 187: "Notably, 97.81% of the 105141 strong DBSs have counterparts in chimpanzees, suggesting that these DBSs are similar to HARs in evolution and have undergone human-specific evolution." I do not see any support for the inference here. Identifying HARs and acceleration relies on a far more thorough methodology than what's being presented here. Even generously, pairwise comparison between two taxa only cannot polarise the direction of differences; inferring human-specific change requires outgroups beyond chimpanzee.

      line 210: "Based on a recent study that identified 5,984 genes differentially expressed between human-only and chimpanzee-only iPSC lines (Song et al., 2021), we estimated that the top 20% (4248) genes in chimpanzees may well characterize the human-chimpanzee differences" I do not agree with the rationale for this claim, and do not agree that it supports the cutoff of 0.034 used below. I also find that my previous concerns with the very disparate numbers of results across the three archaics have not been suitably addressed.

      I also think that there is still too much of a tendency to assume that adaptive evolutionary change is the only driving force behind the observed results. As I've stated before, I do not doubt that lncRNAs contribute in some way to evolutionary divergence between these species, as do other gene regulatory mechanisms; the manuscript, however, leans on it being the sole or primary force, and that requires much stronger supporting evidence. Examples include, but are not limited to:

      line 230: "These results reveal when and how HS lncRNA-mediated epigenetic regulation influences human evolution." This statement is too speculative.

      Line 268: "yet the overall results agree well with features of human evolution." What does this mean? This section is too short and unclear.

      Line 325: "and form 198876 HS lncRNA-DBS pairs with target transcripts in all tissues." This has not been shown in this paper - sequence based analyses simply identify the *potential* to form pairs.

      Line 423: "Our analyses of these lncRNAs, DBSs, and target genes, including their evolution and interaction, indicate that HS lncRNAs have greatly promoted human evolution by distinctly rewiring gene expression." I do not agree that this conclusion is supported by the findings presented - this would require significant additional evidence in the form of orthogonal datasets.

      I also return briefly to some of my comments before, in particular on the confounding effects of gene length and transcript/isoform number. In their rebuttal the authors argued that there was no need to control for this, but this does in fact matter. A gene with 10 transcripts that differ in the 5' end has 10 times as many chances of having a DBS than a gene with only 1 transcript, or a gene with 10 transcripts but a single annotated TSS. When the analyses are then performed at the gene level, without taking into account the number of transcripts, this could introduce a bias towards genes with more annotated isoforms. Similarly, line 246 focuses on genes with "SNP numbers in CEU, CHB, YRI are 5 times larger than the average." Is this controlled for length of the DBS? All else being equal a longer DBS will have more SNPs than a shorter one. It is therefore not surprising that the same genes that were highlighted above as having 'strong' DBS, where strength is impacted by length, show up here too.
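
The two confounders raised above can be made concrete with a small sketch. It assumes hypothetical records of the form `(gene, transcript, tss)` and invented SNP counts; the helper names are illustrative and do not correspond to any pipeline used in the manuscript. The first helper counts distinct TSS positions per gene, so that isoforms sharing a promoter are not counted repeatedly; the second expresses SNP counts per base pair, so that longer DBSs are not favoured simply for being longer.

```python
from collections import defaultdict


def unique_promoters_per_gene(transcripts):
    """Count distinct TSS positions per gene from (gene, transcript, tss) records."""
    tss_by_gene = defaultdict(set)
    for gene, _transcript, tss in transcripts:
        tss_by_gene[gene].add(tss)
    return {gene: len(sites) for gene, sites in tss_by_gene.items()}


def snp_density(snp_count, dbs_length_bp):
    """SNPs per base pair, which removes the trivial dependence on DBS length."""
    return snp_count / dbs_length_bp


# Hypothetical annotation: geneA has three isoforms but only two distinct TSSs.
records = [
    ("geneA", "t1", 100),
    ("geneA", "t2", 100),
    ("geneA", "t3", 250),
    ("geneB", "t1", 900),
]
promoter_counts = unique_promoters_per_gene(records)

# Two hypothetical DBSs with different raw SNP counts but the same density.
long_dbs = snp_density(10, 200)
short_dbs = snp_density(5, 100)
```

In this toy case the longer DBS carries twice as many SNPs, yet the per-base density is identical, which is the pattern the length control is meant to expose.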

    2. Author Response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public Review):

      Summary

      While DNA sequence divergence, differential expression, and differential methylation analysis have been conducted between humans and the great apes to study changes that "make us human", the role of lncRNAs and their impact on the human genome and biology has not been fully explored. In this study, the authors computationally predict HSlncRNAs as well as their DNA Binding sites using a method they have developed previously and then examine these predicted regions with different types of enrichment analyses. Broadly, the analysis is straightforward and after identifying these regions/HSlncRNAs the authors examined their effects using different external datasets.

      I no longer have any concerns about the manuscript as the authors have addressed my comments in the first round of review.

      We thank the reviewer for the valuable comments, which have helped us improve the manuscript.

      Reviewer #2 (Public Review):

      Lin et al attempt to examine the role of lncRNAs in human evolution in this manuscript. They apply a suite of population genetics and functional genomics analyses that leverage existing data sets and public tools, some of which were previously built by the authors, who clearly have experience with lncRNA binding prediction. However, I worry that there is a lack of suitable methods and/or relevant controls at many points and that the interpretation is too quick to infer selection. While I don't doubt that lncRNAs contribute to the evolution of modern humans, and certainly agree that this is a question worth asking, I think this paper would benefit from a more rigorous approach to tackling it.

      I thank the authors for their revisions to the manuscript; however, I find that the bulk of my comments have not been addressed to my satisfaction. As such, I am afraid I cannot say much more than what I said last time, emphasising some of my concerns with regards to the robustness of some of the analyses presented. I appreciate the new data generated to address some questions, but think it could be better incorporated into the text - not in the discussion, but in the results.

      We thank the reviewer for the careful reading and valuable comments. In this round of revision, we address the two main concerns: (1) there is a lack of suitable methods and/or relevant controls at many points, and (2) the interpretation is too quick to infer selection. Based on these comments, we have carefully revised all sections of the manuscript, including the Introduction, Results, Discussion, and Materials and Methods.

      In addition, we have performed two new analyses. Based on the two analyses, we have added one figure and two sections to Results, two sections to Materials and Methods, one figure to Supplementary Notes, and two tables to Supplementary Tables. These results were obtained using new methods and provided more support to the main conclusion.

      To be more thorough, we revisited the comments made in the first round and respond to them further. The following are point-by-point responses to the comments.

      Since many of the details in the Responses-To-Comments are available in published papers and eLife publishes Responses-To-Comments, we do not greatly revise supplementary notes to avoid ostensibly repeating published materials.

      “lack of suitable methods and/or relevant controls”.

      We carefully chose the methods, thresholds, and controls in the study; now, we provide clearer descriptions and explanations.

      (1) We have expanded the last paragraph in Introduction to briefly introduce the methods, thresholds, and controls.

      (2) In many places in Results and Materials and Methods, revisions are made to describe and justify methods, thresholds, and controls.

      (3) Some methods, thresholds, and controls have good consensus, such as FDR and genome-wide background, but others may not, such as the number of genes that greatly differ between humans and chimpanzees. Now, we describe our reasons for the latter situation. For example, we explain that “About 5% of genes have significant sequence differences in humans and chimpanzees, but more show expression differences due to regulatory sequences. We sorted target genes by their DBS affinity and, to be prudential, chose the top 2000 genes (DBS length>252 bp and binding affinity>151) and bottom 2000 genes (DBS length<60 bp but binding affinity>36) to conduct over-representation analysis”.

      (4) We also carefully choose proper words to make descriptions more accurate.

      Responses to the suggestion “new data generated could be better incorporated into the text”.

      (1) We think that this sentence “The occurrence of HS lncRNAs and their DBSs may have three situations – (a) HS lncRNAs preceded their DBSs, (b) HS lncRNAs and their DBSs co-occurred, (c) HS lncRNAs succeeded their DBSs. Our results support the third situation and the rewiring hypothesis”, previously in Discussion, should be better in section 2.3. We have revised it and moved it into the second paragraph of section 2.3.

      (2) Our two new analyses generated new data, and we describe them in Results.

      (3) It is possible to move more materials from Supplementary Notes to the main text, but it is probably unnecessary because the main text currently has eight sub-sections, two tables, and four figures.

      Responses to the comment “the interpretation is too quick to infer selection”.

      (1) When using XP-CLR, iSAFE, Tajima's D, Fay-Wu's H, the fixation index (Fst), and linkage disequilibrium (LD) to detect selection signals, we used the widely adopted parameters and thresholds but did not mention this clearly in the original manuscript. Now, in the first sentence of the second paragraph of section 2.4, we add the phrase “with widely-used parameters and thresholds” (more details are available in section 4.7 and Supplementary Notes).

      (2) It is not the first time we used these tests. Actually, we used these tests in two other studies (Tang et al. Uncovering the extensive trade-off between adaptive evolution and disease susceptibility. Cell Rep. 2022; Tang et al. PopTradeOff: A database for exploring population-specificity of adaptive evolution, disease susceptibility, and drug responsiveness. Comput Struct Biotechnol J. 2023). In this manuscript, section 2.5 and section 4.12 describe how we use these tests to detect signals and infer selection. We also cite the above two published papers from which the reader can obtain more details.

      (3) Also, in section 2.4, we stress that “Signals in considerable DBSs were detected by multiple tests, indicating the reliability of the analysis”.

      To further respond to the comments of “lack of suitable methods” and “this paper would benefit from a more rigorous approach to tackling it”, we have performed two new analyses. The results of the new analyses agree well with previous results and provide new support for the main conclusion. The result of section 2.5 is novel and interesting.

      We write in Discussion “Two questions are how mouse-specific lncRNAs specifically rewire gene expression in mice and how human- and mouse-specific rewiring influences the cross-species transcriptional differences”. To investigate whether the rewiring of gene expression by HS lncRNA in humans is accidental in evolution, we have made further genomic and transcriptomic analyses (Lin et al. Intrinsically linked lineage-specificity of transposable elements and lncRNAs reshapes transcriptional regulation species- and tissue-specifically. doi: https://doi.org/10.1101/2024.03.04.583292). To verify the obtained conclusions, we analyzed the spermatogenesis data from multiple species and obtained supporting evidence (not published).

      I note some specific points that I think would benefit from more rigorous approaches, and suggest possible ways forward for these.

      Much of this work is focused on comparing DNA binding domains in human-unique long-noncoding RNAs and DNA binding sites across the promoters of genes in the human genome, and I think the authors can afford to be a bit more methodical/selective in their processing and filtering steps here. The article begins by searching for orthologues of human lncRNAs to arrive at a set of 66 human-specific lncRNAs, which are then characterised further through the rest of the manuscript. Line 99 describes a binding affinity metric used to separate strong DBS from weak DBS; the methods (line 432) describe this as being the product of the DBS or lncRNA length times the average Identity of the underlying TTSs. This multiplication, in fact, undoes the standardising value of averaging and introduces a clear relationship between the length of a region being tested and its overall score, which in turn is likely to bias all downstream inference, since a long lncRNA with poor average affinity can end up with a higher score than a short one with higher average affinity, and it's not quite clear to me what the biological interpretation of that should be. Why was this metric defined in this way?

      (1) Using RNA:DNA base-pairing rules, other DBS prediction programs return just DBSs with lengths. Using RNA:DNA base-pairing rules and a variant of Smith-Waterman local alignment, LongTarget returns DBSs with lengths and identity values together with DBDs (local alignment makes DBDs and DBSs predicted simultaneously). Thus, instead of measuring lncRNA/DNA binding based on DBS length, we measure lncRNA/DNA binding based on both DBS length and DBD/DBS identity (simply called identity, which is the percentage of paired nucleotides in the RNA and DNA sequences). This allows us to define “binding affinity”. One may think that binding affinity is a more complex function of length and identity. But, according to in vitro studies (see the review Abu Almakarem et al. 2012 and citations therein, and see He et al. 2015 and citations therein), the strength of a triplex is determined by all paired nucleotides (i.e., triplet). Thus, binding affinity=length * identity is biologically reasonable.

      (2) Further, different from predicting DBS upon individual base-pairing rules such as AT-G and CG-C, LongTarget integrates base-pairing rules into rulesets, each covering A, T, C, and G (see the two figures below, which are from He et al 2015). This makes every nucleotide in the RNA and DNA sequences comparable and allows the computation of identity.

      (3) On whether LongTarget may predict unreasonably long DBSs. Three technical features of LongTarget make this highly unlikely (and more unlikely than other programs). The three features are (a) local alignment, (b) gap penalty, and (c) TT penalty (He et al. 2015).

      (4) Some researchers may think that a higher identity threshold (e.g., 0.8 or even higher) makes the predicted DBSs more reliable. This is not true. To explore plausible identity values, we analyzed the distribution of Kcnq1ot1’s DBSs in the large Kcnq1 imprinting region (which contains many known imprinted genes). We found that a high threshold for identity (e.g., 0.8) will make DBSs in many known imprinted genes fail to be predicted. Upon our analysis of many lncRNAs and upon early in vitro experiments, plausible identity values range from 0.4 to 0.8.

      (5) Is it necessary or advisable to define an identity threshold? Since identity values from 0.4 to 0.8 are plausible and identity is a property of a DBS but does not reflect the strength of the whole triplex, it is more reasonable to define a threshold for binding affinity to control predicted DBSs. As explained above, binding affinity = length*identity is a reasonable measure of the strength of a triplex. The default threshold is 60, and given an identity of 0.6 in many triplexes, a DBS with affinity=60 is about 100 bp. Compared with TF binding sites (TFBS), 100 bp is quite long. As we explain in the main text, “taking a DBS of 147 bp as an example, it is extremely unlikely to be generated by chance (p < 8.2e-19 to 1.5e-48)”.

      (6) How to validate predicted DBSs? Validation faces the following issues. (a) DBDs are predicted on the genome level, but target transcripts are expressed in different tissues and cells. So, no single transcriptomic dataset can validate all predicted DBSs of a lncRNA. Regardless of the techniques and cells used, only a small portion of predicted DBSs can be experimentally captured (validated). (b) The resolution of current experimental techniques is limited; thus, experimentally identified DBSs (i.e., “peaks”) are much longer than computationally predicted DBSs. (c) Experimental results contain false positives and false negatives. So, validation (or performance evaluation) should also consider the ROC curves (Wen et al. 2022).

      (7) As explained above, a long DBS may have a lower binding affinity than a short DBS. A biological interpretation is that the long DBS may accumulate mutations that decrease its binding ability gradually.
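
The arithmetic behind the affinity metric discussed in points (1) to (7) can be sketched as follows. This is only an illustration of the stated definition (binding affinity = length × identity) and of the default threshold of 60; the function names and the two example DBSs are hypothetical and are not LongTarget code or output.

```python
def binding_affinity(length_bp, identity):
    """Affinity as defined in the response: DBS length times average identity."""
    return length_bp * identity


def min_length_for_threshold(threshold, identity):
    """Shortest DBS length (bp) that reaches a given affinity at a given identity."""
    return threshold / identity


# Default threshold of 60 at an identity of 0.6 corresponds to about 100 bp.
min_len = min_length_for_threshold(60, 0.6)

# Hypothetical DBSs illustrating the length dependence raised by the reviewer:
# a long, low-identity site can outscore a short, high-identity site.
long_low_identity = binding_affinity(300, 0.4)
short_high_identity = binding_affinity(100, 0.8)
```

Under this definition the 300 bp site at identity 0.4 scores higher than the 100 bp site at identity 0.8, so any ranking by affinity carries a length component that downstream analyses may need to account for.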

      There is also a strong assumption that identified sites will always be bound (line 100), which I disagree is well-supported by additional evidence (lines 109-125). The authors show that predicted NEAT1 and MALAT1 DBS overlap experimentally validated sites for NEAT1, MALAT1, and MEG3, but this is not done systematically, or genome-wide, so it's hard to know if the examples shown are representative, or a best-case scenario.

      (1) We did not make this assumption. Clearly, binding depends on multiple factors, including co-expression of genes and specific cellular context.

      (2) On the second issue, “this is not done systematically, or genome-wide”. We performed this analysis genome-wide but did not show all results (supplementary fig 2 shows three genomic regions, which agree well with the experimental data). In Wen et al. 2022, we describe the overall results.

      It's also not quite clear how overlapping promoters or TSS are treated - are these collapsed into a single instance when calculating genome-wide significance? If, eg, a gene has five isoforms, and these differ in the 3' UTR but their promoter region contains a DBS, is this counted five times, or one? Since the interaction between the lncRNA and the DBS happens at the DNA level, it seems like not correcting for this uneven distribution of transcripts is likely to skew results, especially when testing against genome-wide distributions, eg in the results presented in sections 5 and 6. I do not think that comparing genes and transcripts putatively bound by the 40 HS lncRNAs to a random draw of 10,000 lncRNA/gene pairs drawn from the remaining ~13500 lncRNAs that are not HS is a fair comparison. Rather, it would be better to do many draws of 40 non-HS lncRNAs and determine an empirical null distribution that way, if possible actively controlling for the overall number of transcripts (also see the following point).

      (1) We predicted DBSs in the promoter region of 179128 Ensembl-annotated transcripts and did not merge DBSs (there is no need to merge them). If multiple transcripts share the same TSS, they may share the same DBS, which is natural.

      (2) If the DBSs of multiple transcripts of a gene overlap, the overlap does not raise a problem for lncRNA/DNA binding analysis in specific tissues because usually only one transcript is expressed in a tissue. Therefore, the situation described in the comment (“If, e.g., a gene has five isoforms, and these differ in the 3' UTR but their promoter region contains a DBS, is this counted five times, or one?”) does not arise.

      (3) It is unclear to us what “it seems like not correcting for this uneven distribution of transcripts is likely to skew results” means. Regarding testing against genome-wide distributions, statistically, it is beneficial to make many rounds of random draws genome-wide, but this will take a huge amount of time. Since more variables demand more rounds of drawing, to our knowledge, this is not widely practiced in large-scale transcriptomic data analyses.

      (4) If the difference (result) is small and thus calls for rigorous statistical testing, making many rounds of random draws genome-wide is necessary. In our results, “45% of these pairs show a significant expression correlation in specific tissues (Spearman's |rho| >0.3 and FDR <0.05). In contrast, when randomly sampling 10000 pairs of lncRNAs and protein-coding transcripts genome-wide, the percent of pairs showing this level of expression correlation (Spearman's |rho| >0.3 and FDR <0.05) is only 2.3%”.
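
The resampling scheme proposed by the reviewer (many draws of 40 non-HS lncRNAs to build an empirical null) can be sketched as below. This is a minimal illustration under stated assumptions: each background lncRNA is summarised by a single hypothetical number (the fraction of its target pairs that are correlated), the background values are synthetic, and `empirical_p_value` is an illustrative helper, not part of the authors' analysis.

```python
import random


def empirical_p_value(observed, background, set_size, n_draws=1000, seed=0):
    """One-sided empirical p-value from repeated draws of equally sized sets."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        # Draw a random set of background lncRNAs of the same size as the test set.
        sample = rng.sample(background, set_size)
        if sum(sample) / set_size >= observed:
            exceed += 1
    # Add-one correction so the estimate is never exactly zero.
    return (exceed + 1) / (n_draws + 1)


# Synthetic background: 13500 non-HS lncRNAs with correlated fractions of 0.00-0.09.
background = [(i % 10) / 100 for i in range(13500)]

# Hypothetical observed mean fraction for the 40 HS lncRNAs.
p_high = empirical_p_value(0.45, background, set_size=40)
p_low = empirical_p_value(0.0, background, set_size=40)
```

Because each draw matches the size of the test set, the resulting p-value reflects set-level variability, which a single draw of 10,000 pairs does not capture; matching on transcript number would require stratifying `background` before sampling.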

      Thresholds for statistical testing are not consistent, or always well justified. For instance, in line 142 GO testing is performed on the top 2000 genes (according to different rankings), but there's no description of the background regions used as controls anywhere, or of why 2000 genes were chosen as a good number to test? Why not 1000, or 500? Are the results overall robust to these (and other) thresholds? Then line 190 the threshold for downstream testing is now the top 20% of genes, etc. I am not opposed to different thresholds in principle, but they should be justified.

      (1) We used the g:Profiler program to perform over-representation analysis to identify enriched GO terms. This analysis is used to determine what pre-defined gene sets (GO terms) are more present (over-represented) in a list of “interesting” genes than what would be expected by chance. Specifically, this analysis is often used to examine whether the majority of genes in a pre-defined gene set fall in the extremes of a list: the top and bottom of the list, for example, may correspond to the largest differences in expression between the two cell types. g:Profiler always takes the whole genome as the reference; that is why we did not mention the whole genome reference. We now add in section 2.2 “(with the whole genome as the reference)”.

      (2) The choice of 2000 rather than 2500 genes is somewhat subjective. We now explain that “About 5% of genes have significant sequence differences in humans and chimpanzees, but more show expression differences due to regulatory sequences. We sorted target genes by their DBS affinity and, to be prudential, chose the top 2000 genes (DBS length>252 bp and binding affinity>151) and bottom 2000 genes (DBS length<60 bp but binding affinity>36) to conduct over-representation analysis”.

      Likewise, comparing Tajima's D values near promoters to genome-wide values is unfair, because promoters are known to be under strong evolutionary constraints relative to background regions; as such it is not surprising that the results of this comparison are significant. A fairer comparison would attempt to better match controls (eg to promoters without HS lncRNA DBS, which I realise may be nearly impossible), or generate empirical p-values via permutation or simulation.

      We used these tests to detect selection signals in DBSs but not in the whole promoter regions. Using promoters without HS lncRNA DBS as the control also has risks because promoter regions contain other kinds of regulatory sequences.

      There are huge differences in the comparisons between the Vindija and Altai Neanderthal genomes that to me suggest some sort of technical bias or the such is at play here. e.g. line 190 reports 1256 genes to have a high distance between the Altai Neanderthal and modern humans, but only 134 Vindija genes reach the same threshold of 0.034. The temporal separation between the two specimens does not seem sufficient to explain this difference, nor the difference between the Altai Denisovan and Neanderthal results (2514 genes for Denisovan), which makes me wonder if it is a technical artefact relating to the quality of the genome builds? It would be worth checking.

      We feel it is hard to know whether or not the temporal separation between these specimens is sufficient to explain the differences because many details of archaic humans and their genomes remain unknown and because mechanisms determining genotype-phenotype relationships remain poorly known. After 0.034 was determined, these numbers of genes were determined accordingly. We chose parameters and thresholds that best suit the most important requirements, but these parameters and thresholds may not best suit other requirements; this is a problem for all large-scale studies.     

      Inferring evolution: There are some points of the manuscript where the authors are quick to infer positive selection. I would caution that GTEx contains a lot of different brain tissues, thus finding a brain eQTL is a lot easier than finding a liver eQTL, just because there are more opportunities for it. Likewise, claims in the text and in Tables 1 and 2 about the evolutionary pressures underlying specific genes should be more carefully stated. The same is true when the authors observe high Fst between groups (line 515), which is only one possible cause of high Fst - population differentiation and drift are just as capable of giving rise to it, especially at small sample sizes.

      (1) We add in Discussion that “Finally, not all detected signals reliably indicate positive selection”.

      (2) Our results are that more signals are detected in CEU and CHB than in YRI; this agrees with all population genetics studies and implies that our results are not wrongly biased because more samples and larger samples were obtained from CEU and CHB.

    1. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The manuscript of Odermatt et al. investigates the volatiles released by two species of Desmodium plants and the response of herbivores to maize plants alone or in combination with these species. The results show that Desmodium releases volatiles in both the laboratory and the field. Maize grown in the laboratory also released volatiles, in a similar range. While female moths preferred to oviposit on maize, the authors found no evidence that Desmodium volatiles played a role in lowering attraction to or oviposition on maize.

      Strengths:

      The manuscript is a response to recently published papers that presented conflicting results with respect to whether Desmodium releases volatiles constitutively or in response to biotic stress, the level at which such volatiles are released, and the behavioral effect it has on the fall armyworm. These questions are relevant as Desmodium is used in a textbook example of pest-suppressive sustainable intercropping technology called push-pull, which has supported tens of thousands of smallholder farmers in suppressing moth pests in maize. A large number of research papers over more than two decades have implied that Desmodium suppresses herbivores in push-pull intercropping through the release of large amounts of volatiles that repel herbivores. This premise has been questioned in recent papers. Odermatt et al. thus contribute to this discussion by testing the role of odors in oviposition choice. The paper confirms that ovipositing FAW preferred maize, and also confirmed that odors released from Desmodium appeared not important in their bioassays.

      The paper is a welcome addition to the literature and adds quality headspace analyses of Desmodium from the laboratory and the field. Furthermore, the authors, some of whom have since long contributed to developing push-pull, also find that Desmodium odors are not significant in their choice between maize plants. This advances our knowledge of the mechanisms through which push-pull suppresses herbivores, which is critically important to evolving the technique to fit different farming systems and translating this mechanism to fit with other crops and in other geographical areas.

      Thank you for your careful assessment of our manuscript.

      Weaknesses:

      Below I outline the major concerns:

      (1) Clear induction of the experimental plants, and lack of reflective discussion around this: from literature data and previous studies of maize and Desmodium, it is clear that the plants used in this study, particularly the Desmodium, were induced. Maize appeared to be primarily manually damaged, possibly due to sampling (release of GLV, but little to no terpenoids, which is indicative of mostly physical stress and damage; see, for example, one of the coauthors' own papers, Tamiru et al. 2011), whereas Desmodium releases a blend of many compounds (many terpenoids indicative of herbivore induction). Erdei et al. also clearly show that under controlled conditions maize, silver leaf and green leaf Desmodium release volatiles in very low amounts. While the condition of the plants in Odermatt et al. may be reflective of situations in push-pull fields, the authors should elaborate on the above in the discussion (see comments) such that readers understand the plants' condition during the experiments. This is particularly important because it has been assumed that Desmodium releases typical herbivore-induced volatiles constitutively, which is not the case (see Erdei et al. 2024). This reflection is currently lacking in the manuscript.

      We acknowledge the need for a more reflective discussion on the possible causes of volatile emission due to physical damage. Although the field plants were carefully handled, it is possible that some physical stress may have contributed to the release of volatiles, such as green leaf volatiles (GLVs). We ensured the revised manuscript reflects this nuanced interpretation (lines 282 – 286). However, we also explained more clearly that our aim was to capture the volatile emission of plants used by farmers under realistic conditions and moth responses to these plants, not to be able to attribute the volatile emission to a specific cause (lines 115 – 117). We revised relevant passages throughout the results and discussion to ensure that we do not make any claims about the reason for volatile emissions, and that our claims regarding these plants and their headspace being representative of the system as practiced by farmers are supported. In the revised manuscript we provide a new supplementary table S2 that additionally shows the classification of the identified substances, which also shows that the majority of the substances that were found in the headspace of the sampled plants of Desmodium intortum or Desmodium incanum are monoterpenes, sesquiterpenes, or aromatic compounds, and not GLVs (that are typically emitted following damage).

      (2) Lack of controls that would have provided context to the data: The experiments lack important controls that would have helped in the interpretation:

      2a The authors did not control the conditions of the plants. To understand the release of volatiles and their importance in the field, the authors should have included controlled herbivory in both maize and Desmodium. This would have placed the current volatile profiles in a herbivory context. Now the volatile measurements hang in midair, leading to discussions that are not well anchored (and should be rephrased thoroughly, see e.g. lines 183-188). It is well known that maize releases only very low levels of volatiles without abiotic and biotic stressors. However, this changes upon stress (GLVs by direct, physical damage and e.g. terpenoids upon herbivory, see above). Erdei et al. confirm this pattern in Desmodium. Not having these controls means that the authors need to put the data in the context of what has been published (see above).

      We appreciate this concern. Our study aimed to capture the real-world conditions of push-pull fields, where Desmodium and maize grow in natural environments without the direct induction of herbivory for experimental purposes (lines 115 – 117). We agree that in further studies it would be important to carry out experiments under different environmental conditions, including herbivore damage. However, this was not within the scope of the present study.

      2b It would also have been better if the authors had sampled maize from the field while sampling Desmodium. Together with the above point (inclusion of herbivore-induced maize and Desmodium), the levels of volatile release by Desmodium would have been placed into context.

      We acknowledge that sampling maize and other intercrop plants, such as edible legumes, alongside Desmodium in the push-pull field would have allowed us to make direct comparisons of the volatile profiles of different plants in the push-pull system under shared field conditions. Again, this should be done in future experiments but was beyond the scope of the present study. Due to the number of samples we could handle given cost and workload, we chose to focus on Desmodium because there is much less literature on the volatile profiles of field-grown Desmodium than maize plants in the field: we are aware of one study attempting to measure field volatile profiles from Desmodium intortum (Erdei et al. 2024) and no study attempting this for Desmodium incanum. We pointed out this justification for our focus on Desmodium in the manuscript (lines 435 - 439). Additionally, we suggested in the discussion that future studies should measure volatile profiles from all plants commonly used in push-pull systems alongside Desmodium (lines 267 – 269).

      2c To put the volatiles release in the context of push-pull, it would have been important to sample other plants which are frequently used as intercrop by smallholder farmers, but which are not considered effective as push crops, particularly edible legumes. Sampling the headspace of these plants, both 'clean' and herbivore-induced, would have provided a context to the volatiles that Desmodium (induced) releases in the field - one would expect unsuccessful push crops to not release any of these 'bioactive' volatiles (although 'bioactive' should be avoided) if these odors are responsible for the pest suppressive effect of Desmodium. Many edible intercrops have been tested to increase the adoption of push-pull technology but with little success.

      We very much agree that such measurements are important for the longer-term research program in this field. But again, for the current study this would have exploded the size of the required experiment. Regarding bioactivity, we have been careful to use the phrase "potentially bioactive" solely when referring to findings from the literature (lines 99–103), in order to avoid making any definitive claims about our own results.

      Because of the lack of the above, the conclusions the authors can draw from their data are weakened. The data are still valuable in the current discussion around push-pull, provided that a proper context is given in the discussion along the points above.

      We think our revisions made the specific aims of this study more explicit and help to avoid misleading claims.

      (3) 'Tendency' of the authors to accept the odor hypothesis (i.e. that Desmodium odors are responsible for repelling FAW and thereby reduce infestation in maize under push-pull management) in spite of their own data: The authors tested the effects of odor in oviposition choice, both in a cage assay and in a 'wind tunnel'. From the cage experiments, it is clear that FAW preferred maize over Desmodium, confirming other reports (including Erdei et al. 2024). However, when choosing between two maize plants, one of which was placed next to Desmodium to which FAW has no tactile (taste, structure, etc.) access, FAW chose equally. Similarly in their wind tunnel setup (this term should not be used to describe the assay, see below), no preference was found either between maize odor in the presence or absence of Desmodium. This too confirms results obtained by Erdei et al. (but adds an important element to it by using Desmodium plants that had been induced and released volatiles, contrary to Erdei et al. 2024). Even though no support was found for repellency by Desmodium odors, the authors in many instances in the manuscript (lines 30-33, 164-169, 202, 279, 284, 304-307, 311-312, 320) appear to elevate non-significant tendencies as being important. This is misleading readers into thinking that these interactions were significant and in fact confirming this in the discussion. The authors should stay true to their own data obtained when testing the hypothesis of whether odors play a role in the pest-suppressive effect of push-pull.

      We appreciate this feedback and agree that we may have overstated claims that could not be supported by strict significance tests. However, we believe that non-significant tendencies can still provide valuable insights. In the revised version of the manuscript, we ensured a clear distinction between statistically significant findings and non-significant trends and removed any language that may imply stronger support for the odor hypothesis than what the data show in all the lines that were mentioned.

      (4) Oviposition bioassay: with so many assays in close proximity, it is hard to certify that the experiments are independent. Please discuss this in the appropriate place in the discussion.

      We have pointed this out in the submitted manuscript in lines 275 – 279. Furthermore, we included detailed captions to figure 4 - supporting figure 3 & figure 4 - supporting figure 4. We are aware that in all such experiments there is a danger of between-treatment interference, which we pointed out for our specific case. We stated that with our experimental setup we tried to minimize interference between treatments by spacing and temporal staggering. We would like to point out that this common caveat does not invalidate experimental designs when practicing replication and randomization. We assume that insects are able to select suitable oviposition sites in the background of such confounding factors under realistic conditions.

      (5) The wind tunnel has a number of issues (besides being poorly detailed):

      5a. The setup which the authors refer to as a 'wind tunnel' does not qualify as a wind tunnel. First, there is no directional flow: there are two flows entering the setup at opposite sides. Second, the flow is way too low for moths to orient in (in a wind tunnel, wind should be presented as a directional cue). Only around 1.5 l/min enters the wind tunnel in a volume of approximately 90 l, which does not create any directional flow. Solution: change 'wind tunnel' throughout the text to a dual choice setup/assay.

      We agree with these criticisms and changed the terminology accordingly from ‘wind tunnel’ to ‘dual choice assay’. We have now conducted an additional experiment which we called ‘no-choice assay’ that provides conditions closer to a true wind tunnel. The setup of the added experiment features an odor entry point at only one side of the chamber to create a more directional airflow. Each treatment (maize alone, maize + D. intortum, maize + D. incanum, and a control with no plants) was tested separately, with only one treatment conducted per evening to avoid cross-contamination, as described in the methods section of the no-choice assay.

      5b. There is no control over the flows in the flight section of the setup. It is very well possible that moths at the release point may only sense one of the 'options'. Please discuss this.

      We added this to the discussion (lines 369 – 374). The new no-choice assays also address this concern by using a setup with laminar flow.

      5c. Too low a flow (1.5 l per minute) implies largely stagnant air, which means cross-contamination between experiments. An experiment takes 5 minutes, but it takes minimally 1.5 hours at these flows to replace the flight chamber air (but in reality much longer as the fresh air does not replace the old air, but mixes with it). The setup does not seem to be equipped with e.g. fans to quickly vent the air out of the setup. See comments in the text. Please discuss the limitations of the experimental setup at the appropriate place in the discussion.

      We added these limitations to the discussion and addressed these concerns with new experiments (see answer 5a).

      5d. The stimulus air enters through a tube (what type of tube, diameter, length, etc.?) containing pressurized air (how was the air introduced into the bags? what type of bag, and how was it sealed?), with efflux directly into the flight chamber (how? through a nozzle?). However, it seems that there is no control of the efflux. How was leakage prevented, and in particular, how were the bags sealed airtight around the plants?

      We added the missing information to the methods and provided details about types of bags, manufacturers, and pre-treatments in the method section. In short, PTFE tubes connected bagged plants to the bioassay setup and air was pumped in at an overpressure, so leakage was not eliminated but contamination from ambient air was avoided.

      5e. The plants were bagged in very narrowly fitting bags. The maize plants look bent and damaged, which probably explains the GLVs found in the samples. The Desmodium in the picture (Figure 5 supplement, which we should assume is at least a representative picture?) appears to be rather crammed into the bag with maize and looks in rather poor condition to start with (perhaps also indicating why they release these volatiles?). It would be good to describe the sampling of the plants in detail and explain that the way they were handled may have caused the release of GLVs.

      We added a more detailed description of the plant handling and bagging processes to the methods to clarify how the plants were treated during the dual-choice and the no-choice assays reported in the revised manuscript. We politely disagree that the maize plants were damaged and that the Desmodium plants were not representative of those encountered in the field. The plants were grown in insect-proof screen houses to prevent damage by insects and were carefully curved to fit into the bag without damaging them. The Desmodium plant pictured was D. incanum, which has sparser foliage and smaller leaves than D. intortum.

      (6) Figure 1 seems redundant as a main figure in the text. Much of the information is not pertinent to the paper. It can be used in a review on the topic. Or perhaps if the authors strongly wish to keep it, it could be placed in the supplemental material.

      We think that Figure 1 provides essential information about the push-pull system and the FAW. To our knowledge, this partly contradictory evidence so far has not been synthesized in the literature. We realize that such a figure would more commonly be provided in a review article, but we do not think that the small number of studies on this topic so far justify a stand-alone review. Instead, the introduction to our manuscript includes a brief review of these few studies, complemented by the visual summary provided in Figure 1 and a detailed supplementary table.

      Reviewer #2 (Public review):

      Based on the controversy of whether the Desmodium intercrop emits bioactive volatiles that repel the fall armyworm, the authors conducted this study to assess the effects of the volatiles from Desmodium plants in the push-pull system on behavior of FAW oviposition. This topic is interesting and the results are valuable for understanding the push-pull system for the management of FAW, the serious pest. The methodology used in this study is valid, leading to reliable results and conclusions. I just have a few concerns and suggestions for improvement of this paper:

      (1) The volatiles emitted from D. incanum were analyzed and their effects on the oviposition behavior of FAW moth were confirmed. However, it would be better and useful to identify the specific compounds that are crucial for the success of the push-pull system.

      We fully agree that identifying specific volatile compounds responsible for the push-pull effect would provide valuable insights into the underlying mechanisms of the system. However, the primary focus of this study was to address the still unresolved question whether Desmodium emits detectable or “significant” amounts of volatiles at all under field conditions, and the secondary aim was to test whether we could demonstrate a behavioral effect of Desmodium headspace on FAW moths. Before conducting our experiments, we carefully considered the option of using single volatile compounds and synthetic blends in bioassays. We decided against this because we judged that the contradictory evidence in the literature was not a sufficient basis for composing representative blends. Furthermore, we think it is an important first step to test for behavioral responses to the headspaces of real plants. We consider bioassays with pure compounds to be important for confirmation and more detailed investigation in future studies. There was also contradictory evidence in the literature regarding moth responses to plants. We thus opted to focus on experiments with whole plants to maintain ecological relevance.

      (2) It would be good to add "symbols" of significance in Figure 4 (D).

      We report the statistical significance of the parameters in Figure 4 (D) in Table 3, which shows the mixed model applied for oviposition bioassays. While testing significance between groups is a standard approach, we used a more robust model-based analysis to assess the effects of multiple factors simultaneously. We provided a cross-reference to Table 3 from the figure description of Figure 4 (D) for readers to easily find the statistical details.

      (3) Figure A is difficult for readers to understand.

      Unfortunately, it is not entirely clear which specific figure is being referred to as "Figure A" in this comment. We tried to keep our figures as clear as possible.

      (4) It would be good to discuss in greater depth, in the discussion, the functions of the important volatile compounds identified here in comparison with results from previous studies.

      Our study does not provide strong evidence that specific volatiles from Desmodium plants are important determinants of FAW oviposition or choice in the push-pull system. Therefore, we prefer to refrain from detailed discussions of the potential importance of individual compounds. However, in the revised version, we provide an additional table S2 which identifies the overlap with volatiles previously reported from Desmodium, as only the total numbers are summarized in the discussion of the submitted paper.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The points raised are largely self-explanatory as to what needs to be done to fully resolve them. At a minimum the text needs to be seriously revised to:

      (1) reflect the data obtained.

      (2) reflect on the limitations of their experimental setup and data obtained.

      (3) put the data obtained and their limitations in the context of what they tell us and, particularly, what they do not. Ideally, additional headspace measurements are taken, including from herbivory and 'clean' maize and Desmodium (in which there is better control of biotic and abiotic stress), as well as other crops commonly planted as companion crops with maize (but none of them reducing pest pressure).

      Thank you for this summary. Please see our detailed responses above.

      In addition to the main points of critique provided above, I have provided additional comments in the text (https://elife-rp.msubmit.net/elife-rp_files/2024/07/18/00134767/00/134767_0_attach_28_25795_convrt.pdf). These elaborate on the above points and include some new ones too. These are the major points of critique, which I hope the authors can address.

      Thank you very much for these detailed comments.

      Reviewer #2 (Recommendations for the authors):

      It is important to note that the original push-pull system was developed against stemborers and involved Napier grass (still used) around the field, which attracts stemborer moths, and Molasses grass as the intercrop that repels the moths and attracts parasitoids. Later, Molasses grass was replaced by desmodiums because it is a legume that fixes nitrogen and therefore can increase nitrate levels in the soil, but most importantly because it prevents germination of the parasitic Striga weed. The possible repellent effect of desmodium on pests and attraction of natural enemies was never properly tested but assumed, probably to still be able to use the push-pull terminology. This "mistake" should be recognized here and in future publications. It is a real pity that the controversy over the repellent effect of desmodium distracts from the amazing success of the push-pull system, also against the fall armyworm.

      We thank the reviewer for pointing out these issues, which are part of the reason for our Figure 1 and why we would like to keep it. We have described this development of the system in the introduction to better present the push-pull system. Our aim in Figure 1 and Table S1 is to highlight both the evidence of the system's success, and the gaps in our understanding, regarding specifically control of damage from the FAW.

    1. Reviewer #1 (Public review):

      Summary:

      This paper addresses an important and topical issue: how temporal context, at various time scales, affects various psychophysical measures, including reaction times, accuracy, and localization. It offers interesting insights, with separate mechanisms for different phenomena, which are well discussed.

      Strengths:

      The paradigm used is original and effective. The analyses are rigorous.

      Weaknesses:

      Here I make some suggestions for the authors to consider. Most are stylistic, but the issue of precision may be important.

      (1) The manuscript is quite dense, with some concepts that may prove difficult for the non-specialist. I recommend spending a few more words (and maybe some pictures) describing the difference between task-relevant and task-irrelevant planes. Nice technique, but not instantly obvious. Then we are hit with "stimulus-related", which definitely needs some words (also because it is orthogonal to neither of the above).

      (2) While I understand that the authors want the three classical separations, I actually found it misleading. Firstly, for a perceptual scientist to call intervals in the order of seconds (rather than milliseconds), "micro" is technically coming from the raw prawn. Secondly, the divisions are not actually time, but events: micro means one-back paradigm, one event previously, rather than defined by duration. Thirdly, meso isn't really a category, just a few micros stacked up (and there's not much data on this). And macro is basically patterns, or statistical regularities, rather than being a fixed time. I think it would be better either to talk about short-term and long-term, which do not have the connotations I mentioned. Or simply talk about "serial dependence" and "statistical regularities". Or both.

      (3) More serious is the issue of precision. Again, this is partially a language problem. When people use the engineering terms "precision" and "accuracy" together, they usually use the same units, such as degrees. Accuracy refers to the distance from the real position (so average accuracy gives bias), and precision is the clustering around the average bias, usually measured as standard deviation. Yet here accuracy is percent correct: also a convention in psychology, but not when contrasting accuracy with precision, in the engineering sense. I suggest you change "accuracy" to "percent correct". On the other hand, I have no idea how precision was defined. All I could find was: "mixture modelling was used to estimate the precision and guess rate of reproduction responses, based on the concentration (k) and height of von Mises and uniform distributions, respectively". I do not know what that means.

      (4) Previous studies show serial dependence can increase bias but decrease scatter (inverse precision) around the biased estimate. The current study claims to be at odds with that. But are the two measures of precision relatable? Was the real (random) position of the target subtracted from each response, leaving residuals from which the inverse precision was calculated? (If so, the authors should say so.) But if serial dependence biases responses in essentially random directions (depending on the previous position), it will increase the average scatter, decreasing the apparent precision.

      (5) I suspect they are not actually measuring precision, but location accuracy. So the authors could use "percent correct" and "localization accuracy". Or be very clear what they are actually doing.
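
      For readers unfamiliar with the mixture-model definition of precision quoted in point (3), a minimal sketch may help. It assumes the common form of such models, in which each response error (in radians) is drawn either from a von Mises distribution centred on the target, with concentration kappa, or from a uniform distribution over the circle with probability equal to the guess rate. Whether this matches the authors' implementation is an assumption; the function names, the grid-search fit, and the simulated parameter values are all hypothetical and chosen only for illustration.

```python
import math
import random


def bessel_i0(x, terms=60):
    """Modified Bessel function I0(x) from its power series (adequate for moderate x)."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / (2.0 * k)) ** 2
        total += term
    return total


def mixture_pdf(error, kappa, guess_rate):
    """Density of a response error: von Mises component plus uniform guessing."""
    von_mises = math.exp(kappa * math.cos(error)) / (2.0 * math.pi * bessel_i0(kappa))
    uniform = 1.0 / (2.0 * math.pi)
    return (1.0 - guess_rate) * von_mises + guess_rate * uniform


def simulate_errors(n, kappa, guess_rate, seed=0):
    """Draw n synthetic response errors (illustrative data, not from the study)."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n):
        if rng.random() < guess_rate:
            errors.append(rng.uniform(-math.pi, math.pi))
        else:
            # Returns an angle in [0, 2*pi); the density is periodic, so no wrapping is needed.
            errors.append(rng.vonmisesvariate(0.0, kappa))
    return errors


def fit_mixture(errors, kappas=None, guess_rates=None):
    """Coarse grid-search maximum likelihood for (kappa, guess_rate)."""
    kappas = kappas or [float(k) for k in range(1, 21)]          # 1 ... 20
    guess_rates = guess_rates or [0.02 * i for i in range(26)]   # 0.00 ... 0.50
    cosines = [math.cos(e) for e in errors]
    uniform = 1.0 / (2.0 * math.pi)
    best = (-math.inf, None, None)
    for kappa in kappas:
        norm = 2.0 * math.pi * bessel_i0(kappa)
        vm = [math.exp(kappa * c) / norm for c in cosines]
        for g in guess_rates:
            log_lik = sum(math.log((1.0 - g) * v + g * uniform) for v in vm)
            if log_lik > best[0]:
                best = (log_lik, kappa, g)
    return best[1], best[2]


# Usage: recover assumed parameters from synthetic errors.
errors = simulate_errors(1000, kappa=8.0, guess_rate=0.2)
kappa_hat, guess_hat = fit_mixture(errors)
print("fitted kappa:", kappa_hat, "fitted guess rate:", guess_hat)
```

      In this formulation a larger kappa means responses cluster more tightly around the target, while the guess rate sets the height of the flat, uniform floor. Because the errors are taken relative to the true target position, any systematic pull toward varying previous positions would widen the fitted distribution, which is the concern raised in point (4).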

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1:

      (1) Developmental time series:

      It was not entirely clear how this experiment relates to the rest of the manuscript, as it does not compare any effects of transport within or across species.

      Implemented Changes:  

      The importance of species arrival timing for community assembly is addressed in both the introduction and discussion. To accommodate the reviewer’s concerns and further emphasize this point, we have added a clarifying sentence to the results section and included an illustrative example with supporting literature in the discussion.

      Results: Clarifying the timing of initial microbial colonization is essential for determining whether and how priority effects mediate community assembly of vertically transmitted microbes in early life, or whether these microbes arrive into an already established microbial landscape. We used non-sterile frogs of our captive laboratory colony (…)

      Discussion: For example, early microbial inoculation has been shown to increase the relative abundance of beneficial taxa such as Janthinobacterium lividum (Jones et al., 2024), whereas efforts to introduce the same probiotic into established adult communities have not led to long-term persistence (Bletz, 2013; Woodhams et al., 2016).  

      (2) Cross-foster experiment:

      The "heterospecific transport" tadpoles were manually brushed onto the back of the surrogate frog, while the "biological transport" tadpoles were picked up naturally by the parent. It is a little challenging to interpret the effect of caregiver species since it is conflated with the method of attachment to the parent. I noticed that the uptake of Os-associated microbes by Os-transported tadpoles seemed to be higher than the uptake of Rv-associated microbes by Rv-associated tadpoles (comparing the second box from the left to the rightmost boxplot in panel S2C). Perhaps this could be a technical artifact if manual attachment to Os frogs was more efficient than natural attachment to Rv frogs.

      I was also surprised to see so much of the tadpole microbiome attributed to Os in tadpoles that were not transported by Os frogs (25-50% in many cases). It suggests that SourceTracker may not be effectively classifying the taxa.

      Implemented Changes:  

      Methods (Study species, reproductive strategies and life history): Oophaga sylvatica (Os) (Funkhouser, 1956; CITES Appendix II, IUCN Conservation status: Near Threatened) is a large, diurnal poison frog (family Dendrobatidae) inhabiting lowland and submontane rainforests in Colombia and Ecuador. While male Os care for the clutch of up to seven eggs, females transport 1-2 tadpoles at a time to water-filled leaf axils where tadpoles complete their development (Pašukonis et al., 2022; Silverstone, 1973; Summers, 1992). Notably, females return regularly to these deposition sites to provision their offspring with unfertilized eggs.

      Discussion: Most poison frogs transport tadpoles on their backs, but the mechanism of adherence remains unclear. Similar to natural conditions, tadpoles that are experimentally placed onto a caregiver’s back also gradually adhere to the dorsal skin, where they remain firmly attached for several hours as the adult navigates dense terrain. Although transport durations were standardized, species-specific factors (such as microbial density at the contact site, microbial taxa identity, and skin physiology such as moisture) could influence microbial transmission between the transporting frog and the tadpole. While these differences may have contributed to varying transmission efficacies observed between the two frog species in our experiment, none of these factors should compromise the correct microbial source assignment. We thus conclude that transporting frogs serve as a source of microbiota for transported tadpoles. However, further studies on species-specific physiological traits and adherence mechanisms are needed to clarify what modulates the efficacy of microbial transmission during transport, both under experimental and natural conditions.

      Methods (Vertical transmission): Cross-fostering tadpoles onto non-parental frogs has been used previously to study navigation in poison frogs (Pašukonis et al., 2017). According to our experience, successful adherence to both parent and heterospecific frogs depends on the developmental readiness of tadpoles, which must have retracted their gills and be capable of hatching from the vitelline envelope through vigorous movement. Another factor influencing cross-fostering success is the docility of the frog during initial attachment, as erratic movements easily dislodge tadpoles before adherence is established. Rv are small, jumpy frogs that are easily stressed by handling, making experimental fostering of tadpoles (even their own) impractical. Therefore, we favored an experimental design where tadpoles initiate natural transport and parental frogs pick them up with a 100% success rate. We chose the poison frog Os as foster frogs because adults are docile, parental care in this species involves transporting tadpoles, and skin microbial communities differ from those of Rv, a critical prerequisite for our SourceTracker analysis. The use of the docile Os as the foster species enabled a 100% cross-fostering success rate, with no notable differences in adherence strength after six hours.

      Methods (Sourcetracker Analysis): To assess training quality, we evaluated model self-assignment using source samples. We selected the model trained on a dataset rarefied to the read depth of the adult frog sample with the lowest read count (48162 reads), as it showed the best overall self-assignment performance, whereas models trained on datasets rarefied to the lowest overall read depth performed worse. Unlike studies using technical replicates, our source samples represent distinct biological individuals and sampling timepoints, where natural microbiome variability is expected within each source category. Consequently, we considered self-assignment rates above 70% acceptable. All source samples were correctly assigned to their respective categories (Rv, Os, or control), but with varying proportions of reads assigned as 'Unknown'. Adult frog sources were reliably self-identified with high confidence (Os: 97.2% median, IQR = 1.4; Rv: 76.3% median, IQR = 38.1). Adult R. variabilis frogs displayed a higher proportion of 'Unknown' assignments compared to O. sylvatica, likely reflecting greater biological variability among individuals and/or a higher proportion of rare taxa not well captured in the training set. The control tadpole source showed lower self-assignment accuracy (median = 30.5%, IQR = 17.1), as expected given the low microbial biomass of these samples, which resulted in low read depth. Low read depth limits the information available to inform the iterative updating steps in Gibbs sampling and reduces confidence in source assignments. We therefore verified the robustness of our results by performing the second Sourcetracker analysis as described above, training the model only on adult sources and assigning all tadpoles, including low-biomass controls, as sinks. Self-assignment rates for the second training set varied (O. sylvatica: 79.2% median, IQR = 29; R. variabilis: 96.6% median, IQR = 3.7), while results remained consistent across analyses, supporting the reliability of our findings.
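
      The self-assignment check described above amounts to summarizing, per source category, the percentage of reads that each source sample assigns back to its own category. The following is a minimal sketch of that summary step only, not of Sourcetracker itself. The function name, the use of the median for the 70% criterion, the quartile method, and the example percentages are all assumptions made for illustration; they are not values or code from the study.

```python
from statistics import median, quantiles


def summarize_self_assignment(percentages, threshold=70.0):
    """Summarize self-assignment percentages for one source category.

    percentages: per-sample percentage of reads assigned to the sample's
    own source category (hypothetical input format).
    threshold: acceptance cut-off in percent; applying it to the median
    is an assumption, the text only states "above 70%".
    """
    # Quartiles with the "inclusive" method (data treated as the full population).
    q1, _, q3 = quantiles(percentages, n=4, method="inclusive")
    med = median(percentages)
    return {"median": med, "iqr": q3 - q1, "acceptable": med > threshold}


# Hypothetical per-sample values, NOT data from the study.
example_source = [95.0, 96.5, 97.2, 98.0, 98.4]
print(summarize_self_assignment(example_source))
```

      A category whose median falls below the threshold, as reported for the low-biomass control tadpoles, would be flagged by the `acceptable` field and could then be treated as a sink rather than a source, as in the second analysis.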

      (3) Cross-species analysis:

      Like the developmental time series, this analysis doesn't really address the central question of the manuscript. I don't think it is fair for the authors to attribute the difference in diversity to parental care behavior, since the comparison only includes n=2 transporting species and n=1 non-transporting species that differ in many other ways. I would also add that increased diversity is not necessarily an expectation of vertical transmission. The similarity between adults and tadpoles is likely a more relevant outcome for vertical transmission, but the authors did not find any evidence that tadpole-adult similarity was any higher in species with tadpole transport. In fact, tadpoles and adults were more similar in the non-transporting species than in one of the transporting species (lines 296-298), which seems to directly contradict the authors' hypothesis. I don't see this result explained or addressed in the Discussion.

      To address the reviewer’s concerns, we implemented the following changes:  

      Results:

      We rephrased the following sentence from the results part:  

      “These variations may therefore be linked to differing reproductive traits: Af and Rv lay terrestrial egg clutches and transport hatchlings to water, whereas Ll, a non-transporting species, lays eggs directly in water.”

      To read

      “These variations may therefore reflect differences in life history traits among the three species.”

      We moved the information on differing reproductive strategies into the Discussion, where it contributes to a broader context alongside other life history traits that may influence community diversity.

      Discussion (1): We added to our discussion that increased microbial diversity was not an expected outcome of vertical transmission.

      “However, increased microbial diversity is not a known outcome of vertical transmission, and further studies across a broader range of transporting and non-transporting species are needed to assess the role of transport in shaping diversity of tadpole-associated microbial communities.”

      Discussion (2): Likewise, communities associated with adults and tadpoles of transporting species were no more similar than those of non-transporting species. While poison frog tadpoles do acquire caregiver-specific microbes during transport, most of these microbes do not persist on the tadpoles' skin long-term. This pattern can likely be attributed to the capacity of tadpole skin- and gut microbiota to flexibly adapt to environmental changes (Emerson & Woodley, 2024; Santos et al., 2023; Scarberry et al., 2024). It may also reflect the limited compatibility of skin microbiota from terrestrial adults with aquatic habitats or tadpole skin, which differs structurally from that of adults (Faszewski et al., 2008). As a result, many transmitted microbes are probably outcompeted by microbial taxa continuously supplied by the aquatic environment. Interestingly, microbial communities of the non-transporting Ll were more similar to their adult counterparts than those of poison frogs. This pattern might reflect differences in life history among the species. While adult Ll commonly inhabit the rock pools where their tadpoles develop, adults of the two poison frog species visit tadpole nurseries only sporadically for deposition. These differences in habitat use may result in adult Ll hosting skin microbiota that are better adapted to aquatic environments as compared to Rv and Af. Additionally, their presence in the tadpoles’ habitat could make Ll a more consistent source of microbiota for developing tadpoles.

      (4) Field experiment: The rationale and interpretation of the genus-level network are not clear, and the figure is not legible. What does it mean to "visualize the microbial interconnectedness" or to be a "central part of the community"? The previous sentences in this paragraph (lines 337-343) seem to imply that transfer is parent-specific, but the genus-level network is based on the current adult frogs, not the previous generation of parents that transported them. So it is not clear that the distribution or co-distribution of these taxa provides any insight into vertical transmission dynamics.

      Implemented Changes:  

      We appreciate the reviewer’s close reading and understand how the inclusion of the network visualization without further clarification may have led to confusion. To clarify, the network was constructed from all adult frogs in the population, including—but not limited to—the parental frogs examined in the field experiment. We do not make any claims about the origin of the microbial taxa found on parental frogs. Rather, our aim was to illustrate how genera retained on tadpoles (following potential vertical transmission) contribute to the skin microbial communities of adult frogs of this population beyond just the parental individuals. This finding supports the observation that these retained taxa are generally among the most abundant in adult frogs. However, since this information is already presented in Table S8 and the figure is not essential to the main conclusions, we have removed Supplementary Figure S5 and the accompanying sentence: “A genus-level network constructed from 44 adult frogs shows that the retained genera make up a central part of the community of adult Rv in wild populations (Fig. S5).” We have adjusted the Methods section accordingly.

      Reviewer #2:

      I did not find any major weaknesses in my review of this paper. The work here could potentially benefit from absolute abundance levels for shared ASVs between adults and tadpoles to more thoroughly understand the influences of vertical transmission that might be masked by relative abundance counts. This would only be a minor improvement as I think the conclusions from this work would likely remain the same, however.

      In response to the reviewer’s suggestion, we estimated the absolute abundance of specific ASVs for all samples of tadpoles in which Sourcetracker identified shared ASVs between adults and tadpoles. The resulting scaled absolute abundance values (in copies/μL and copies per tadpole) are provided in Table S10, and a description of the method has been incorporated into the revised Methods section of the manuscript. To support the robustness of this approach in our dataset, we additionally designed an ASV-specific system for ASV24902-Methylocella. Candidate primers were assessed for specificity by performing local BLASTn alignments against the full set of ASV sequences identified in the respective microbial communities of tadpoles. We optimized the annealing temperature via gradient PCR and confirmed primer specificity through Sanger sequencing of the PCR product (Forward: 5′–GAGCACGTAGGCGGATCT–3′ Reverse: 5′–GGACTACNVGGGTWTCTAAT–3′). Using this approach, we confirmed that the relative abundance of ASV24902 (18.05% in the amplicon sequencing data) closely matched its proportion of the absolute 16S rRNA copy number in transported tadpole 6 (18.01%). While we intended to quantify all shared ASVs, we were limited to this single target due to insufficient material for optimizing the assays. As this particular ASV was also detected in the water associated with the same tadpole, we chose not to include this confirmation in the manuscript. Nevertheless, the close match supports the reliability of our approach for scaling absolute abundances in this dataset.

      Results: Absolute abundances of shared ASVs likely originating from the parental source pool (as identified by Sourcetracker) after one month of growth ranged from 7804 to 172326 copies per tadpole (Table S10).

      Methods: Quantitative analysis of 16S rRNA copy numbers with digital PCR (dPCR)

      Absolute abundances were estimated for ASVs that were shared between tadpoles after a one-month growth period and their respective caregivers, and for which Sourcetracker analysis identified the caregiver as a likely source of microbiota. We followed the quantitative sequencing framework described by Barlow et al. (2020), measuring total microbial load via digital PCR (dPCR) with the same universal 16S rRNA primers used to amplify the v4 region in our sequencing dataset. Absolute 16S rRNA copy numbers obtained from dPCR were then multiplied by the relative abundances from our amplicon sequencing dataset to calculate ASV-specific scaled absolute abundances. All dPCR reactions were carried out on a QIAcuity Digital PCR System (Qiagen) using Nanoplates with an 8.5K partition configuration, with the following cycling program: 95°C for 2 minutes; 40 cycles of 95°C for 30 seconds, 52°C for 30 seconds, and 72°C for 1 minute; followed by 1 cycle of 40°C for 5 minutes. Reactions were prepared using the QIAcuity EvaGreen PCR Kit (Qiagen, Cat. No. 250111) with 2 µL of DNA template per reaction, following the manufacturer's protocol, and included a negative no-template control and a cleaned and sequenced PCR product as positive control. Samples were measured in triplicate, and serial dilutions were performed to ensure accurate quantification. Data were processed with the QIAcuity Software Suite (v3.1.0.0). The threshold was set based on the negative and positive controls in 1D scatterplots. We report mean copy numbers per microliter with standard deviations, correcting for template input, dPCR reaction volume, and dilution factor. Mean copy numbers per tadpole were additionally calculated by accounting for the DNA extraction (elution) volume.
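
      The scaling arithmetic described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' pipeline. Only the 2 µL template input and the multiplication of total 16S rRNA copies by relative abundance come from the text; the function names, the 12 µL reaction volume, the 10-fold dilution, the 50 µL elution volume, and the measured concentration are hypothetical placeholders.

```python
def copies_per_ul_extract(conc_in_reaction, reaction_volume_ul, template_ul, dilution_factor):
    """Back-calculate total 16S copies per µL of undiluted DNA extract.

    conc_in_reaction: copies/µL measured in the dPCR reaction mix
    (assumed to be the unit reported by the instrument software).
    """
    total_copies_in_reaction = conc_in_reaction * reaction_volume_ul
    # Correct for template input, then for any dilution of the extract.
    return total_copies_in_reaction / template_ul * dilution_factor


def scaled_absolute_abundance(total_copies, relative_abundance):
    """ASV-specific copies = total 16S copies x relative abundance (0 to 1)."""
    return total_copies * relative_abundance


def copies_per_tadpole(copies_per_ul, elution_volume_ul):
    """Scale a per-µL value to the whole extract (elution) volume."""
    return copies_per_ul * elution_volume_ul


# Hypothetical numbers for illustration only.
total_per_ul = copies_per_ul_extract(
    conc_in_reaction=10.0,    # copies/µL in the reaction (assumed)
    reaction_volume_ul=12.0,  # assumed reaction volume
    template_ul=2.0,          # template input stated in the Methods
    dilution_factor=10.0,     # assumed 1:10 dilution
)
asv_per_ul = scaled_absolute_abundance(total_per_ul, relative_abundance=0.1805)
print(total_per_ul, asv_per_ul, copies_per_tadpole(asv_per_ul, elution_volume_ul=50.0))
```

      Keeping the three corrections in separate functions makes it explicit which factor (template input, dilution, or elution volume) drives a given change in the reported copies per tadpole.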

      Recommendations for the authors:

      Reviewer #1:

      (1) Figure 1b summarizes the ddPCR data as a binary (detected/not detected), but this contradicts the main text associated with this figure, which describes bacteria as present, albeit in low abundances, in unhatched embryos (lines 145-147). Could the authors keep the diagram of tadpole development, which I find very useful, but add the ddPCR data from Figure S1c instead of simply binarizing it as present/absent?

      We appreciate the reviewer’s positive feedback on the clarity of the figure. We agree that presenting the ddPCR data in a more quantitative manner provides a more accurate representation of bacterial abundance across developmental stages. In response, we have retained the developmental diagram, as suggested, and replaced the binary (detected/not detected) information in Figure 1B with rounded mean values for each stage. To complement this, we have included mean values and standard deviations in Table S1. The corresponding text in the main manuscript and legends has been revised accordingly to reflect these changes.  

      (2) More information about the foster species, Oophaga sylvatica, would be helpful. Are they sympatric with Rv? Is their transporting behavior similar to that of Rv?

      We thank the reviewer for this helpful comment. In response, we have added further details on the biology and parental care behavior of Oophaga sylvatica, including information on its distribution range. The species does not overlap with Ranitomeya variabilis at the specific study site where the field work was conducted, although the species are sympatric in other countries. These additions have been incorporated into the Methods section under "Study species, reproductive strategies, and life history."  

      (3) Plotting the proportion of each tadpole microbiome attributed to R. variabilis and the proportion attributed to O. sylvatica on the same plot is confusing, as these points are nonindependent and there is no way for the reader to figure out which points originated from the same tadpole. I would suggest replacing Figure 1D with Figure S2C, which (if I understand correctly) displays the same data, but is separated according to source.

      We agree with the reviewer that Figure S2C allows for clearer interpretation of our results. In response, we implemented the suggested change and replaced Figure 1D with the alternative visualization previously shown in Figure S2C, which displays the same data separated by source. To provide readers with a complementary overview of the full dataset, we have retained the original combined plot in the supplementary material as Figure S2D.

      (4) On the first read, I found the use of "transport" in the cross-fostering experiment confusing until I understood that they weren't being transported "to" anywhere in particular, just carried for 6 hours. A change of phrasing might help readers here.

      We acknowledge the reviewer’s concern and have replaced “transported” with “carried” to avoid confusion for readers who may be unfamiliar with the behavioral terminology. However, because “transport” is the term widely used by specialists to describe this behavior, we now introduce it in the context of the experimental design with the following phrasing:

      “For this design, sequence-based surveys of amplified 16S rRNA genes were used to assess the composition of skin-associated microbial communities on tadpoles and their adult caregivers (i.e., the frogs carrying the tadpoles, typically referred to as ‘transporting’ frogs).”

      (5) "Horizontal transfer" typically refers to bacteria acquired from other hosts, not environmental source pools (line 394).

      We addressed this concern by rephrasing the sentence in the Discussion to avoid potential confusion. The revised text now reads:

      “Across species, newborns might acquire bacteria not only through transfer from environmental source pools and other hosts (…)”  

      (6) The authors suggest that tadpole transport may have evolved in Rv and Af to promote microbial diversity because "increased microbial diversity is linked to better health outcomes" (lines 477-479). It is often tempting to assume that more diversity is always better/more adaptive, but this is not universally true. The fact that the Ll frogs seem to be doing fine in the same environment despite their lower microbiome diversity suggests that this interpretation might be too far of a reach based on the data here.

      We appreciate the reviewer’s concern, agree that increased microbial diversity is not inherently advantageous and have revised the paragraph to make this clearer.  

      “While increased microbial diversity is not inherently advantageous, it has been associated with beneficial outcomes such as improved immune function, lower disease risk, and enhanced fitness in multiple other vertebrate systems.”

      However, rather than claiming that greater diversity is always advantageous, we suggest that this possibility should not be excluded and consider it a relevant aspect of a comprehensive discussion. We also note that whether poison frog tadpoles perform equally well with lower microbial diversity remains an open question. Drawing such conclusions would require experimental validation and cannot be inferred from comparisons with an evolutionarily distant species that differs in life history.

      Reviewer #2:

      (1) Figure 2: Are the data points in C a subset (just the tadpoles for each species) of B? The numbers look a little different between them. The number of observed ASVs in panel B for Rv look a bit higher than the observed ASVs in panel C.

      The data shown in panel C are indeed a subset of the samples presented in panel B, focusing specifically on tadpoles of each species. The slight differences in the number of observed ASVs between panels result from differences in rarefaction depth between comparisons: due to variation in sequencing depth across species and life stages, we performed rarefaction separately for each comparison in order to retain the highest number of taxa while ensuring comparability within each group. Although we acknowledge that this is not a standard approach, we found that results were consistent when rarefying across the full dataset, but chose the presented approach to better accommodate variation in our sample structure. This methodological detail is described in the Methods section:

      “All alpha diversity analyses were conducted with datasets rarefied to 90% of the read number of the sample with the fewest reads in each comparison and visualized with boxplots.”

      It is also noted in the figure legend: “The dataset was separately rarefied to the lowest read depth of each comparison.” We hope this clarification adequately addresses the reviewer’s concern and therefore have not made additional changes.
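
      Why the same tadpole samples can show different observed ASV counts in two panels becomes clearer with a small sketch of per-comparison rarefaction. The code below is an assumption-laden illustration, not the authors' code: the function names, the seeded random subsampling without replacement, and the toy counts are hypothetical, and only the "90% of the read number of the sample with the fewest reads in each comparison" rule is taken from the quoted Methods sentence.

```python
import random
from collections import Counter


def rarefaction_depth(samples, fraction=0.9):
    """Depth = fraction of the read count of the shallowest sample in this comparison."""
    min_reads = min(sum(counts.values()) for counts in samples.values())
    return int(min_reads * fraction)


def rarefy(counts, depth, seed=0):
    """Draw `depth` reads without replacement from one sample's ASV counts."""
    pool = [asv for asv, n in counts.items() for _ in range(n)]
    return Counter(random.Random(seed).sample(pool, depth))


def observed_asvs(counts):
    """Number of ASVs with at least one read."""
    return sum(1 for n in counts.values() if n > 0)


# Toy comparison with made-up counts (not data from the study).
comparison = {
    "tadpole_1": {"ASV_a": 50, "ASV_b": 30, "ASV_c": 20},
    "tadpole_2": {"ASV_a": 150, "ASV_b": 40, "ASV_d": 10},
}
depth = rarefaction_depth(comparison)
for name, counts in comparison.items():
    print(name, depth, observed_asvs(rarefy(counts, depth)))
```

      Because the depth is recomputed for each comparison, adding a shallower sample to a comparison lowers the depth for every sample in it, which can reduce the number of rare ASVs retained and thus the observed richness.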

      (2) Lines 304-305: in the Figure 4B plot, there appear to be 12 transported tadpoles and 8 non-transported tadpoles.

      Thank you for catching this. We have corrected the plot and the associated statistics (alpha and beta diversity) in the results section as well as in the figure. Importantly, the correction did not affect any other results, and the overall findings and interpretations remain unchanged.  

      (3) Line 311: I think this should be Figure 4B.

      (4) Line 430: tadpole transport.

      (5) Line 431: I believe commas need to surround this phrase "which range from a few hours to several days depending on the species (Lötters et al., 2007; McDiarmid & Altig, 1999; Pašukonis et al., 2019)".

      We thank the reviewer for the thorough review and have corrected all typographical and formatting errors noted in comments (3) – (5).

    1. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations for the authors): 

      One minor question would be whether the authors could expand more on the application of END-Seq to examine the processive steps of the ALT mechanism? Can they speculate if the ssDNA detected in ALT cells might be an intermediate generated during BIR (i.e., is the ssDNA displaced strand during BIR) or a lesion? Furthermore, have the authors assessed whether ssDNA lesions are due to the loss of ATRX or DAXX, either of which can be mutated in the ALT setting?

      We appreciate the reviewer’s insightful questions regarding the application of our assays to investigate the nature of the ssDNA detected in ALT telomeres. Our primary aim in this study was to establish the utility of END-seq and S1-END-seq in telomere biology and to demonstrate their applicability across both ALT-positive and -negative contexts. We agree that exploring the mechanistic origins of ssDNA would be highly informative, and we anticipate that END-seq–based approaches will be well suited for such future studies. However, it remains unclear whether the resolution of S1-END-seq is sufficient to capture transient intermediates such as those generated during BIR. We have now included a brief speculative statement in the revised discussion addressing the potential nature of ssDNA at telomeres in ALT cells.

      Reviewer #2 (Recommendations for the authors):

      How can we be sure that all telomeres are equally represented? The authors seem to assume that END-seq captures all chromosome ends equally, but can we be certain of this? While I do not see an obvious way to resolve this experimentally, I recommend discussing this potential bias more extensively in the manuscript.

      We thank the reviewer for raising this important point. END-seq and S1-END-seq are unbiased methods designed to capture either double-stranded or single-stranded DNA that can be converted into blunt-ended double-stranded DNA and ligated to a capture oligo. As such, if a subset of telomeres cannot be processed using this approach, it is possible that these telomeres may be underrepresented or lost. However, to our knowledge, there are no proposed telomeric structures that would prevent capture using this method. For example, even if a subset of telomeres possesses a 5′ overhang, it would still be captured by END-seq. Indeed, we observed the consistent presence of the 5′-ATC motif across multiple cell lines and species (human, mouse, and dog). More importantly, we detected predictable and significant changes in sequence composition when telomere ends were experimentally altered, either in vivo (via POT1 depletion) or in vitro (via T7 exonuclease treatment). Together, these findings support the robustness of the method in capturing a representative and dynamic view of telomeres across different systems.

      That said, we have now included a brief statement in the revised discussion acknowledging that we cannot fully exclude the possibility that a subset of telomeres may be missed due to unusual or uncharacterized structures.

      I believe Figures 1 and 2 should be merged.

      We appreciate the reviewer’s suggestion to merge Figures 1 and 2. However, we feel that keeping them as separate figures better preserves the logical flow of the manuscript and allows the validation of END-seq and its application to be presented with appropriate clarity and focus. We hope the reviewer agrees that this layout enhances the clarity and interpretability of the data.

      Scale bars should be added to all microscopy figures.

      We thank the reviewer for pointing this out. We have now added scale bars to all the microscopy panels in the figures and included the scale details in the figure legends.

      Reviewer #3 (Recommendations for the authors):

      Overall, the discussion section is lacking depth and should be expanded and a few additional experiments should be performed to clarify the results.

      We thank the reviewer for the suggestions. Based on this reviewer’s comments and comments from the other reviewers, we incorporated several points into the discussion. As a result, we hope that we have provided additional depth to our conclusions.

      (1) The finding that the abundance of variant telomeric repeats (VTRs) within the final 30 nucleotides of the telomeric 5' ends is similar in both telomerase-expressing and ALT cells is intriguing, but the authors do not address this result. Could the authors provide more insight into this observation and suggest potential explanations? As the frequency of VTRs does not seem to be upregulated in POT1-depleted cells, what then drives the appearance of VTRs on the C-strand at the very end of telomeres? Is the CST-Polα complex responsible?

      The reviewer raises a very interesting and relevant point. We are hesitant at this point to speculate on why we do not see a difference in variant repeats in ALT versus non-ALT cells, since additional data would be needed. One possibility is that variant repeats in ALT cells accumulate stochastically within telomeres but are selected against when they are present at the terminal portion of chromosome ends. However, to prove this hypothesis, we would need error-free long-read technology combined with END-seq. We feel that developing this approach would be beyond the scope of this manuscript.

      (2) The authors also note that, in ALT cells, the frequency of VTRs in the first 30 nucleotides of the S1-END-SEQ reads is higher compared to END-SEQ, but this finding is not discussed either. Do the authors think that the presence of ssDNA regions is associated with the VTRs? Along this line, what is the frequency of VTRs in the END-SEQ analysis of TRF1-FokI-expressing ALT cells? Is it also increased? Has TRF1-FokI been applied to telomerase-expressing cells to compare VTR frequencies at internal sites between ALT and telomerase-expressing cells?

      Similarly to what is discussed above, short reads have the advantage of being very accurate but do not provide sufficient length to establish the relative frequency of VTRs across the whole telomere sequence. The TRF1-FokI experiment is a good suggestion, but it would still be biased toward non-variant repeats due to the TRF1-binding properties. We plan to address these questions in a future study involving long-read sequencing and END-seq capture of telomeres.

      Finally, in these experiments (S1-END-SEQ or END-SEQ in TRF1-Fok1), is the frequency of VTRs the same on both the C- and the G-rich strands? It is possible that the sequences are not fully complementary in regions where G4 structures form.

      We thank the reviewer for this observation. While we do observe a higher frequency of variant telomeric repeats (VTRs) in the first 30 nucleotides of S1-END-seq reads compared to END-seq in ALT cells, we are currently unable to determine whether this difference is significant, as an appropriate control or matched normalization strategy for this comparison is lacking. Therefore, we refrain from overinterpreting the biological relevance of this observation.

      (3) Based on the ratio of C-rich to G-rich reads in the S1-END-SEQ experiment, the authors estimate that ALT cells contain at least 3-5 ssDNA regions per chromosome end. While the calculation is understandable, this number could be discussed further to consider the possibility that the observed ratios (of roughly 0.5) might result from the presence of extrachromosomal DNA species, such as C-circles. The observed increase in the ratio of C-rich to G-rich reads in BLM-depleted cells supports this hypothesis, as BLM depletion suppresses C-circle formation in U2OS cells. To test this, the authors should examine the impact of POLD3 depletion on the C-rich/G-rich read ratio. Alternatively, they could separate high-molecular-weight (HMW) DNA from low-molecular-weight DNA in ALT cells and repeat the S1-END-SEQ in the HMW fraction.

      The reviewer is absolutely correct. Our calculation did not exclude the possibility of extrachromosomal DNA as a source of telomeric ssDNA. We have now addressed this point in our discussion.

      (4) What is the authors' perspective on the presence of ssDNA at ALT telomeres? Do they attribute this to replication stress? It would be helpful for the authors to repeat the S1-END-SEQ in telomerase-expressing cells with very long telomeres, such as HeLa1.3 cells, to determine if ssDNA is a specific feature of ALT cells or a result of replication stress. The increased abundance of G4 structures at telomeres in HeLa1.3 cells (as shown in J. Wong's lab) may indicate that replication stress is a factor. Similar to Wong's work, it would be valuable to compare the C-rich/G-rich read ratios in HeLa1.3 cells to those in ALT cells with similar telomeric DNA content.

      The reviewer is correct in pointing out that we still do not know what causes ssDNA at telomeres in ALT cells. Replication stress seems the most logical explanation based on the work of many labs in the field. However, our data did not reveal any significant difference in the levels of ssDNA at telomeres in non-ALT cells based on telomere length. We used the HeLa1.2.11 cell line (now clarified in the Materials section), which is the parental line of HeLa1.3 and has similarly long telomeres (~20 kb vs. ~23 kb). Despite their long telomeres and potential for replication-associated challenges such as G-quadruplex formation, HeLa1.2.11 cells did not exhibit the elevated levels of telomeric ssDNA that we observed in ALT cells (Figure 4B). Additional experiments are needed to map the occurrence of ssDNA at telomeres in relation to progression toward ALT.

      Finally, Reviewer #3 raises a list of minor points:

      (1) The Y-axes of Figure 4 have been relabeled to account for the G-strand reads.

      (2) Statistical analyses have been added to the figures where applicable.

      (3) The manuscript has been carefully proofread to improve clarity and consistency throughout the text and figure legends.

      (4) We have revised the text to address issues related to the lack of cross-referencing between the supplementary figures and their corresponding legends.

    1. Author Response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review): 

      Summary: 

      Genome-wide association studies have been an important approach to identifying the genetic basis of human traits and diseases. Despite their successes, for many traits, a substantial amount of variation cannot be explained by genetic factors, indicating that environmental variation and individual 'noise' (stochastic differences as well as unaccounted for environmental variation) also play important roles. The authors' goal was to address whether gene expression variation in genetically identical individuals, driven by historical environmental differences and 'noise', could be used to predict reproductive trait differences. 

      Strengths: 

      To address this question, the authors took advantage of genetically identical C. elegans individuals to transcriptionally profile 180 adult hermaphrodite individuals that were also measured for two reproductive traits. A major strength of the paper is its experimental design. While experimenters aim to control the environment that each worm experiences, it is known that there are small differences that each worm experiences even when they are grown together on the same agar plate - e.g. the age of their mother, their temperature, the amount of food they eat, and the oxygen and carbon dioxide levels depending on where they roam on the plate. Instead of neglecting this unknown variation, the authors design the experiment up front to create two differences in the historical environment experienced by each worm: 1) the age of its mother and 2) an 8-hour temperature difference, either 20 or 25 °C. This helped the authors interpret the gene expression differences and trait expression differences that they observed. 

      Using two statistical models, the authors measured the association of gene expression for 8824 genes with the two reproductive traits, considering both the level of expression and the historical environment experienced by each worm. Their data supports several conclusions. They convincingly show that gene expression differences are useful for predicting reproductive trait differences, predicting ~25-50% of the trait differences depending on the trait. Using RNAi, they also show that the genes they identify play a causal role in trait differences. Finally, they demonstrate an association with trait variation and the H3K27 trimethylation mark, suggesting that chromatin structure can be an important causal determinant of gene expression and trait variation. 

      Overall, this work supports the use of gene expression data as an important intermediate for understanding complex traits. This approach is also useful as a starting point for other labs in studying their trait of interest. 

      We thank the reviewer for their thorough articulation of the strengths of our study.  

      Weaknesses: 

      There are no major weaknesses that I have noted. Some important limitations of the work (that I believe the authors would agree with) are worth highlighting, however: 

      (1) A large remaining question in the field of complex traits remains in splitting the role of non-genetic factors between environmental variation and stochastic noise. It is still an open question which role each of these factors plays in controlling the gene expression differences they measured between the individual worms. 

      Yes, we agree that this is a major question in the field. In our study, we parse out differences driven by known historical environmental factors versus unknown factors, but the ‘unknown factors’ could encompass both unknown environmental factors and stochastic noise.

      (2) The ability of the authors to use gene expression to predict trait variation was strikingly different between the two traits they measured. For the early brood trait, 448 genes were statistically linked to the trait difference, while for egg-laying onset, only 11 genes were found. Similarly, the total R2 in the test set was ~50% vs. 25%. It is unclear why the differences occur, but this somewhat limits the generalizability of this approach to other traits. 

      We agree that the difference in predictability between the two traits is interesting. A previous study from the Phillips lab measured developmental rate and fertility across Caenorhabditis species and parsed sources of variation (1). Results indicated that 83.3% of variation in developmental rate was explained by genetic variation, while only 4.8% was explained by individual variation. In contrast, for fertility, 63.3% of variation was driven by genetic variation and 23.3% was explained by individual variation. Our results, of course, focus only on predicting the individual differences, but not genetic differences, for these two traits using gene expression data. Considering both sets of results, one hypothesis is that we have more power to explain nongenetic phenotypic differences with molecular data if the trait is less heritable, which is something that could be formally interrogated with more traits across more strains.

      (3) For technical reasons, this approach was limited to whole worm transcription. The role of tissue and cell-type expression differences is important to the field, so this limitation is important.

      We agree with this assessment, and it is something we hope to address with future work.

      Reviewer #2 (Public review): 

      Summary: 

      This paper measures associations between RNA transcript levels and important reproductive traits in the model organism C. elegans. The authors go beyond determining which gene expression differences underlie reproductive traits, but also (1) build a model that predicts these traits based on gene expression and (2) perform experiments to confirm that some transcript levels indeed affect reproductive traits. The clever study design allows the authors to determine which transcript levels impact reproductive traits, and also which transcriptional differences are driven by stochastic vs environmental differences. In sum, this is a rather comprehensive study that highlights the power of gene expression as a driver of phenotype, and also teases apart the various factors that affect the expression levels of important genes. 

      Strengths: 

      Overall, this study has many strengths, is very clearly communicated, and has no substantial weaknesses that I can point to. One question that emerges for me is about the extent to which these findings apply broadly. In other words, I wonder whether gene expression levels are predictive of other phenotypes in other organisms. I think this question has largely been explored in microbes, where some studies (PMID: 17959824) but not others (PMID: 38895328) find that differences in gene expression are predictive of phenotypes like growth rate. Microbes are not the primary focus here, and instead, the discussion is mainly focused on using gene expression to predict health and disease phenotypes in humans. This feels a little complicated since humans have so many different tissues. Perhaps an area where this approach might be useful is in examining infectious single-cell populations (bacteria, tumors, fungi). But I suppose this idea might still work in humans, assuming the authors are thinking about targeting specific tissues for RNAseq.

      In sum, this is a great paper that really got me thinking about the predictive power of gene expression and where/when it could inform about (health-related) phenotypes. 

      We thank the reviewer for recognizing the strengths of our study. We are also interested in determining the extent to which predictive gene expression differences operate in specific tissues.

      Reviewer #3 (Public review): 

      Summary: 

      Webster et al. sought to understand if phenotypic variation in the absence of genetic variation can be predicted by variation in gene expression. To this end, they quantified two reproductive traits, the onset of egg laying and early brood size, in cohorts of genetically identical nematodes exposed to alternative ancestral (two maternal ages) and same-generation life histories (either constant 20°C temperature or 8-hour temperature shift to 25°C upon hatching) in a two-factor design; then they profiled genome-wide gene expression in each individual.

      Using multiple statistical and machine learning approaches, they showed that, at least for early brood size, phenotypic variation can be quite well predicted by molecular variation, beyond what can be predicted by life history alone. 

      Moreover, they provide some evidence that expression variation in some genes might be causally linked to phenotypic variation. 

      Strengths: 

      (1) Cleverly designed and carefully performed experiments that provide high-quality datasets useful for the community. 

      (2) Good evidence that phenotypic variation can be predicted by molecular variation. 

      We thank the reviewer for recognizing the strengths of our study.

      Weaknesses:  

      What drives the molecular variation that impacts phenotypic variation remains unknown. While the authors show that variation in expression of some genes might indeed be causal, it is still not clear how much of the molecular variation is a cause rather than a consequence of phenotypic variation. 

      We agree that the drivers of molecular variation remain unknown. While we addressed one potential candidate (histone modifications), there is much to be done in this area of research. We agree that, while some gene expression differences cause phenotypic changes, other gene expression differences could in principle be downstream of phenotypic differences.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      I have a number of suggestions that I believe will improve the Methods section. 

      (1) Strain N2-PD1073 will probably be confusing to some readers. I recommend spelling out that this is the Phillips lab version of N2.

      Thank you for this suggestion; we have added additional explanation of this strain in the Methods.

      (2) I found the details of the experimental design confusing, and I believe a supplemental figure will help. I have listed the following points that could be clarified: 

      a. What were the biological replicates? How many worms per replicate?

      Biological replicates were defined as experiments set up on different days (in this case, all biological replicates were at least a week apart), and the biological replicate of each worm can be found in Supplementary File 1 on the Phenotypic Data tab.

      b. I believe that embryos and L4s were picked to create different aged P0s, and eggs and L4s were picked to separate plates? Is this correct?

      Yes, this is correct.

      c. What was the spread in the embryo age?

      We assume this is asking about the age of the F1 embryos, and these were laid over the course of a 2-hour window.  

      d. While the age of the parents is different, there are also features about their growth plates that will be impacted by the experimental design. For example, their pheromone exposure is different due to the role that age plays in the combination of ascarosides that are released. It is worth noting as my reading of the paper makes it seem that parental age is the only thing that matters.

      The parents (P0) of different ages likely have differential ascaroside exposure because they are in the vicinity of other similarly aged worms, but the F1 progeny were exposed to their parents for only the 2-hour egg-laying window, in an attempt to minimize this type of effect as much as possible.  

      e. Were incubators used for each temperature?

      Yes.

      f. In line 443, why approximately for the 18 hours? How much spread?

      The approximation was based on the time interval between the 2-hour egg-laying window on Day 4 and the temperature shift on Day 5 the following morning. The timing was within 30 minutes of 18 hours in either direction.

      g.  In line 444, "continually left" is confusing. Does this mean left in the original incubator?

      Yes, this means left in the incubator while the worms shifted to 25°C were moved. To avoid confusion, we re-worded this to state they “remained at 20°C while the other half were shifted to 25°C”.

      h. In line 445, "all worms remained at 20 °C" was confusing to me as to what it indicated. I assume, unless otherwise noted, the animals would not be moved to a new temperature.

      This was an attempt to avoid confusion and emphasize that all worms were experiencing the same conditions for this part of the experiment.  

      i. What size plates were the worms singled onto?

      They were singled onto 6-cm plates.

      j. If a figure were to be made, having two timelines (with respect to the P0 and F1) might be useful.

      We believe the methods should be sufficient for someone who hopes to repeat the experiment, and we believe the schematic in Figure 1A labeling P0 and F1 generations is sufficient to illustrate the key features of the experimental design.

      k. Not all eggs that are laid end up hatching. Are these censored from the number of progeny calculations?

      Yes, only progeny that hatched and developed were counted for early brood.

      (3) For the lysis, was the second transfer to dH20 also a wash step?

      Yes.

      (4) What was used for the Elution buffer?

      We used elution buffer consisting of 10 mM Tris, 0.1 mM EDTA. We have added this to the “Cell lysate generation” section of the methods.

      (5) The company that produced the KAPA mRNA-seq prep kit should be listed.

      We added that the kit was from Roche Sequencing Solutions.

      (6) For the GO analysis - one potential issue is that the set of 8824 genes might also be restricted to specific GO categories. Was this controlled for?

      We originally did not explicitly control for this and used the default enrichGO settings with OrgDB = org.Ce.eg.db as the background set for C. elegans. We have now repeated the analysis with the “universe” set to the 8824-gene background set. This did not qualitatively change the significant GO terms, though some have slightly higher or lower p-values. For comparison purposes, we have added the background-corrected sets to the GO_Terms tab of Supplementary File 1 with each of the three main gene groups appended with “BackgroundOf8824”.

      Reviewer #2 (Recommendations for the authors): 

      (1) The abstract, introduction, and experimental design are well thought through and very clear.

      Thank you.

      (2) Figure 1B could use a clearer or more intuitive label on the horizontal axis. The two examples help. Maybe the genes (points) on the left side should be blue to match Figure 1C, where the genes with a negative correlation are in the blue cluster.

      Thank you for these suggestions. We re-labeled the x-axis as “Slope of early brood vs. gene expression (normalized by CPM)”, which we hope gives readers a better intuition of what the coefficient from the model is measuring. We also re-colored the points previously colored red in Figure 1B to be color-coded depending on the direction of association to match Figure 1C, so these points are now color-coded as pink and purple.  

      (3) If red/blue are pos/neg correlated genes in 1C, perhaps different colors should be used to label ELO and brood in Figures 2 and 3. Green/purple?

      We appreciate this point, but since we ended up using the cluster colors of pink and purple in Figure 1, we opted to leave Figures 2 and 3 alone with the early brood and ELO color-coding of red and blue.

      (4) I am unfamiliar with this type of beta values, but I thought the explanation and figure were very clear. It could be helpful to bold beta1 and beta2 in the top panels of Figure 2, so the readers are not searching around for those among all the other betas. It could also be helpful to add an English phrase to the vertical axes in Figures 2C and 2D, in addition to the beta1 and beta2. Something like "overall effect (beta1)" and "environment-controlled effect (beta2)". Or maybe "effect of environment + stochastic expression differences (beta1)" and "effect of stochastic expression differences alone (beta2)". I guess those are probably too big to fit on the figure, but it might be nice to have a label somewhere on this figure connecting them to the key thing you are trying to measure - the effect of gene expression and environment.

      Thank you for these suggestions. We increased the font sizes and bolded β1 and β2 in Figure 2A-B. In Figure 2C-D, we added a parenthetical under β1 to say “(env + noise)” and β2 to say “(noise)”. We agree that this should give the reader more intuition about what the β values are measuring.  

      Reviewer #3 (Recommendations for the authors): 

      The authors collected individuals 24 hours after the onset of egg laying for transcriptomic profiling. This is a well-designed experiment to control for the physiological age of the germline. However, this does not properly control for somatic physiological age. Somatic age can be partially uncoupled from germline age across individuals, and indeed, this can be due to differences in maternal age (Perez et al, 2017). This is because maternal age is associated with increased pheromone exposure (unless you properly controlled for it by moving worms to fresh plates), which causes a germline-specific developmental delay in the progeny, resulting in a delayed onset of egg production compared to somatic development (Perez et al. 2021). You control for germline age, therefore, it is likely that the progeny of day 1 mothers are actually somatically older than the progeny of day 3 mothers. This would predict that many genes identified in these analyses might just be somatic genes that increase or decrease their expression during the young adult stage. 

      For example, the abundance of collagen genes among the genes negatively associated (including col-20, which is the gene most significantly associated with early brood) is a big red flag, as collagen genes are known to be changing dynamically with age. If variation in somatic vs germline age is indeed what is driving the expression variation of these genes, then the expectation is that their expression should decrease with age. Vice versa, genes positively associated with early brood that are simply explained by age should be increasing.  So I would suggest that the authors first check this using time series transcriptomic data covering the young adult stage they profiled. If this is indeed the case, I would then suggest using RAPToR ( https://github.com/LBMC/RAPToR ), a method that, using reference time series data, can estimate physiological age (including tissue-specific one) from gene expression. Using this method they can estimate the somatic physiological age of their samples, quantify the extent of variation in somatic age across individuals, quantify how much of the observed differences in expressions are explained just by differences in somatic age and correct for them during their transcriptomic analysis using the estimated soma age as a covariate (https://github.com/LBMC/RAPToR/blob/master/vignettes/RAPToR-DEcorrection-pdf.pdf). 

      This should help enrich a molecular variation that is not simply driven by hidden differences between somatic and germline age. 

      To first address some of the experimental details mentioned for our paper, parents were indeed moved to fresh plates where they were allowed to lay embryos for two hours and then removed. Thus, we believe this minimizes the effects of ascarosides as much as possible within our design. As shown in the paper, we also identified genes that were not driven by parental age and for all genes quantified to what extent each gene’s association was driven by parental age. Thus, it is unlikely that differences in somatic and germline age are the sole explanatory factor, even if they play some role. We also note that we accounted for egg-laying onset timing in our experimental design, and early brood was calculated as the number of progeny laid in the first 24 hours of egg-laying, where egg-laying onset was scored for each individual worm to the hour. The plot of each worm’s ELO and early brood traits is in Figure S1. Nonetheless, we read the RAPToR paper with interest, as we highlighted in the paper that germline genes tend to be positively associated with early brood while somatic genes tend to be negatively associated. While the RAPToR paper discusses using tissue-specific gene sets to stage genetically diverse C. elegans RILs, the RAPToR reference itself was not built using gene expression data acquired from different C. elegans tissues and is based on whole worms, typically collected in bulk. I.e., age estimates in RILs differ depending on whether germline or somatic gene sets are used to estimate age when the aging clock is based on N2 samples. Thus, it is unclear whether such an approach would work similarly to estimate age in single worm N2 samples. In addition, from what we can tell, the RAPToR R package appears to implement the overall age estimate, rather than using the tissue-specific gene sets used for RILs in the paper. Because RAPToR would be estimating the overall age of our samples using a reference that is based on fewer samples than we collected here, and because we already know the overall age of our samples measured using standard approaches, we believe that estimating the age with the package would not give very much additional insight.

      Bonferroni correction: 

      First, I think there is some confusion in how the authors report their p-values: I don't think the authors are using a cut-off of Bonferroni corrected p-value of 5.7 × 10⁻⁶ (it wouldn't make sense). It's more likely that they are using a Bonferroni corrected p of 0.05 or 0.1, which corresponds to a nominal p value of 5.7 × 10⁻⁶, am I right?

      Yes, we used a nominal p-value of 5.7 × 10⁻⁶ to correspond to a Bonferroni-corrected p-value of 0.05, calculated as 0.05/8824. We have re-worded this wherever Bonferroni correction was mentioned.
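
      The threshold arithmetic quoted above can be checked with a minimal Python sketch. The variable and function names are illustrative assumptions and are not taken from the study's analysis code; only the values 0.05 and 8824 come from the text.

```python
# Bonferroni correction: divide the family-wise alpha by the number of tests.
alpha = 0.05      # family-wise significance level stated in the response
n_genes = 8824    # number of genes tested, as stated in the response

nominal_threshold = alpha / n_genes
print(f"{nominal_threshold:.2e}")  # prints 5.67e-06, i.e. roughly 5.7 x 10^-6


def is_significant(p_nominal, alpha=alpha, n_tests=n_genes):
    """Return True if a nominal p-value passes the Bonferroni threshold."""
    return p_nominal <= alpha / n_tests
```

      Equivalently, one could multiply each nominal p-value by 8824 and compare the result with 0.05; the two formulations select the same genes.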

      Second, Bonferroni is an overly stringent correction method that has now been substituted by the more powerful Benjamini-Hochberg method to control the false discovery rate. Using this might help find more genes and better characterize the molecular variation, especially the one associated with ELO?

      We agree that Bonferroni is quite stringent and because we were focused on identifying true positives, we may have some false negatives. Because all nominal p-values are included in the supplement, it is straightforward for an interested reader to search the data to determine if a gene is significant at any other threshold.   
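
      For readers who want to re-threshold the supplementary p-values themselves, the difference between the two corrections discussed here can be sketched as follows. This is a generic illustration of Bonferroni versus the Benjamini-Hochberg step-up procedure on made-up p-values, not a reproduction of the study's analysis; the function names and the example values are assumptions.

```python
def bonferroni_hits(p_values, alpha=0.05):
    """Flag p-values that pass the Bonferroni threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_hits(p_values, alpha=0.05):
    """Flag p-values rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    hits = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            hits[idx] = True
    return hits


# Hypothetical nominal p-values, for illustration only.
p_values = [0.001, 0.008, 0.012, 0.030, 0.200]
print(sum(bonferroni_hits(p_values)))          # 2
print(sum(benjamini_hochberg_hits(p_values)))  # 4
```

      On this toy input the false-discovery-rate procedure flags more tests than Bonferroni, which is the trade-off between false negatives and false positives described above.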

      Minor comments: 

      (1) "In our experiment, isogenic adult worms in a common environment (with distinct historical environments) exhibited a range of both ELO and early brood trait values (Fig S1A)" I think this and the figure is not really needed, Figure S1B is already enough to show the range of the phenotypes and how much variation is driven by the life history traits.

      We agree that the information in S1A is also included in S1B, but we think it is a little more straightforward if one is primarily interested in viewing the distribution for a single trait.

      (2) Line 105 It should be Figure S2, not S3.

      Thank you for catching this mistake.

      (3) Gene Ontology on positive and negatively associated genes together: what about splitting the positive and negative?

      We have added a split of positive and negative GO terms to the GO_Terms tab of Supplementary File 1. Broadly speaking, the most enriched positively associated genes have many of the same GO terms found on the combined list that are germline related (e.g., involved in oogenesis and gamete generation), whereas the most enriched negatively associated genes have GO terms found on the combined list that are related to somatic tissues (e.g., actin cytoskeleton organization, muscle cell development). This is consistent with the pattern we see for somatic and germline genes shown in Figure 4.

      (4) A lot of muscle-related GOs, can you elaborate on that?

      Yes, there are several muscle-related GOs in addition to germline and epidermis. While we do not know exactly why from a mechanistic perspective these muscle-related terms are enriched, it may be important to note that many of these terms have highly overlapping sets of genes which are listed in Supplementary File 1. For example, “muscle system process” and “muscle contraction” have the exact same set of 15 genes causing the term to be significantly enriched. Thus, we tend to not interpret having many GO terms on a given tissue as indicating that the tissue is more important than others for a given biological process. While it is clear there are genes related to muscle that are associated with early brood, it is not yet clear that the tissue is more important than others.  

      (5) "consistent with maternal age affecting mitochondrial gene expression in progeny " - has this been previously reported?

      We do not believe this particular observation has been reported. It is important to note that these genes are involved in mitochondrial processes, but are expressed from the nuclear rather than mitochondrial genome. We re-worded the quoted portion of the sentence to say “consistent with parental age affecting mitochondria-related gene expression in progeny”.

      (6) PCA: "Therefore, the optimal number of PCs occurs at the inflection points of the graph, which is after only 7 PCs for early brood (R2 of 0.55) but 28 PCs for ELO (R2 of 0.56)."

      Not clear how this is determined: just graphically? If yes, there are several inflection points in the plot. How did you choose which one to consider? Also, a smaller component is not necessarily less predictive of phenotypic variation (as you can see from the graph), so instead of subsequently adding components based on the variance, they explain the transcriptomic data, you might add them based on the variance they explain in the phenotypic data? To this end, have you tried partial least square regression instead of PCA? This should give gene expression components that are ranked based on how much phenotypic variance they explain.  

      Thank you for this thoughtful comment. We agree that, unlike for Figure 3B, there is some interpretation involved on how many PCs is optimal because additional variance explained with each PC is not strictly decreasing beyond a certain number of PCs. Our assessment was therefore made both graphically and by looking at the additional variance explained with each additional PC. For example, for early brood, there was no PC after PC7 that added more than 0.04 to the R2. We could also have plotted early brood and ELO separately and had a different ordering of PCs on the x-axis. By plotting the data this way, we emphasized that the factors that explain the most variation in the gene expression data typically explain most variation in the phenotypic data.  
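
      The numerical part of this criterion (no later PC adding more than 0.04 to the R2) can be written as a small rule. The sketch below is one possible reading of that rule, not the procedure actually used in the paper; the function name and the cumulative R2 values are hypothetical and chosen only to illustrate the logic.

```python
def last_informative_pc(cumulative_r2, min_gain=0.04):
    """Return the number of PCs after which no single added PC raises R2
    by more than min_gain. cumulative_r2[i] is the R2 using PCs 1..i+1."""
    last = 0
    previous = 0.0
    for n_pcs, r2 in enumerate(cumulative_r2, start=1):
        if r2 - previous > min_gain:
            last = n_pcs  # this PC still adds a sizeable gain
        previous = r2
    return last


# Hypothetical cumulative R2 values for 10 PCs (illustration only).
example = [0.20, 0.28, 0.30, 0.38, 0.40, 0.45, 0.55, 0.56, 0.57, 0.58]
print(last_informative_pc(example))  # 7
```

      Note that this rule tolerates small gains in the middle of the sequence (PCs 3 and 5 in the toy data), which is why it can differ from picking the first inflection point by eye.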

      (7) The fact that there are 7 PC of molecular variation that explain early brood is interesting. I think the authors can analyze this further. For example, could you perform separate GO enrichment for each component that explains a sizable amount of phenotypic variance? Same for the ELO.  

      Because each gene has a PC loading for each PC, and each PC lacks the explanatory power of combined PCs, we believe doing GO Terms on the list of genes that contribute most to each PC is of minimal utility. The power of the PCA prediction approach is that it uses the entire transcriptome, but the other side of the coin is that it is perhaps less useful to do a gene-by-gene based analysis with PCA. This is why we separately performed individual gene associations and 10-gene predictive analyses. However, we have added the PC loadings for all genes and all PCs to Supplementary File 1.

      (8) Avoid acronyms when possible (i.e. ELO in figures and figure legends could be spelled out to improve readability).

      We appreciate this point, but because we introduced the acronym both in Figure 1 and the text and use it frequently, we believe the reader will understand this acronym. Because it is sometimes needed (especially in dense figures), we think it is best to use it consistently throughout the paper.

      (9) Multiple regression: I see the most selected gene is col-20, which is also the most significantly differentially expressed from the linear mixed model (LMM). But what is the overlap between the top 300 genes in Figure 3F and the 448 identified by the LMM? And how much is the overlap in GO enrichment?

      Genes that showed up in at least 4 out of 500 iterations were selected more often than expected by chance, which includes 246 genes (as indicated by the red line in Figure 3F). Of these genes, 66 genes (27%) are found in the set of 448 early brood genes. The proportion of overlap increases as the number of iterations required to consider a gene predictive increases, e.g., 34% of genes found in 5 of 500 iterations and 59% of genes found in 10 of 500 iterations overlap with the 448 early brood genes. However, likely because of the approach to identify groups of 10 genes that are predictive, we do not find significant GO terms among the 246 genes identified with this approach after multiple test correction. We think this makes sense because the LMM identifies genes that are individually associated with early brood, whereas each subsequent gene included in multiple regression affects early brood after controlling for all previous genes. These additional genes added to the multiple regression are unlikely to have similar patterns as genes that are individually correlated with early brood.  

      (10) Elastic nets: prediction power is similar or better than multiple regression, but what is the overlap between genes selected by the elastic net (not presented if I am not mistaken) and multiple regression and the linear mixed model?

      For the elastic net models, we used a leave-one-out cross validation approach, meaning there were separate models fit by leaving out the trait data for each worm, training a model using the trait data and transcriptomic data for the other worms, and using the transcriptomic data of the remaining worm to predict the trait data. By repeating this for each worm, the regressions shown in the paper were obtained. Each of these models therefore has its own set of genes. Of the 180 models for early brood, the median model selects 83 genes (range from 72 to 114 genes). Across all models, 217 genes were selected at least once. Interestingly, there was a clear bimodal distribution in terms of how many models a given gene was selected for: 68 genes were selected in over 160 out of 180 models, while 114 genes were selected in fewer than 20 models (and 45 genes were selected only once). Therefore, we consider the set of 68 genes as highly robustly selected, since they were selected in the vast majority of models. This set of 68 exhibits substantial overlap with both the set of 448 early brood-associated genes (43 genes or 63% overlap) and the multiple regression set of 246 genes (54 genes or 79% overlap). For ELO, the median model selected 136 genes (range of 96 to 249 genes) and a total of 514 genes were selected at least once. The distribution for ELO was also bimodal with 78 genes selected over 160 times and 255 genes selected fewer than 20 times. This set of 78 included 6 of the 11 significant ELO genes identified in the LMM.  We have added tabs to Supplementary File 1 that include the list of genes selected for the elastic net models as well as a count of how many times they were selected out of 180 models.
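
      The tallying step described above (counting in how many leave-one-out models each gene is selected, then keeping genes selected in nearly all of them) can be sketched as follows. This is an illustrative assumption about how such a tally could be computed, not the authors' code; the gene names, the five toy models, and the cutoff are hypothetical stand-ins for the 180 models and the "over 160" cutoff in the text.

```python
from collections import Counter


def selection_counts(models):
    """Count, for each gene, how many models selected it.
    `models` is a list with one collection of selected genes per model."""
    counts = Counter()
    for selected in models:
        counts.update(set(selected))  # count each gene once per model
    return counts


def robust_genes(counts, min_models):
    """Genes selected in more than `min_models` models, sorted by name."""
    return sorted(gene for gene, n in counts.items() if n > min_models)


# Toy example: 5 leave-one-out models with hypothetical gene names.
models = [
    {"geneA", "geneB", "geneC"},
    {"geneA", "geneB"},
    {"geneA", "geneB", "geneD"},
    {"geneA", "geneB"},
    {"geneA", "geneE"},
]
counts = selection_counts(models)
robust = robust_genes(counts, min_models=3)
print(robust)  # ['geneA', 'geneB']
```

      A histogram of the values in `counts` is what would reveal a bimodal pattern like the one described, with one group of genes selected in almost every model and another selected only rarely.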

      (11) In other words, do these different approaches yield similar sets of genes, or are there some differences? In the end, which approach is actually giving the best predictive power?

      From the perspective of R2, both the multiple regression and elastic net models are similarly predictive for early brood, but elastic net is more predictive for ELO. However, in presenting multiple approaches, part of our goal was identifying predictive genes that could be considered the ‘best’ in different contexts. The multiple regression was set to identify exactly 10 genes, whereas the elastic net model determined the optimal number of genes to include, which was always over 70 genes. Thus, the elastic net model is likely better if one has gene expression data for the entire transcriptome, whereas the multiple regression genes are likely more useful if one were to use reporters or qRT-PCR to measure a more limited number of genes.

      (12) Line 252: "Within this curated set, genes causally affected early brood in 5 of 7 cases compared to empty vector (Figure 4A)." It seems to me 4 out of 7 from Figure 4A.

      In Figure 4A the five genes are (1) cin-4, (2) puf-5; puf-7, (3) eef-1A.2, (4) C34C12.8, and (5) tir-1. We did not count nex-2 (p = 0.10) or gly-13 (p = 0.07), and empty vector is the control.

      (13) Do puf-5 and -7 affect total brood size or only early brood size? Not clear. What's the effect of single puf-5 and puf-7 RNAi on brood?

      We only measured early brood in this paper, but a previous report found that puf-5 and puf-7 act redundantly to affect oogenesis, and RNAi is only effective if both are knocked down together (2). We performed pilot experiments to confirm that this was the case in our hands as well.

      (14) To truly understand if the noise in expression of puf-5 and/or puf-7 really causes some of the observed difference in early brood, could the authors use a reporter and dose response RNAi to reduce the level of puf-5/7 to match the lower physiological noise range and observe if the magnitude of the reduction of early brood by the right amount of RNAi indeed matches the observed physiological "noise" effect of puf-5/7 on early brood?

      We agree that it would be interesting to do the dose response of RNAi, measure early brood, and get a readout of mRNA levels to determine the true extent of gene knockdown in each worm (since RNAi can be noisy) and whether this corresponds to early brood when the knockdown is at physiological levels. While we believe we have shown that a dose response of gene knockdown results in a dose response of early brood, this additional analysis would be of interest for future experiments.

      (15) Regulated soma genes (enriched in H3K27me3) are negatively correlated with early brood. What would be the mechanism there? As mentioned before, it is more likely that these genes are just indicative of variation in somatic vs germline age (maybe due to latent differences in parental perception of pheromone).

      We can think of a few potential mechanisms/explanations, but at this point we do not have a decisive answer. Regulated somatic genes marked with H3K27me3 (facultative heterochromatin) are expressed in particular tissues and/or at particular times in development. In this study and others, genes marked with H3K27me3 exhibit more gene expression noise than genes with other marks. This could suggest that there are negative consequences for the animal if genes are expressed at higher levels at the wrong time or place, and one interpretation of the negative association is that higher expression of somatic genes results in lower fitness (where early brood is a proxy for fitness). Another related interpretation is that there are tradeoffs between somatic and germline development and each individual animal lands somewhere on a continuum between prioritizing germline or somatic development, where prioritizing somatic integrity (e.g. higher expression of somatic genes) comes at a cost to the germline, resulting in fewer progeny. Additional experiments, including measurements of histone marks in worms measured for the early brood trait, would likely be required to more decisively answer this question.

      (16) Line 151: "Among significant genes for both traits, β2 values were consistently lower than β1 (Figures 2CD), suggesting some of the total effect size was driven by environmental history rather than pure noise".

      We are interpreting this quote as part of point 17 below.

      (17) It looks like most of the genes associated with phenotypes from the univariate model have a decreased effect once you account for life history, but have you checked for cases where the life history actually masks the effect of a gene? In other words, do you have cases where the effect of gene expression on a phenotype is only (or more) significant after you account for the effect of life history (β2 values higher than β1)?

      This is a good question and one that we did not explicitly address in the paper because we focused on beta values for genes that were significant in the univariate analysis. Indeed, for the sets of 448 early brood genes and 11 ELO genes, there are no genes for which β2 is larger than β1. In looking at the larger dataset of 8824 genes, with a Bonferroni-corrected p-value of 0.05, there are 306 genes with a significant β2 for early brood. The majority (157 genes) overlap with the 448 genes significant in the univariate analysis and do not have a higher β2 than β1. Of the remaining genes, 72 have a larger β2 than β1. However, in most cases, this difference is relatively small (median difference of 0.025) and likely insignificant. There are only three genes in which β1 is not nominally significant, and these are the three genes with the largest difference between β1 and β2 with β2 being larger (differences of 0.166, 0.155, and 0.12). In contrast, the median difference between β1 and β2 for the 448 genes (in which β1 is larger) is 0.17, highlighting that the most extreme examples of β2 > β1 are smaller in magnitude than the typical case of β1 > β2. For ELO, there are no notable cases where β2 > β1. There are eight genes with a significant β2 value, and all of these have a β1 value that is nominally significant. Therefore, while this phenomenon does occur, we find it to be relatively rare overall. For completeness, we have added the β1 and β2 values for all 8824 genes as a tab in Supplementary File 1.
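
      To make the β1 versus β2 comparison concrete, the following is a minimal Python sketch, not the authors' actual pipeline. It assumes ordinary least squares with a single numeric life-history covariate; the function names (`slope`, `residualize`, `beta_pair`, `bonferroni_threshold`) and the toy data are hypothetical. β1 is the expression slope from the univariate fit, and β2 is the expression slope after adjusting for life history, obtained here by regressing the covariate out of both variables first.

```python
from statistics import fmean


def slope(x, y):
    """Ordinary least-squares slope of y on x (with intercept)."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residualize(values, covariate):
    """Remove the linear effect of one covariate from values."""
    b = slope(covariate, values)
    mv, mc = fmean(values), fmean(covariate)
    return [v - mv - b * (c - mc) for v, c in zip(values, covariate)]


def beta_pair(expr, pheno, history):
    """Return (beta1, beta2) for one gene.

    beta1: slope of phenotype on expression alone (univariate model).
    beta2: slope of phenotype on expression after adjusting for the
           life-history covariate (partial regression).
    """
    beta1 = slope(expr, pheno)
    beta2 = slope(residualize(expr, history), residualize(pheno, history))
    return beta1, beta2


def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff under Bonferroni correction."""
    return alpha / n_tests


# Toy data (hypothetical): life history shifts both expression and phenotype.
history = [0, 0, 0, 0, 1, 1, 1, 1]
noise = [-1, 1, -1, 1, -1, 1, -1, 1]
expr = [h + e for h, e in zip(history, noise)]
pheno = [0.5 * x + 1.0 * h for x, h in zip(expr, history)]

b1, b2 = beta_pair(expr, pheno, history)
print(f"beta1={b1:.3f} beta2={b2:.3f}")
print(f"Bonferroni cutoff for 8824 genes: {bonferroni_threshold(0.05, 8824):.2e}")
```

      In this toy data set, part of the univariate slope is carried by the covariate, so β2 (0.5) is smaller than β1 (0.7), which is the direction of the typical case described above. A gene where life history masks the effect would show the opposite ordering.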

    1. Author Response:

      The following is the authors’ response to the original reviews.

      General response

      (1) Evaluation of mitochondrial activity in mox-YG overexpression cells

      To determine whether the observed “mitochondrial development” seen in transcriptomic, proteomic, and microscopic analyses corresponds to an actual phenotypic shift toward respiration, we measured oxygen consumption in mox-YG overexpression cells. The results showed that oxygen consumption rates were indeed elevated in these cells, suggesting a metabolic shift from fermentation toward respiration. These findings have been incorporated into the revised manuscript as new Figure 4E and Figure 4—figure supplement 9, along with the corresponding descriptions in the Results section.

      (2) Evaluation of TORC1 Pathway Inactivation in mox-YG Overexpression Cells

      While the proteomic response in mox-YG overexpression cells overlapped with known responses to TORC1 pathway inactivation, we had not obtained direct evidence that TORC1 activity was indeed reduced. To address this, we assessed TORC1 activity by testing the effect of rapamycin, a TORC1 inhibitor, and by attempting to detect the phosphorylation state of known TORC1 targets. Our results showed that mox-YG overexpressing cells exhibited reduced sensitivity to rapamycin compared to vector control cells, supporting the idea that TORC1 is already inactivated in the mox-YG overexpression condition.

      In parallel, we attempted to detect phosphorylation of TORC1 targets Sch9 and Atg13 by Western blotting. Specifically, we tested several approaches: detecting phospho-Sch9 using a phospho-specific antibody, assessing the band shift of HA-tagged Sch9, and monitoring Atg13 band shift using an anti-Atg13 antibody. While we were unable to detect Sch9 phosphorylation, likely due to technical limitations, we finally succeeded in detecting Atg13 with the help of our new co-author, Dr. Kamada. However, we observed a marked reduction in Atg13 protein levels in mox-YG overexpression cells, making it difficult to interpret the biological significance of any apparent decrease in phosphorylation. Therefore, we decided not to pursue further experiments on TORC1 phosphorylation within the current revision period.

      These findings have been summarized in new Figure 4—figure supplement 7, and the relevant description has been added to the Results section.

      (3) Phenotypes of Gpm1-CCmut

      We focused our initial analysis on the phenotypes of cells overexpressing mox-YG, the protein with the lowest Neutrality Index (NI) in our dataset, as a model of protein burden. However, it remained unclear to what extent the phenotypes observed in mox-YG overexpression cells are generalizable to protein burden as a whole. We agree with the reviewers’ suggestion that it is important to examine whether similar phenotypes are also observed in cells overexpressing Gpm1-CCmut, which was newly identified in this study as having a similarly low NI. We therefore performed validation experiments using Gpm1-CCmut overexpression cells to assess whether they exhibit the characteristic phenotypes observed in mox-YG overexpression cells. These phenotypes included: transcriptional responses, mitochondrial development, metabolic shift toward respiration, and nucleolar shrinkage.

      As a result, mitochondrial development and nucleolar shrinkage were also observed in Gpm1-CCmut overexpression cells, consistent with mox-YG. In contrast, the transcriptional response associated with amino acid starvation and the metabolic shift toward respiration were not observed. Furthermore, an abnormal rounding of cell morphology—absent in mox-YG overexpression cells—was uniquely observed in Gpm1-CCmut cells. These results suggest that the phenotypes observed under mox-YG overexpression may comprise both general effects of protein burden and effects specific to the mox-YG protein. Alternatively, it is possible that Gpm1-CCmut imposes a different kind of constraint or toxicity not shared with mox-YG. In any case, these findings highlight that the full range of phenotypes associated with protein burden cannot yet be clearly defined and underscore the need for future analyses using a variety of “non-toxic” proteins.

      Given that these results form a coherent set, we have relocated original Figure 3—which previously presented the NI values of Gpm1 and Tdh3 in the original version—to new Figure 6, which now includes all related phenotypic analyses. Correspondingly, we have added new Figures 6—figure supplement 1 through 6—figure supplement 7. The associated results have been incorporated into the Results section, and we have expanded the Discussion to address this point.

      As a result of these revisions, the order of figures has changed from the original version. The correspondence between the original and revised versions is as follows:

      Original → Revised

      Figure 1 → Figure 1
      Figure 2 → Figure 2
      Figure 3 → Figure 6
      Figure 4 → Figure 3
      Figure 5 → Figure 4
      Figure 6 → Figure 5

      Public Reviews:

      Reviewer #1 (Public Review):

      Weaknesses:

      While the introduction of the neutrality index seems useful to differentiate between cytotoxicity and protein burden, the biological relevance of the effects of overexpression of the model proteins is unclear.

      Thank you for your comment. This point is in fact the core message we wished to convey in this study. We believe that every protein possesses some degree of what can be described as “cytotoxicity,” and that this should be defined by the expression limit—specifically, the threshold level at which growth inhibition occurs. This index corresponds to what we term the neutrality index. We further argue that protein cytotoxicity arises from a variety of constraints inherent to each protein. These constraints act in a stepwise manner to determine the expression limit (i.e., the neutrality) of a given protein (Figure 1A). To demonstrate the real existence of such constraints, there are two complementary approaches: an inductive one that involves large-scale, systematic investigation of naturally occurring proteins, and a deductive one that tests hypotheses using selected model proteins. Our current study follows the latter approach. In addition, we define protein burden as a phenomenon that can only be elicited by proteins that are ultimately harmless (Figure 1B). We assume that such burden results in a shared physiological state, such as depletion of cellular resources. Through continued efforts to identify a protein suitable for investigating this phenomenon, we eventually arrived at mox-YG. As the reviewer rightly pointed out, examining only mox-YG does not reveal the full picture of protein burden. In fact, in response to the reviewer’s suggestion, we investigated the physiological consequences of overexpressing a mutant glycolytic protein, Gpm1-CCmut (General Response 3). We found that the resulting phenotype was notably different from that observed in cells overexpressing mox-YG. Going forward, we believe that our study provides a foundation for further systematic exploration of “harmless proteins” and the cellular impacts of their overexpression.

      Reviewer #2 (Public Review):

      Weaknesses:

      The authors concluded from their RNA-seq and proteomics results that cells with excess mox-YG expression showed increased respiration and TORC1 inactivation. I think it will be more convincing if the authors can show some characterization of mitochondrial respiration/membrane potential and the TOR responses to further verify their -omic results.

      These points are addressed in General Response 1 and 2.

      In addition, the authors only investigated how overexpression of mox-YG affects cells. It would be interesting to see whether overexpressing other non-toxic proteins causes similar effects, or if there are protein-specific effects. It would be good if the authors could at least discuss this point considering the workload of doing another RNA-seq or mass-spectrum analysis might be too heavy.

      These points are addressed in General Response 3.

      Reviewer #3 (Public Review):

      Weaknesses:

      The data are generally convincing, however in order to back up the major claim of this work - that the observed changes are due to general protein burden and not to the specific protein or condition - a broader analysis of different conditions would be highly beneficial.

      These points are addressed in General Response 3.

      Major points:

      (1) The authors identify several proteins with high neutrality scores but only analyze the effects of mox/mox-YG overexpression in depth. Hence, it remains unclear which molecular phenotypes they observe are general effects of protein burden or more specific effects of these specific proteins. To address this point, a proteome (and/or transcriptome) of at least a Gpm1-CCmut expressing strain should be obtained and compared to the mox-YG proteome. Ideally, this analysis should be done simultaneously on all strains to achieve a good comparability of samples, e.g. using TMT multiplexing (for a proteome) or multiplexed sequencing (for a transcriptome). If feasible, the more strains that can be included in this comparison, the more powerful this analysis will be and can be prioritized over depth of sequencing/proteome coverage.

      This comment has been addressed in General Response 3. Gpm1-CCmut overexpression cells exhibited both phenotypes that were shared with, and distinct from, those observed in mox-YG overexpression cells. To define a unified set of phenotypes associated with "protein burden," we believe that extensive omics analyses targeting multiple "non-toxic" protein overexpression strains will be necessary. However, such an effort goes beyond the scope of the current study, and we would like to leave it as an important subject for future investigation.

      (2) The genetic tug-of-war system is elegant but comes at the cost of requiring specific media conditions (synthetic minimal media lacking uracil and leucine), which could be a potential confound, given that metabolic rewiring, and especially nitrogen starvation are among the observed phenotypes. I wonder if some of the changes might be specific to these conditions. The authors should corroborate their findings under different conditions. Ideally, this would be done using an orthogonal expression system that does not rely on auxotrophy (e.g. using antibiotic resistance instead) and can be used in rich, complex mediums like YPD. Minimally, using different conditions (media with excess or more limited nitrogen source, amino acids, different carbon source, etc.) would be useful to test the robustness of the findings towards changes in media composition.

      We appreciate the reviewer’s clear understanding of both the advantages and limitations of the gTOW system. As rightly pointed out, since our system relies on leucine depletion, it is essential to carefully consider the potential impact this may have on cellular metabolism. Another limitation—though it also serves as one of the strengths—of the gTOW system is its reliance on copy number variation to achieve protein overexpression. This feature limits the possibility of observing rapid responses, as immediate induction is not feasible. To address this issue, we have recently developed a strong and inducible promoter that minimizes effects on other metabolic systems (Higuchi et al., 2024), and we believe this tool will be essential in future experiments.

      In response to the reviewer’s comments, we conducted two additional sets of experiments. First, we established a new overexpression system in nutrient-rich conditions (YPD medium) that is conceptually similar to gTOW but uses aureobasidin A and the AUR1d resistance gene to promote gene amplification (new Figure 4—figure supplement 2). Using this system, we observed that non-fluorescent YG mutants led to increased expression of mox. Total protein levels appeared to rise correspondingly, suggesting that the overall synthetic capacity of cells might be higher in YPD compared to SC medium. However, the degree of overexpression achieved in this system was insufficient to strongly inhibit growth, meaning we could not replicate the stress conditions observed with the original gTOW system. Further studies will be needed to determine whether stronger induction under these nutrient-rich conditions will yield comparable responses.

      Second, we performed a control experiment to examine whether the amino acid starvation response observed in mox-YG overexpressing cells could be attributed to leucine depletion from the medium (new Figure 3—figure supplement 3). By titrating leucine concentrations in SC medium, we confirmed that lower leucine levels reduced the growth rate of vector control cells, indicating leucine limitation. However, GAP1 induction was not observed under these conditions. In contrast, mox-YG overexpression led to strong GAP1 induction under similar growth-inhibitory conditions, suggesting that the amino acid starvation response is not simply due to environmental leucine depletion, but rather a consequence of the cellular burden imposed by mox-YG overexpression.

      These findings have been incorporated into the manuscript, along with the corresponding figures (new Figure 4—figure supplement 2, Figure 3—figure supplement 3), and relevant descriptions have been added to the Results and Discussion sections.

      (3) The authors suggest that the TORC1 pathway is involved in regulating some of the changes they observed. This is likely true, but it would be great if the hypothesis could be directly tested using an established TORC1 assay.

      This comment has been addressed in General Response 2. We assessed the rapamycin sensitivity of mox-YG overexpression cells—which was found to be reduced—and attempted to detect phosphorylation of the TORC1 target Atg13, although the latter was only partially successful. These findings have been incorporated into the Results section.

      (4) The finding that the nucleolus appears to be virtually missing in mox-YG-expressing cells (Figure 6B) is surprising and interesting. The authors suggest possible mechanisms to explain this and partially rescue the phenotype by a reduction-of-function mutation in an exosome subunit. I wonder if this is specific to the mox-YG protein or a general protein burden effect, which the experiments suggested in point 1 should address. Additionally, could a mox-YG variant with a nuclear export signal be expressed that stays exclusively in the cytosol to rule out that mox-YG itself interferes with phase separation in the nucleus?

      As also described in our General Response 3, we observed nucleolar shrinkage upon Gpm1-CCmut overexpression as well (new Figure 6E and 6—figure supplement 7), suggesting that this phenomenon may represent a general feature of protein burden. The reviewer’s suggestion to test whether this effect persists when mox-YG is excluded from the nucleus is indeed intriguing. However, based on our previous work, we have shown that overexpression of NES-tagged proteins (e.g., NES-EGFP) causes severe growth inhibition due to depletion of nuclear export factors (Kintaka et al., 2020). Unfortunately, this technical limitation makes it difficult for us to carry out the proposed experiment as suggested.

      Minor points:

      (5) It would be great if the authors could directly compare the changes they observed at the transcriptome and proteome levels. This can help distinguish between changes that are transcriptionally regulated versus more downstream processes (like protein degradation, as proposed for ribosome components).

      We also considered this point to be important, and therefore compared the transcriptomic and proteomic changes associated with mox-YG overexpression. However, somewhat unexpectedly, we found little correlation between these two layers of response. As shown in new Figures 3 and 4 (original Figures 4 and 5), while genes related to oxidative phosphorylation were consistently upregulated at both the mRNA and protein levels in mox-YG overexpressing cells, ribosomal proteins showed a discordant pattern: their mRNA levels were significantly increased, whereas their protein levels were significantly decreased.

      Several factors may explain this discrepancy: (1) differences in analytical methods between transcriptomics and proteomics; (2) temporal mismatches arising from the dynamic changes in mRNA and protein expression during batch culture; and (3) the possibility that, under protein burden conditions, specific regulatory mechanisms may govern the selective translation or targeted degradation of certain proteins. However, at this point, we were unable to clearly determine which of these factors accounts for the observed differences.

      For this reason, we did not originally include a global transcriptome–proteome comparison in the manuscript. In response to the reviewer’s comment, however, we have now included the comparison data (new Figure 4—figure supplement 3D).
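
      As an illustration of what such a global comparison involves, here is a minimal Python sketch under stated assumptions. It is not the analysis used in the manuscript: the gene labels and log2 fold-change values are invented, and `pearson` and `discordant_genes` are hypothetical helper names. The sketch computes the correlation between mRNA and protein changes across shared genes and lists genes whose two layers move in opposite directions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def discordant_genes(mrna_lfc, protein_lfc):
    """Genes whose mRNA and protein log2 fold changes have opposite signs."""
    shared = sorted(set(mrna_lfc) & set(protein_lfc))
    return [g for g in shared if mrna_lfc[g] * protein_lfc[g] < 0]


# Invented log2 fold changes (overexpression vs. vector control).
mrna_lfc = {"oxphos_a": 1.0, "oxphos_b": 2.0, "ribo_a": 1.0, "ribo_b": 2.0}
protein_lfc = {"oxphos_a": 1.0, "oxphos_b": 2.0, "ribo_a": -1.0, "ribo_b": -2.0}

genes = sorted(set(mrna_lfc) & set(protein_lfc))
r = pearson([mrna_lfc[g] for g in genes], [protein_lfc[g] for g in genes])
flagged = discordant_genes(mrna_lfc, protein_lfc)
print(f"r = {r:.2f}; discordant: {flagged}")
```

      In this toy example one gene group agrees across layers and the other is inverted, so the global correlation is near zero even though each group behaves consistently on its own. This is one way a weak overall correlation can coexist with clear category-level patterns.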

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Major points:

      (1) While the study provides a detailed description of physiological changes, the underlying mechanisms remain speculative. For example, the exact reasons for nitrogen source depletion or increased respiration are unclear. The transcriptomic and proteomic data should be complemented by basic growth assay tests on rapamycin or glycerol to strengthen these observations.

      This comment has been addressed in General Responses 1 and 2. We conducted oxygen consumption assays and growth assays in the presence of rapamycin, and incorporated these results into the revised version of the manuscript.

      We also performed culture experiments using glycerol as a carbon source. However, both the vector control and mox-YG overexpression cells showed extremely poor growth. Although there was a slight difference between the two, we judged that it would be difficult to draw any meaningful conclusions from these results. Therefore, we have chosen not to include them in the main text (the data are attached below for reference).

      Author response image 1.

      (2) The study mainly focuses on two proteins, mox-YG/ FP proteins and Gpm1-CCmut. Did the authors look also at a broader range of proteins with varying degrees of cytotoxicity to validate the neutrality index and generalize their findings? Such as known cytotoxic proteins.

      In our calculation of the Neutrality Index (NI), we use two parameters: the maximum growth rate (expressed as %MGR relative to the control) and the protein expression level. For the latter, we measure the abundance of the overexpressed protein as a percentage of total cellular protein, based on the assumption that the protein is expressed at a sufficiently high level to be detectable by SDS-PAGE. In our view, proteins typically regarded as “cytotoxic” cannot be overexpressed to levels detectable by SDS-PAGE without the use of more sensitive techniques such as Western blotting. This limitation in expression itself is an indication of their high cytotoxicity. Consequently, for such proteins, NI is determined solely by the MGR value, and will inherently fall below 100.

      To test whether this interpretation is valid, we re-evaluated a group of EGFP variants previously reported by us to exhibit higher cytotoxicity than EGFP (Kintaka et al., 2016), due to overloading of specific cellular transport pathways. These include EGFPs tagged with localization signals. At the time of the original study, we had not calculated their NI values. Upon re-analysis, we found that all of these localization-tagged EGFP variants indeed have NI values below 100.

      This result has been included as a new Figure 2—figure supplement 3, and the relevant descriptions have been added to the Results section.

      (3) The partial rescue of ribosomal biosynthesis defects by a mutation in the nuclear exosome is intriguing but not fully explored. The specific role of the nuclear exosome in managing protein burden remains unclear. This result could be supported by alternative experiments. For example, would tom1 deletion or proteasome inhibition (degradation of ribosomal proteins in the nucleus) partially rescue the nuclear formation?

      As described in the main text, our interest in exosome mutants was prompted by our previous SGA (Synthetic Genetic Array) analysis, in which these mutants exhibited positive genetic interactions with GFP overexpression—namely, they acted in a rescuing manner (Kintaka et al., 2020). In contrast, proteasome mutants did not show such positive interactions in the same screening. On the contrary, proteasome mutants that displayed negative genetic interactions have been identified, such as the pre7ts mutant. Furthermore, the proteasome is involved in various aspects of proteostasis beyond just orphan ribosomal proteins, making the interpretation of its effects potentially quite complex.

      Regarding the TOM1 mutant raised by the reviewer, we attempted to observe nucleolar morphology using the NSR1-mScarlet-I marker in the tom1Δ deletion strain. However, we were unsuccessful in constructing the strain. This failure may be due to the strong detrimental effects of this perturbation in the tom1Δ background. As we were unable to complete this experiment within the revision period, we would like to address this issue in future work.

      Minor comments:

      (1) It would be interesting to include long-term cellular and evolutionary responses to protein overexpression to understand how cells adapt to chronic protein burden.

      Thank you for the suggestion. We are currently conducting experiments related to these points. However, as they fall outside the scope of the present study, we would like to refrain from including the data in this manuscript.

      (2) The microscopy of Nsr1 in Figure 6G does not clearly demonstrate the restored formation of the nucleolus in the mtr4-1 mutant. Electron microscopy images would be a better demonstration.

      The restoration of nucleolar size in the mtr4-1 mutant, as shown in Figure 5—figure supplement 5 (original Figure 6_S5), is statistically significant. However, as described in the main text, the degree of rescue by the mutation is partial, and, as the reviewer notes, not clearly distinguishable by eye. It becomes apparent only when analyzing a large number of cells, allowing for detection as a statistically significant difference. Given that electron microscopy images are inherently limited in the number of cells that can be analyzed and pose challenges for statistical evaluation, we believe it would be difficult to detect such a subtle difference using this method. Therefore, we respectfully ask for your understanding that we will not include additional EM experiments in this revision.
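
      The sample-size argument can be sketched numerically. The following Python example is an illustration only: the effect size and standard deviation are hypothetical and not taken from the data, and `z_for_difference` and `cells_needed` are made-up helper names. It assumes two groups of equal size and equal standard deviation and uses a simple two-sample z statistic.

```python
import math


def z_for_difference(mean_diff, sd, n_per_group):
    """Expected two-sample z statistic for a given true mean difference.

    Assumes equal standard deviation and equal group sizes.
    """
    return mean_diff / (sd * math.sqrt(2.0 / n_per_group))


def cells_needed(mean_diff, sd, z_crit=1.96):
    """Smallest per-group cell count at which the expected z reaches z_crit."""
    return math.ceil(2.0 * (z_crit * sd / mean_diff) ** 2)


# Hypothetical subtle effect: a shift of 0.1 standard deviations.
for n in (20, 2000):
    print(n, round(z_for_difference(0.1, 1.0, n), 2))
print("cells per group needed:", cells_needed(0.1, 1.0))
```

      With these assumed numbers, a few dozen cells per group give an expected z well below the conventional 1.96 cutoff, whereas a couple of thousand cells exceed it. This is why a partial rescue can be statistically significant in light microscopy yet hard to demonstrate with a method that images only a handful of cells.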

      (3) On page 24, line 451 it says that of the 84 ribosomal proteins... latest reviews and structures described/ identified 79 ribosomal proteins in budding yeast of which the majority are incorporated into the pre-ribosomal particles in the nucleolus. We could not find this information in the provided reference. Please align with the literature.

      Thank you for the comment. In S. cerevisiae, many ribosomal protein genes exist as paralog pairs arising from gene duplication, resulting in a total of 136 ribosomal proteins (http://ribosome.med.miyazaki-u.ac.jp/rpg.cgi?mode=genetable). However, not all of them are duplicated, and among the duplicated pairs, some can be distinguished by proteomic analysis based on differences in amino acid sequences, while others cannot. As a result, we report that 84 ribosomal proteins were “detected” in our proteomic analysis. To avoid confusion, we have added the following explanation to the legend of Figure 5—figure supplement 1 (original Figure 6_S1):

      “Note that when the amino acid sequences of paralogs are identical, they cannot be distinguished by proteomic analysis, and the protein abundance of both members of the paralog pair is represented under the name of only one.”

      Reviewer #2 (Recommendations for the authors):

      (1) The authors mentioned that based on their proteomics results, overexpressing mox-YG appears to increase respiration. I think it is worth doing some quick verification, such as oxygen consumption experiments or mitochondrial membrane potential staining to provide some verification on that.

      This comment has been addressed in General Response 1. We measured oxygen consumption in mox-YG overexpression cells and found that it was indeed elevated, suggesting a metabolic shift from fermentation toward aerobic respiration.

      (2) Similar to point 1, the authors concluded from their proteomics data that the mox-YG overexpression induced responses that are similar to TORC1 inactivation. It might be worth testing whether there is any actual TORC1 inactivation, e.g. by detecting whether there is reduced Sch9 phosphorylation by western blot.

      This comment has been addressed in General Response 2. We assessed the rapamycin sensitivity of mox-YG overexpression cells—which was found to be reduced—and attempted to detect phosphorylation of the TORC1 target Atg13, although the latter was only partially successful. These findings have been incorporated into the Results section.

      (3) The authors showed that overexpressing excess mox-YG caused downregulated glycolysis pathways. It is worth discussing whether overexpressing glycolysis-related non-toxic proteins such as Gpm1-CCmut will also lead to similar results.

      This comment has been addressed in General Response 3. Gpm1-CCmut overexpression cells exhibited both phenotypes shared with mox-YG overexpression and distinct ones. These findings suggest that a unified set of phenotypes associated with "protein burden" has yet to be clearly defined, and further investigation will be necessary to elucidate this.

      Reviewer #3 (Recommendations for the authors):

      (1) The authors identify several proteins with high neutrality scores but only analyze the effects of mox/mox-YG overexpression in depth. Hence, it remains unclear which molecular phenotypes they observe are general effects of protein burden or more specific effects of these specific proteins. To address this point, a proteome (and/or transcriptome) of at least a Gpm1-CCmut expressing strain should be obtained and compared to the mox-YG proteome. Ideally, this analysis should be done simultaneously on all strains to achieve a good comparability of samples, e.g. using TMT multiplexing (for a proteome) or multiplexed sequencing (for a transcriptome). If feasible, the more strains that can be included in this comparison, the more powerful this analysis will be and can be prioritized over depth of sequencing/proteome coverage.

      This comment has been addressed in General Response 3. Gpm1-CCmut overexpression cells exhibited both phenotypes that were shared with, and distinct from, those observed in mox-YG overexpression cells. To define a unified set of phenotypes associated with "protein burden," we believe that extensive omics analyses targeting multiple "non-toxic" protein overexpression strains will be necessary. However, such an effort goes beyond the scope of the current study, and we would like to leave it as an important subject for future investigation.

      (2) The genetic tug-of-war system is elegant but comes at the cost of requiring specific media conditions (synthetic minimal media lacking uracil and leucine), which could be a potential confound, given that metabolic rewiring, and especially nitrogen starvation are among the observed phenotypes. I wonder if some of the changes might be specific to these conditions. The authors should corroborate their findings under different conditions. Ideally, this would be done using an orthogonal expression system that does not rely on auxotrophy (e.g. using antibiotic resistance instead) and can be used in rich, complex mediums like YPD. Minimally, using different conditions (media with excess or more limited nitrogen source, amino acids, different carbon source, etc.) would be useful to test the robustness of the findings towards changes in media composition.

      We appreciate the reviewer’s clear understanding of both the advantages and limitations of the gTOW system. As rightly pointed out, since our system relies on leucine depletion, it is essential to carefully consider the potential impact this may have on cellular metabolism. Another limitation—though it also serves as one of the strengths—of the gTOW system is its reliance on copy number variation to achieve protein overexpression. This feature limits the possibility of observing rapid responses, as immediate induction is not feasible. To address this issue, we have recently developed a strong and inducible promoter that minimizes effects on other metabolic systems (Higuchi et al., 2024), and we believe this tool will be essential in future experiments.

      In response to the reviewer’s comments, we conducted two additional sets of experiments. First, we established a new overexpression system in nutrient-rich conditions (YPD medium) that is conceptually similar to gTOW but uses aureobasidin A and the AUR1d resistance gene to promote gene amplification (new Figure 4—figure supplement 2). Using this system, we observed that non-fluorescent YG mutants led to increased expression of mox. Total protein levels appeared to rise correspondingly, suggesting that the overall synthetic capacity of cells might be higher in YPD compared to SC medium. However, the degree of overexpression achieved in this system was insufficient to strongly inhibit growth, meaning we could not replicate the stress conditions observed with the original gTOW system. Further studies will be needed to determine whether stronger induction under these nutrient-rich conditions will yield comparable responses.

      Second, we performed a control experiment to examine whether the amino acid starvation response observed in mox-YG overexpressing cells could be attributed to leucine depletion from the medium (new Figure 3—figure supplement 3). By titrating leucine concentrations in SC medium, we confirmed that lower leucine levels reduced the growth rate of vector control cells, indicating leucine limitation. However, GAP1 induction was not observed under these conditions. In contrast, mox-YG overexpression led to strong GAP1 induction under similar growth-inhibitory conditions, suggesting that the amino acid starvation response is not simply due to environmental leucine depletion, but rather a consequence of the cellular burden imposed by mox-YG overexpression.

      These findings have been incorporated into the manuscript, along with the corresponding figures (new Figure 4—figure supplement 2, Figure 3—figure supplement 3), and relevant descriptions have been added to the Results and Discussion sections.

      (3) The authors suggest that the TORC1 pathway is involved in regulating some of the changes they observed. This is likely true, but it would be great if the hypothesis could be directly tested using an established TORC1 assay.

      This comment has been addressed in General Response 2. We assessed the rapamycin sensitivity of mox-YG overexpression cells—which was found to be reduced—and attempted to detect phosphorylation of the TORC1 target Atg13, although the latter was only partially successful. These findings have been incorporated into the Results section.

      (4) The finding that the nucleolus appears to be virtually missing in mox-YG-expressing cells (Figure 6B) is surprising and interesting. The authors suggest possible mechanisms to explain this and partially rescue the phenotype by a reduction-of-function mutation in an exosome subunit. I wonder if this is specific to the mox-YG protein or a general protein burden effect, which the experiments suggested in point 1 should address. Additionally, could a mox-YG variant with a nuclear export signal be expressed that stays exclusively in the cytosol to rule out that mox-YG itself interferes with phase separation in the nucleus?

      As also described in our General Response 3, we observed nucleolar shrinkage upon Gpm1-CCmut overexpression as well (new Figure 6E and 6—figure supplement 7), suggesting that this phenomenon may represent a general feature of protein burden. The reviewer’s suggestion to test whether this effect persists when mox-YG is excluded from the nucleus is indeed intriguing. However, based on our previous work, we have shown that overexpression of NES-tagged proteins (e.g., NES-EGFP) causes severe growth inhibition due to depletion of nuclear export factors (Kintaka et al., 2020). Unfortunately, this technical limitation makes it difficult for us to carry out the proposed experiment as suggested.

      (5) It would be great if the authors could directly compare the changes they observed at the transcriptome and proteome levels. This can help distinguish between changes that are transcriptionally regulated versus more downstream processes (like protein degradation, as proposed for ribosome components).

      We also considered this point to be important, and therefore compared the transcriptomic and proteomic changes associated with mox-YG overexpression. However, somewhat unexpectedly, we found little correlation between these two layers of response. As shown in new Figures 3 and 4 (original Figures 4 and 5), while genes related to oxidative phosphorylation were consistently upregulated at both the mRNA and protein levels in mox-YG overexpressing cells, ribosomal proteins showed a discordant pattern: their mRNA levels were significantly increased, whereas their protein levels were significantly decreased.

      Several factors may explain this discrepancy: (1) differences in analytical methods between transcriptomics and proteomics; (2) temporal mismatches arising from the dynamic changes in mRNA and protein expression during batch culture; and (3) the possibility that, under protein burden conditions, specific regulatory mechanisms may govern the selective translation or targeted degradation of certain proteins. However, at this point, we were unable to clearly determine which of these factors account for the observed differences.

      For this reason, we did not originally include a global transcriptome–proteome comparison in the manuscript. In response to the reviewer’s comment, however, we have now included the comparison data (new Figure 4—figure supplement 3D).

      Minor points:

      (1) The authors repeatedly state that 'mitochondrial function' is increased. This is inaccurate in two ways: first, mitochondria have multiple functions, and it should be specified which one is referred to (probably mitochondrial respiration); second, the claim is based solely on the abundance of transcripts/proteins, which may or may not reflect increased activity.

      The authors should either perform functional tests (e.g. measure oxygen consumption or extracellular acidification), or change their wording to more accurately reflect the findings.

      To more directly reflect our findings, we revised two instances of the phrase “mitochondrial function” to “mitochondrial proteins” in the manuscript. Furthermore, as described in General Response 1, we confirmed that oxygen consumption is elevated in mox-YG overexpression cells. This observation suggests that mitochondrial respiratory activity is indeed enhanced under these conditions.

      (2) Similarly, the authors state that FPs are 'not localized' (e.g. line 137). This should be specified (e.g. 'not actively sorted into cellular compartments other than the cytosol').

      As pointed out by the reviewer, we have revised the relevant sections accordingly.

      (3) In Figure 4D, some of the reporter assays don't fully recapitulate the RNAseq findings (e.g. for PHO84 and ZPS1, where mox-FS and mox-YG behave differently in the reporter assay, but not in the RNAseq data). This may stem from technical limitations given that the reporter assay relies on RFP expression which could generally be affected by protein overexpression (cf. ACT1pro in mox-FS), but it should be mentioned in the text.

      We apologize for the confusion caused by our insufficient explanation of "moxFS" in new Figure 3D (original Figure 4D). As clarified here, "moxFS" refers to a frameshift mutant in which the mRNA is transcribed but the protein is not translated due to an early frameshift mutation. This is not a functional mox protein. The behavior of this mutant is nearly identical to that of the vector control, indicating that the transcriptional response observed in this assay is not triggered by mRNA expression itself, but rather by events occurring after protein synthesis begins. Importantly, the transcriptional responses identified by RNA-seq in mox-YG overexpression cells are largely recapitulated by this reporter assay, supporting the reliability of our experimental design.

      We appreciate the reviewer’s comment, which helped us recognize the lack of clarity in our original description. In response, we have added an explanation of the FS mutation to the figure legend (new Figure 3D), and we have also expanded the description of the moxFS experimental results in the Results section.

    1. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary: Zhu et al. investigate the cellular defects in glia that result from loss of DEGS1/ifc, which encodes the dihydroceramide desaturase. Using the strength of Drosophila and its vast genetic toolkit, they find that DEGS1/ifc is mainly expressed in glia and its loss leads to profound neurodegeneration. This supports a role for DEGS1 in the developing larval brain as it safeguards proper CNS development. Loss of DEGS1/ifc leads to dihydroceramide accumulation in the CNS and induces alteration in the morphology of glial subtypes and a reduction in glial number. Cortex and ensheathing glia appeared swollen and accumulated internal membranes. Astrocyte-glia on the other hand displayed small cell bodies, reduced membrane extension and disrupted organization in the dorsal ventral nerve cord. They also found that DEGS1/ifc localizes primarily to the ER. Interestingly, the authors observed that loss of DEGS1/ifc drives ER expansion and reduces TGs and lipid droplet numbers, with no effect on PC and PE and a slight increase in PS.

      The conclusions of this paper are well supported by the data. The study could be further strengthened by a few additional controls and/or analyses.

      Strengths:

      This is an interesting study that provides new insight into the role of ceramide metabolism in neurodegeneration.

      The strength of the paper is the generation of LOF lines, the insertion of transgenes and the use of the UAS-GAL4/GAL80 system to assess the cell-autonomous effect of DEGS1/ifc loss in neurons and different glial subtypes during CNS development.

      The imaging, immunofluorescence staining and EM of the larval brain and the use of the optic lobe and the nerve cord as a readout are very robust and nicely done.

      Drosophila is a difficult model in which to perform core biochemistry and lipidomics, but the authors used the whole larvae and CNS to uncover global changes in mRNA levels related to lipogenesis and the unfolded protein responses as well as specific lipid alterations upon DEGS1/ifc loss.

      Weaknesses:

      (1) The authors performed lipidomics and RTqPCR on whole larvae and larval CNS from which it is impossible to define the cell type-specific effects. Ideally, this could be further supported by performing single cell RNAseq on larval brains to tease apart the cell-type specific effect of DEGS1/ifc loss.

      We agree that using scRNAseq or pairing FACS-sorting of individual glial subtypes with bulk RNAseq would help tease apart the cell-type specific effects of DEGS1/ifc loss on glial cells. At this time, however, this approach extends beyond the scope of the current paper and means of the lab. 

      (2) It's clear from the data that the accumulation of dihydroceramide in the ER triggers ER expansion but it remains unclear how or why this happens. Additionally, the authors assume that, because of the reduction in LD numbers, the source of fatty acids comes from the LDs. But there is no data testing this directly.

      As CERT, the protein that transports ceramide from the ER to the Golgi, is far more efficient at transporting ceramide than dihydroceramide, we speculate that dihydroceramide accumulates in the ER due to inefficient transport from the ER to the Golgi by CERT. We state this model more explicitly in the results under the subheading “Reduction of dihydroceramide synthesis suppresses the ifc CNS phenotype”.

      We agree with the point on lipid droplets. We observe a correlation, not a causation, between the reduction of lipid droplets and a large expansion of ER membrane. We have tried to clarify the text in the last paragraph of the discussion to make this point more clearly. See also the response to reviewer 2, point 3.

      (3) The authors performed a beautiful EMS screen identifying several LOF alleles in ifc. However, the authors decided to only use KO/ifcJS3. The paper could be strengthened if the authors could replicate some of the key findings in additional fly lines.

      We agree. We replicated the observed cortex glia swelling, ER expansion in cortex glia, and observed increase in neuronal cell death markers in late-third instar larvae mutant for either the ifc[js1] or ifc[js2] allele. These data are now provided as Supplementary Figure 7.

      (4) The authors use M{3xP3-RFP.attP}ZH-51D transgene as a general glial marker. However, it would be advised to show the % overlap between the glial marker and the RFP since a lot of cells are green positive but not per se RFP positive and vice versa.

      We visually reexamined the expression of the 3xP3 RFP transgene relative to FABP labeling for cortex glia, Ebony for astrocyte-like glia, and the Myr-GFP transgene driven by glial-subtype specific GAL4 driver lines for perineurial, subperineurial, and ensheathing glia. We note that RFP localizes to the nucleus cytoplasm while FABP and Ebony localize to the cytoplasm and Myr-GFP to the cell membrane. Thus, an observed lack of overlap of expression between RFP and the other markers can arise due to differential localization of the two markers in the same cells (see, for example, Fig. S2D, where Myr-GFP expression in the nuclear envelope encircles that of RFP in the nucleus). Through visual inspection of five larval-brain complexes for each glial subtype marker, we found that essentially all cortex, SPG, and ensheathing glia expressed RFP. Similarly, nearly all astrocyte-like glia also expressed RFP, but they expressed RFP at significantly lower levels than that observed for cortex, SPG, or ensheathing glia. This analysis also confirmed that most perineurial glia do not express RFP. The 3xP3 M{3xP3-RFP.attP}ZH-51D transgene then labels most glia in the Drosophila CNS. We have added text to Supplementary Figure 2 noting the above observations as to which glial cells express RFP.

      (5) The authors indicate that other 3xP3 RFP and GFP transgenes at other genomic locations also label most glia in the CNS. Do they have a preferential overlap with the different glial subtypes?

      We assessed three different types of 3xP3 RFP and GFP transgenes: M{3xP3RFP.attp} transgenes (n=4), Mi{GFP[E.3xP3]=ET1} transgenes (n=3), and Tl{GFP[3xP3.cLa]=CRIMIC.TG4} transgenes (n>6). All labeled cortex glia, but different lines exhibited differential labeling of astrocyte and ensheathing glia. These data are now included as Supplementary Figure 3.

      Reviewer #2 (Public Review):

      Summary:

      The manuscript by Zhu et al. describes phenotypes associated with the loss of the gene ifc using a Drosophila model. The authors suggest their findings are relevant to understanding the molecular underpinnings of a neurodegenerative disorder, HLD-18, which is caused by mutations in the human ortholog of ifc, DEGS1.

      The work begins with the authors describing the role for ifc during fly larval brain development, demonstrating its function in regulating developmental timing, brain size, and ventral nerve cord elongation. Further mechanistic examination revealed that loss of ifc leads to depleted cellular ceramide levels as well as dihydroceramide accumulation, eventually causing defects in ER morphology and function. Importantly, the authors showed that ifc is predominantly expressed in glia and is critical for maintaining appropriate glial cell numbers and morphology. Many of the key phenotypes caused by the loss of fly ifc can be rescued by overexpression of human DEGS1 in glia, demonstrating the conserved nature of these proteins as well as the pathways they regulate. Interestingly, the authors discovered a loss of lipid droplet formation within the cortex glia of ifc mutant larvae, presumably driving the deficits in glial wrapping around axons and subsequent neurodegeneration, potentially shedding light on mechanisms of HLD-18 and related disorders.

      Strengths:

      Overall, the manuscript is thorough in its analysis of ifc function and mechanism. The data images are high quality, the experiments are well controlled, and the writing is clear.

      Weaknesses:

      (1) The authors clearly demonstrated a reduction in number of glia in the larval brains of ifc mutant flies. What remains unclear is whether ifc loss leads to glial apoptosis or a failure for glia to proliferate during development. The authors should distinguish between these two hypotheses using apoptotic markers and cell proliferation markers in glia.

      To address this point, we used phospho-histone H3 to assess mitotic index in the thoracic CNS of wild-type versus ifc mutant late third instar larvae and found a mild, but significant reduction in mitotic index in ifc mutant relative to wild-type nerve cords. We also assessed the ability of glial-specific expression of the potent anti-apoptotic gene p35 to rescue the observed loss of cortex glia phenotype in the thoracic region of the CNS of otherwise ifc mutant larvae and observed a clear increase in cortex glia in the presence versus the absence of glial-specific p35 expression (p < 3 × 10⁻⁴). These data are now provided as Supplementary Figure S8 in the paper and referred to on page 8.

      (2) It is surprising that human DEGS1 expression in glia rescues the noted phenotypes despite the different preference for sphingoid backbone between flies and mammals. Though human DEGS1 rescued the glial phenotypes described, can animal lethality be rescued by glial expression of human DEGS1? Are there longer-term effects of loss of ifc that cannot be compensated by the overexpression of human DEGS1 in glia (age-dependent neurodegeneration, etc.)?

      We note explicitly that while glial expression of human DEGS1 does provide rescuing activity, it only partially rescues the ifc mutant CNS phenotype in contrast to glial expression of Drosophila ifc, which fully rescues this phenotype. Thus, the relative activity of human DEGS1 is far below that of Drosophila ifc when assayed in flies. To quantify the functional difference between the two transgenes, we assessed the ability of glial expression of fly ifc or of human DEGS1 to rescue the lethality of otherwise ifc mutant larvae: Glial expression of ifc was sufficient to rescue the adult viability of 57.9% of ifc mutant flies based on expected Mendelian ratios (n=2452), whereas glial expression of DEGS1 was sufficient to rescue just 3.9% of ifc mutant flies (n=1303), uncovering a ~15-fold difference in the ability of the two transgenes to rescue the lethality of otherwise ifc mutant flies. In the absence of either transgene, no ifc mutant larvae reached adulthood (n=1030). These data are now provided in the text on page 9 of the revised manuscript. 
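      The ~15-fold figure follows directly from the two rescue percentages quoted above. A minimal sketch of that arithmetic, in which the function and variable names are illustrative and not taken from the manuscript:

```python
def fold_difference(rescue_a_pct: float, rescue_b_pct: float) -> float:
    """Return the ratio of two rescue percentages (a relative to b)."""
    if rescue_b_pct <= 0:
        raise ValueError("reference rescue percentage must be positive")
    return rescue_a_pct / rescue_b_pct


# Percentages quoted in the response: 57.9% of expected ifc mutant adults
# rescued by glial fly ifc, versus 3.9% rescued by glial human DEGS1.
ifc_vs_degs1 = fold_difference(57.9, 3.9)
print(f"{ifc_vs_degs1:.1f}-fold")
```

      The ratio rounds to 15, consistent with the stated difference between the two transgenes.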

      (3) The mechanistic link between the loss of ifc and lipid droplet defects is missing. How do defects in ceramide metabolism alter triglyceride utilization and storage? While the author's argument that the loss of lipid droplets in larval glia will lead to defects in neuronal ensheathment, a discussion of how this is linked to ceramides needs to be added.

      We have revised the text to address this point. We speculate that the apparent increased demand for membrane phospholipid synthesis may drive the depletion of lipid droplets, providing a link to ifc function and ceramides. Below we provide the rewritten last paragraph; the underlined section is the new text.  

      “The expansion of ER membranes coupled with loss of lipid droplets in ifc mutant larvae suggests that the apparent demand for increased membrane phospholipid synthesis may drive lipid droplet depletion, as lipid droplet catabolism can release free fatty acids to serve as substrates for lipid synthesis. At some point, the depletion of lipid droplets, and perhaps free fatty acids as well, would be expected to exhaust the ability of cortex glia to produce additional membrane phospholipids required for fully enwrapping neuronal cell bodies. Under wild-type conditions, many lipid droplets are present in cortex glia during the rapid phase of neurogenesis that occurs in larvae. During this phase, lipid droplets likely support the ability of cortex glia to generate large quantities of membrane lipids to drive membrane growth needed to ensheathe newly born neurons. Supporting this idea, lipid droplets disappear in the adult Drosophila CNS when neurogenesis is complete and cortex glia remodeling stops. We speculate that lipid droplet loss in ifc mutant larvae contributes to the inability of cortex glia to enwrap neuronal cell bodies. Prior work on lipid droplets in flies has focused on stress-induced lipid droplets generated in glia and their protective or deleterious roles in the nervous system. Work in mice and humans has found that more lipid droplets are often associated with the pathogenesis of neurodegenerative diseases, but our work correlates lipid droplet loss with CNS defects. In the future, it will be important to determine how lipid droplets impact nervous system development and disease.”

      (4) On page 10, the authors use the words "strong" and "weak" to describe where ifc is expressed. Since the use of T2A-GAL4 alleles in examining gene expression is unable to delineate the amount of gene expression from a locus, the terms "broad" and "sparse" labeling (or similar terms) should be used instead.

      The ifc T2A-GAL4 insert in the ifc locus reports on the transcription of the gene. We agree that the GAL4 system will not reflect differences in the amount of gene expression when the expression levels are not dramatically different. However, when the expression levels differ dramatically, as in our case, the GAL4 system can reflect this difference in the expression of a reporter gene. We reworded this section to suggest that ifc is transcribed at higher levels in glia as compared to neurons. We can’t use sparse or broad, as ifc is expressed in all, or at least in most, glia and neurons. The new text is as follows: “Using this approach, we observed strong nRFP expression in all glial cells (Figures 4D and S10A) and modest nRFP expression in all neurons (Figures 4E and S10B), suggesting ifc is transcribed at higher levels in glial cells than neurons in the larval CNS.”

      Reviewer #3 (Public Review):

      Summary:

      In this manuscript, the authors report three novel ifc alleles: ifc[js1], ifc[js2], and ifc[js3]. ifc[js1] and ifc[js2] encode missense mutations, V276D and G257S, respectively. ifc[js3] encodes a nonsense mutation, W162*. These alleles exhibit multiple phenotypes, including delayed progression to the late-third larval instar stage, reduced brain size, elongation of the ventral nerve cord, axonal swelling, and lethality during late larval or early pupal stages.

      Further characterization of these alleles by the authors reveals that ifc is predominantly expressed in glia and localizes to the endoplasmic reticulum (ER). Expression of the ifc gene governs glial morphology and survival. Expression of fly ifc cDNA or human DEGS1 cDNA specifically in glia, but not neurons, rescues the CNS phenotypes of ifc mutants, indicating a crucial role for ifc in glial cells and its evolutionary conservation. Loss of ifc results in ER expansion and loss of lipid droplets in cortex glia. Additionally, loss of ifc leads to ceramide depletion and accumulation of dihydroceramide. Moreover, it increases the saturation levels of triacylglycerols and membrane phospholipids. Finally, the reduction of dihydroceramide synthesis suppresses the CNS phenotypes associated with ifc mutations, indicating the key role of dihydroceramide in causing ifc LOF defects.

      Strengths:

      This manuscript unveils several intriguing and novel phenotypes of ifc loss-of-function in glia. The experiments are meticulously planned and executed, with the data strongly supporting their conclusions.

      Weaknesses:

      I didn't find any obvious weakness.

      Reviewer #1 (Recommendations For The Authors):

      Additional minor comments below:

      (1) The authors state that TGs are the building blocks of membrane phospholipids. This is not exactly true. The breakdown of TGs can result in free FAs which can be used for membrane phospholipid synthesis. Also, membrane phospholipids can also be generated from free FAs that were never in TGs.

      To address this point, we have reworked a number of sentences in the text. On page 12 we reworded two small sections to the following: 

      “In the CNS, lipid droplets form primarily in cortex glia[29] and are thought to contribute to membrane lipid synthesis through their catabolism into free fatty acids versus acting as an energy source in the brain.[41] Consistent with the possibility that increased membrane lipid synthesis drives lipid droplet reduction, RNA-seq assays of dissected nerve cords revealed that loss of ifc drove transcriptional upregulation of genes that promote membrane lipid biogenesis”

      As TG breakdown results in free fatty acids that can be used for membrane phospholipid synthesis, we asked if changes in TG levels and saturation were reflected in the levels or saturation of the membrane phospholipids phosphatidylcholine (PC), phosphatidylethanolamine (PE), and phosphatidylserine (PS).

      (2) Figure 5J what does the dotted line indicate? Please specify in the figure legend or remove it.

      We have added the following text in the figure legend: Dotted line indicates a log2 fold change of 0.5 in the treatment group compared to the control group.
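      For readers less used to log-scale thresholds, a log2 fold change of 0.5 corresponds to roughly a 1.41-fold increase over control. A minimal sketch of the conversion, with hypothetical helper names that are not part of the manuscript's analysis:

```python
import math


def log2fc_to_fold(log2fc: float) -> float:
    """Convert a log2 fold change into a linear fold change."""
    return 2.0 ** log2fc


def fold_to_log2fc(treatment: float, control: float) -> float:
    """Compute the log2 fold change of treatment relative to control."""
    if treatment <= 0 or control <= 0:
        raise ValueError("expression values must be positive")
    return math.log2(treatment / control)


# The dotted-line threshold described in the figure legend.
threshold_fold = log2fc_to_fold(0.5)
print(f"log2FC 0.5 = {threshold_fold:.2f}-fold")
```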

      (3) The text for your graphs is hard to read. Please make the font larger.

      We have increased font size to enhance the readability of the figures.

      (4) The authors mentioned that driving ifc expression in neurons rescues the phenotypes (ref 17). While the glial-specific role presented in this study is robust, I think some readers would appreciate some discussion of this study in light of the data presented here.

      We have added the below text on page 10 to address this point.

      “Results of our gene rescue experiments conflict with a prior study on ifc in which expression of ifc in neurons was found to rescue the ifc phenotype. In this context, we note that elav-GAL4 drives UAS-linked transgene expression not just in neurons, but also in glia at appreciable levels, and thus needs to be paired with repo-GAL80 to restrict GAL4-mediated gene expression to neurons. Thus, “off-target” expression in glial cells may account for the discrepant results. It is, however, more difficult to reconcile how neuronal or glial expression of ifc would rescue the observed lethality of the ifc-KO chromosome given the presence of additional lethal mutations in the 21E2 region of the second chromosome.”

      (5) While the analysis of fatty acid saturation is experimentally well done, I'm not really sure what the significance of this data is.

      We included this information as a reference for future analysis of additional genes in the ceramide biogenesis pathway, as we expect that alteration of the levels and saturation levels of PE, PC, and PS in cell membranes may underlie key changes in the biophysical properties of glial cell membranes and their ability to enwrap or infiltrate their targets. Thus, we expect the significance of these data to grow as more work is done on additional members of the ceramide pathway in the nervous system in flies and other systems.  

      Reviewer #2 (Recommendations For The Authors):

      (1) There is a typo at the top of page 11: "internal membranes and fail enwrap neurons" is missing the word "to" before "enwrap"

      The typo was fixed.

      (2)  PMID: 36718090 should be included in the discussion of SPT and ORMDL complex in human disease.

      The reference was added.

      Reviewer #3 (Recommendations For The Authors):

      In this manuscript, the authors report three novel ifc alleles: ifc[js1], ifc[js2], and ifc[js3]. ifc[js1] and ifc[js2] encode missense mutations, V276D and G257S, respectively. ifc[js3] encodes a nonsense mutation, W162*. These alleles exhibit multiple phenotypes, including delayed progression to the late-third larval instar stage, reduced brain size, elongation of the ventral nerve cord, axonal swelling, and lethality during late larval or early pupal stages.

      Further characterization of these alleles by the authors reveals that ifc is predominantly expressed in glia and localizes to the endoplasmic reticulum (ER). Expression of the ifc gene governs glial morphology and survival. Expression of fly ifc cDNA or human DEGS1 cDNA specifically in glia, but not neurons, rescues the CNS phenotypes of ifc mutants, indicating a crucial role for ifc in glial cells and its evolutionary conservation. Loss of ifc results in ER expansion and loss of lipid droplets in cortex glia. Additionally, loss of ifc leads to ceramide depletion and accumulation of dihydroceramide. Moreover, it increases the saturation levels of triacylglycerols and membrane phospholipids. Finally, the reduction of dihydroceramide synthesis suppresses the CNS phenotypes associated with ifc mutations, indicating the key role of dihydroceramide in causing ifc LOF defects.

      In summary, this manuscript unveils several intriguing and novel phenotypes of ifc loss-of-function in glia. The experiments are meticulously planned and executed, with the data strongly supporting their conclusions. I have no additional comments and fully support the publication of this manuscript in eLife.

      The authors also note that they added one paragraph to the discussion that addresses the possibility that the increased detection of cell death markers could arise due to the inability of glial cells to remove cellular debris. The text of this paragraph is provided below:

      We note that cortex glia are the major phagocytic cell of the CNS and phagocytose neurons targeted for apoptosis as part of the normal developmental process.[23-26] Thus, while we favor the model that ifc triggers neuronal cell death due to glial dysfunction, it is also possible that increased detection of dying neurons arises due, at least in part, to a decreased ability of cortex glia to clear dying neurons from the CNS. At present, the large number of neurons that undergo developmentally programmed cell death combined with the significant disruption to brain and ventral nerve cord morphology caused by loss of ifc function render this question difficult to address. Additional evidence does, however, support the idea that loss of ifc function drives excess neuronal cell death: clonal analysis in the fly eye reveals that loss of ifc drives photoreceptor neuron degeneration[17], indicating that loss of ifc function drives neuronal cell death; cortex-glia specific depletion of CPES, which acts downstream of ifc, disrupts neuronal function and induces photosensitive epilepsy in flies[59], indicating that genes in the ceramide pathway can act nonautonomously in glia to regulate neuronal function; recent genetic studies reveal that other glial cells can compensate for impaired cortex glial cell function by phagocytosing dying neurons[62], and we observe that the cell membranes of subperineurial glia enwrap dying neurons in ifc mutant larvae (Fig. S14), consistent with similar compensation occurring in this background; and in humans, loss of function mutations in DEGS1 cause neurodegeneration.[7-9] Clearly, future work is required to address this question for ifc/DEGS1 and perhaps other members of the ceramide biogenesis pathway.

    1. Author response:

      We would like to thank the three reviewers for the careful review and thoughtful comments on our manuscript. In addition to providing useful suggestions, they uncovered some embarrassing oversights on our part, related to experimental details including number of embryos, and quantification of variance in the observed changes for some of the experiments, which were inadvertently omitted in the submission. We provide below an initial response to the reviewers’ public reviews and expect to submit a revised manuscript comprehensively addressing all their concerns.

      I would like to start by addressing some of their most critical comments related to validation of the tools used to reduce soxB1 gene family function in the embryo. In the absence of the critical supplementary data that we inadvertently failed to include, the reviewers were left with an understandable but, we feel, erroneous impression that there was insufficient validation of mutant and knockdown tools.

      Reviewer #2 says “The sox2y589 mutant line is not properly verified in this manuscript, which could be done by examining ant-Sox2 antibody labeling, Western blot analysis or…”

      This validation, which had been performed previously both with antibody staining and with western blot analysis, was inadvertently omitted from the supplementary data submitted with the paper. The western blot data is shown here.

      Author response image 1.

      Validation of sox2 mutant phenotype with Western blot.

      Lysates were prepared from 25 embryos selected as wild type or potentially mutant based on the “loss of L1” phenotype at 6 dpf. This polyclonal antibody recognizes an epitope within the last 16 amino acids of the C-terminus.

      Author response image 2.

      Validation of sox2 mutant phenotype with antibody staining.

      Though in this experiment there was considerable background in the red channel, and it shows the lateral line nerve, loss of nuclear Sox2 expression is evident in the deposited neuromast of an embryo identified as a mutant based on its delayed deposition of the L1 neuromast.

      This data and a repeat of the antibody staining showing the primordium with loss of Sox2 will be included in a revised manuscript.

      Furthermore, Reviewer #2 comments “the authors show that the anti-Sox2 and antiSox3 antibody labeling is reduced but not absent in sox2 MO1 and sox3 MO-injected embryos, but do not show antibody labeling of the sox2 MO and sox3 MO-double injected embryos to determine if there is an additional knockdown”

      This will be included in a revised manuscript.

      Reviewer #2:

      The authors acknowledge that the sox2 MO1 used in this manuscript also alters sox3 function, but do not redo the experiments with a specific sox2 MO

      This is not exactly true. Having discovered that sox2 MO1 simultaneously reduces sox2 and sox3 function, three new morpholinos were obtained based on another paper (Kamachi et al 2008), which had quantitatively assessed efficacy of three sox2 specific morpholinos (sox2 MO2, sox2 MO3, and sox2 MO4). The effects of these morpholinos on the pattern of L1 deposition were compared to that of sox2 MO1. This comparison was shown in supplementary Figure 2 and is included below. It shows that the sox2 specific morpholinos resulted in a poorly penetrant delay in deposition of L1, comparable to that of a sox2 mutant, which was quantified in supplementary Figure 3B. The observations with these three sox2 specific morpholinos independently supported the observations made with the sox2 mutant that reduction of sox2 on its own results in a delay in deposition of the first neuromast with low penetrance and that to effectively examine the role of these SoxB1 genes in the primordium their function needs to be compromised in a combinatorial manner. This conclusion was independently supported by observations made by crossing sox1a, sox2 and sox3 mutants (Figure 3 and Supplementary Figure 3). Therefore, even though the initial use of a sox2 morpholino, which simultaneously knocks down sox3, was unintentional, it turned out to be useful. It allowed us to examine effects of knocking down sox2 and sox3 with a single morpholino. Furthermore, though this project was initiated more than 15 years ago to specifically understand sox2 function, our focus had shifted to understanding the role of soxB1 family members sox1a, sox2 and sox3 functioning together as an interacting system that regulates Wnt activity in the primordium. Considering this broader focus, reflected in the title of the paper, it was not a priority to repeat every experiment previously done with the sox2MO1 with the new sox2 specific morpholinos. Instead, having acknowledged the “limitations” of sox2MO1, we used it to better understand effects of combinatorial reduction of SoxB1 function.

      Reviewer #1:

      It is not exactly clear what underlies the apparent redundancy. It would be helpful if the soxb gene family member expression was reported after loss of each.

      As suggested by reviewer #1, we had previously looked at changes in expression of each of the soxB1 factors following loss of individual soxB1 factors but had not included this in the supplementary data with the original submission. Apart from a reproducible and consistent expansion of sox1a expression into the trailing zone following loss of sox2 function, which is reported in the paper and quantified here, where 10/10 mutant embryos showed the expansion (compare region within bracket in WT and sox2<sup>-/-</sup>), no consistent changes in the expression of other soxB1 family members were observed that might account for compensation when function of a particular soxB1 factor is lost. The data shown above together with more extensive quantification of changes will be included in a revised version of the manuscript. At this time the only consistent change was the expansion of sox1a to the trailing zone when sox2 function is lost. This change reflects dependence of sox1a on Wnt activity and the fact that Wnt activity expands into the trailing zone when sox2 function is lost.

      Author response image 3.

      Reviewer #3:

      Given that the expression patterns of Sox1a and Sox3 are not merely different but are largely reciprocal, the mechanistic basis of their very similar double mutant phenotypes with Sox2 remains opaque.

      The simplest way to think about compensation for gene function in a network is to think of it being determined by expression of a homolog or another gene with a similar function being expressed in a similar or overlapping domain. However, it is more useful to think of Sox2 function in the primordium as part of an interacting network of SoxB1 factors whose differential regulatory mechanisms create a robust system that simultaneously regulates two key aspects of Wnt activity in the primordium: how high Wnt activity is allowed to get in the leading zone and how effectively it is shut off to facilitate protoneuromast maturation in the trailing zone. These features of Wnt activity influence both when and where nascent protoneuromasts will form in the wake of a progressively shrinking Wnt system and where they undergo effective maturation and stabilization prior to deposition. Changes in individual SoxB1 expression patterns provide some hints about how some SoxB1 factors may compensate when function of one or more of these factors is compromised. However, a deeper understanding of robustness and “compensation” will require a systems level understanding of this gene regulatory network with computational models, which we are currently working on in our group. It remains possible, for example, that how far into the trailing zone the Wnt activity has an influence is regulated at least in part by how high it is allowed to get in the leading zone by sox1a. Conversely, how high Wnt activity gets in the leading zone may be influenced by how effectively it is shut off in the trailing zone by sox2 and sox3, as this influences the size of the Wnt system, which in turn can influence the overall level of Wnt activity. In this manner Sox1a may cooperate with Sox2 and Sox3 to limit both how high Wnt activity is allowed to get in the primordium and to effectively shut it off in the trailing zone.

      Reviewer #3:

      Related to this, the authors discuss that Sox1a/Sox2 double knockdown produces a more severe phenotype than Sox2/Sox3 double knockdown, yet this difference is not obviously reflected in the data.

      The severity of the sox1a/sox2 double mutant phenotype compared to that of the sox2/sox3 double mutant is shown in Figure 3 K and N, and quantified in Supplementary Figure 3A. Simultaneous loss of sox2 and sox3 results in a small but relatively penetrant delay in where the first stable neuromast is deposited (Figure 3 N). By contrast, loss of sox2 and sox1a together consistently results in a longer delay in deposition of the first stable neuromast (Figure 3 K). A new graph, shown below, which will be incorporated in the revised paper, shows that there is a significant difference in the pattern of L1 deposition in sox1a<sup>-/-</sup>, sox2<sup>-/-</sup> and sox2<sup>-/-</sup>, sox3<sup>-/-</sup> double mutants.

      Author response image 4.

      All 3 datasets were found to be normally distributed by the Shapiro-Wilk test. One-way ANOVA showed significance (p<0.0001), with Tukey’s multiple comparisons test showing a significant difference between all 3 conditions. (***p=0.0008, ****p<0.0001)

      Reviewer #1:

      It would be good to more clearly state why sox3 is not regulated by Wnt given its expression is inhibited by the delta TCF construct (Figure 2M).

      The explanation for why we believe sox3 expression is determined by Fgf signaling, and not Wnt activity, requires integrating what is observed both with induction of the delta TCF construct and the dominant negative Fgf receptor (DN FgfR). Loss of sox3 expression with induced expression of the delta TCF construct could result from loss of Wnt activity or the downstream loss of Fgf activity, which is ultimately dependent on Fgfs secreted by Wnt active cells in the leading domain. Distinguishing between these possibilities is based on inhibition of FGF signaling with the DN FgfR, described in the next paragraph. Heat shock-induced expression of DN FgfR results in loss of FGF signaling and the simultaneous expansion of Wnt activity into the trailing zone. As explained in the original text, loss of sox3 expression in this context, rather than its expansion, suggests its expression is determined by Fgf signaling not Wnt activity. We will emphasize that its loss, rather than its expansion, following induction of DN FgfR, indicates its expression is determined by Fgf signaling not Wnt activity.

      Reviewer #2:

      The manuscript lacks quantification of many of the experiments, making it difficult to conclude their significance.

      One of the biggest inadvertent omissions of the paper was the inadequate quantification of some of the results. Quantification of results with considerable variation in the outcome, like the pattern of L1 deposition, was provided following manipulations where various combinations of sox1a, sox2, and sox3 function were lost (Figure 3, supplementary Figures 2 and 3) or where sox2MO1/sox3MO was used with or without IWR (Figure 5 and Figure 6). However, numbers for the experiments in Figure 2 were omitted in the Figure legend, where typically about 10 embryos for each manipulation were photographed, scored, and a representative image was used to make the figure. In these experiments there was a very consistent result with 100% of the embryos showing changes represented by each panel in Figure 2. The only exception was Figure 2Y where 9/10 embryos showed the described change. Similarly in Figure 4 there was a consistent result and 100% of embryos showed the change shown. Numbers and statistics for these results will be included in a revised manuscript.

      Reviewer #2:

      The statistical analysis in Figure 5 and Supplementary Figures 2 and 3 should be one-way ANOVA or Kruskal-Wallis with a Dunn's multiple comparisons test rather than pair-wise comparisons.

      The analysis has been re-done following the reviewer’s suggestions. The analysis confirms the primary conclusions of the original submission, and this analysis will be incorporated in a revised manuscript. However, to improve the power of the analysis, experiments with low numbers of embryos will be repeated.

      See redone graphs in Figure 5 and supplementary Figures 2 and 3.
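      As a rough illustration of the reviewer's statistical point (not the authors' actual analysis), the sketch below computes the omnibus one-way ANOVA F statistic across three groups in plain Python, which is the step that precedes any multiple-comparisons test. The function name `one_way_anova_f` and the three small samples are hypothetical and chosen only for readability; they are not data from the manuscript.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical measurements (arbitrary units) for three conditions.
wild_type = [1.0, 2.0, 3.0]
double_mutant_a = [2.0, 3.0, 4.0]
double_mutant_b = [5.0, 6.0, 7.0]

f_stat, df_b, df_w = one_way_anova_f([wild_type, double_mutant_a, double_mutant_b])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

      A single omnibus test of this kind controls the error rate across all groups at once, which is why it is preferred over a series of uncorrected pair-wise comparisons. In practice a statistics package would also supply the p-value and the non-parametric alternative, for example `scipy.stats.f_oneway` and `scipy.stats.kruskal`.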

    1. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The authors revisit the specific domains/signals required for the redirection of an inner nuclear membrane protein, emerin, to the secretory pathway. They find that epitope tagging influences protein fate, serving as a cautionary tale for how different visualisation methods are used. Multiple tags and lines of evidence are used, providing solid evidence for the altered fate of different constructs.

      Strengths:

      This is a thorough dissection of domains and properties that confer INM retention vs secretion to the PM/lysosome, and will serve the community well as a caution regarding the placement of tags and how this influences protein fate.

      Weaknesses:

      Biogenesis pathways are not explored experimentally: it would be interesting to know if the lysosomal pool arrives there via the secretory pathway (eg by engineering a glycosylation site into the lumenal domain) or by autophagy, where failed insertion products may accumulate in the cytoplasm and be degraded directly from cytoplasmic inclusions.

      This manuscript is a Research Advance that follows previous work that we published in eLife on this topic (Buchwalter et al., eLife 2019; PMID 31599721). In that prior publication, we showed that emerin-GFP arrives at the lysosome by secretion and exposure at the PM, followed by internalization. While we state these previous findings in this manuscript, we did not explicitly restate here how we came to that conclusion. In the 2019 study, we (i) engineered in a glycosylation site, which demonstrated that emerin-GFP receives complex, Endo H-resistant N-glycans, indicating passage through the Golgi; (ii) performed cell surface labeling, which confirmed that emerin accesses the PM; and interfered with (iii) the early secretory pathway using brefeldin A and with (iv) lysosomal function using bafilomycin A1. Further, we ruled out autophagy as a major contributor to emerin trafficking by treating cells with the PI3K inhibitor KU55933, which had no effect on emerin’s lysosomal delivery.

      It would be helpful if the topology of constructs could be directly demonstrated by pulse-labelling and protease protection. It's possible that there are mixed pools of both topologies that might complicate interpretation.

      We demonstrate that emerin’s TMD inserts in a tail-anchored orientation (C terminus in ER lumen) by appending a GFP tag to either the N or C terminus, followed by anti-GFP antibody labeling of unpermeabilized cells (Fig. 1G). This shows the preferred topology of emerin’s wild type TMD.

      As the reviewer points out, it is possible that our manipulations of the TMD sequence (Fig. 2D-E) alter its preferred topology of membrane insertion. We addressed this question by performing anti-GFP and anti-emerin antibody labeling of the less hydrophobic TMD mutant (EMD-TMDm-GFP) after selective permeabilization of the plasma membrane (Figure 2 supplement, panel F). If emerin biogenesis is normal, the GFP tag should face the ER lumen while the emerin antibody epitope should be cytosolic. If the fidelity of emerin’s membrane insertion is impaired, the GFP tag could be exposed to the cytosol (flipped orientation), which would be detected by anti-GFP labeling upon plasma membrane permeabilization. We find that the C-terminal GFP tag is completely inaccessible to antibody when the PM is selectively permeabilized with digitonin, but is readily detected when all intracellular membranes are permeabilized with Triton X-100. These data confirm that mutating emerin’s TMD does not disrupt the protein’s membrane topology.

      Reviewer #2 (Public review):

      In this manuscript, Mella et al. investigate the effect of GFP tagging on the localization and stability of the nuclear-localized tail-anchored (TA) protein Emerin. A previous study from this group showed that C-terminally GFP-tagged Emerin protein traffics to the plasma membrane and reaches lysosomes for degradation. It is suggested that the C-terminal tagging of tail-anchored proteins shifts their insertion from the post-translational TRC/GET pathway to the co-translational SRP-mediated pathway. The authors of this paper found that C-terminal GFP tagging causes Emerin to localize to the plasma membrane and eventually reach lysosomes. They investigated the mechanism by which Emerin-GFP moves to the secretory pathway. By manipulating the cytosolic domain and the hydrophobicity of the transmembrane domain (TMD), the authors identify that an ER retention sequence and strong TMD hydrophobicity contribute to Emerin trafficking to the secretory pathway. Overall, the data are solid, and the knowledge will be useful to the field. However, the authors do not fully answer the question of why C-terminally GFP-tagged Emerin moves to the secretory pathway. Importantly, the authors did not consider the possible roles of GFP in the ER lumen influencing Emerin trafficking to the secretory pathway.

      Reviewer #2 (Recommendations for the authors):

      Major concerns:

      (1) The authors suggest that an ER retention sequence and high hydrophobicity of Emerin TMD contribute to its trafficking to the secretory pathway. However, these two features are also present in WT Emerin, which correctly localizes to the inner nuclear membrane. Additionally, the authors show that the ER retention sequence is normally obscured by the LEM domain. The key difference between WT Emerin and Emerin-GFP is the presence of GFP in the ER lumen. The authors missed investigating the role of GFP in the ER lumen in influencing Emerin trafficking to the secretory pathway. It is likely that COPII carrier vesicles capture GFP protein in the lumen as part of the bulk flow mechanism for transport to the Golgi compartment. The authors could easily test this by appending a KDEL sequence to the C-terminus of GFP; this should now redirect the protein to the nucleus.

      We agree with the reviewer’s point that the presence of lumenal GFP somehow promotes secretion of emerin from the ER, likely at the stage of enhancing its packaging into COPII vesicles. We struggle to think about how to interpret the KDEL tagging experiment that the reviewer proposes, as the KDEL receptor predominantly recycles soluble proteins from the Golgi to the ER, while emerin is a membrane protein; and we have shown that emerin already contains a putative COPI-interacting RRR recycling motif in its cytosolic domain.

      Nevertheless, we agree with the reviewer that it is worthwhile to test the possibility that addition of GFP to emerin’s C-terminus promotes capture by COPII vesicles. We have evaluated this question by performing temperature block experiments to cause cargo accumulation within stalled COPII-coated ER exit sites, then comparing the propensity of various untagged and tagged emerin variants to enrich in ER exit sites as judged by colocalization with the COPII subunit Sec31a. These data now appear in Figure 4 supplement 1. These experiments indicate that emerin-GFP samples ER exit sites significantly more than does untagged emerin. Further, the ER exit site enrichment of emerin-GFP is dampened by shortening emerin’s TMD. We do not see further enrichment of any emerin variant in ER exit sites when COPII vesicle budding is stalled by low temperature incubation, implying that emerin lacks any positive sorting signals that direct its selective enrichment in COPII vesicles. Altogether, these data indicate that both emerin’s long and hydrophobic TMD and the addition of a lumenal GFP tag increase emerin’s propensity to sample ER exit sites and undergo non-selective, “bulk flow” ER export.

      (2) The authors nicely demonstrate that the hydrophobicity of Emerin TMD plays a role in its secretory trafficking. I wonder if this feature may be beneficial for cells to degrade newly synthesized Emerin via the lysosomal pathway during mitosis, as the nuclear envelope breakdown may prevent the correct localization of newly synthesized Emerin. The authors could test Emerin localization during mitosis. Such findings could add to the physiological significance of their findings. At the minimum, they should discuss this possibility.

      We thank the reviewer for this insightful suggestion. It is attractive to speculate that secretory trafficking might enable lysosomal degradation of emerin during mitosis, when its lamin anchor has been depolymerized. However, we think it is unlikely that mitotic trafficking contributes significantly to the turnover flux of untagged emerin; if it did, we would expect to see higher steady state levels and/or slowed turnover of emerin mutants that cannot traffic to the lysosome. We did not observe this outcome. Instead, mutations that enhance (RA) or impair (TMDm) emerin trafficking had no effect on the untagged protein’s steady-state levels (Fig. 4G).

      Minor concerns:

      (1) On page 7, the authors note that "FLAG-RA construct was not poorly expressed relative to WR, in contrast with RA-GFP (Figures S3C, 2I)." The expression levels of these proteins cannot be compared across two different blots.

      We apologize for this confusion; we were implying two distinct comparisons to internal controls present on each blot. We have adjusted the text to read “FLAG-RA construct was not poorly expressed relative to FLAG-WT (Fig. S3C) in contrast to RA-GFP compared to WT-GFP (Fig. 2I).”

      (2) In the first paragraph of the discussion, the authors suggest that aromatic amino acids facilitate trafficking to lysosomes. However, they only replaced aromatic amino acids with alanine residues. If they want to make this claim, they should test other amino acids, particularly hydrophobic amino acids such as leucine.

      The reviewer may be inferring more import from our statement than we intended. We focused on these aromatic residues within the TMD because they contribute strongly to its overall hydrophobicity. Experimentally, we determined that nonconservative alanine substitutions of these aromatic residues inhibited trafficking. We do not state and do not intend to imply that the aromatic character of these residues specifically influences trafficking propensity, and we agree with the reviewer that to test such a question would require additional substitutions with non-aromatic hydrophobic amino acids.

      We realize that our phrasing may have been misleading by opening with discussion of the aromatic amino acids; in the revised discussion paragraph, we instead lead with discussion of TMD hydrophobicity, and then state how the specific substitutions we made affect trafficking.

      Reviewing Editor comments:

      While reviewer 1 did not provide any recommendations to the authors, I agree with this reviewer that the authors should validate the topology of their tagged proteins (at least for the one used to draw key conclusions). Given that Emerin is a tail-anchored protein, having a big GFP tag at the C-terminus could mess up ER insertion, causing the protein to take a wrong topology or even be mislocalized in the cytosol, particularly under overexpression conditions. In either case, it can be subject to quality control-dependent clearance via either autophagy, ERphagy, or ER-to-lysosome trafficking. I think that the authors should try a few straightforward experiments such as brefeldin A treatment or dominant negative Sar1 expression to test whether blocking conventional ER-to-Golgi trafficking affects lysosomal delivery of Emerin. I also think that the authors should discuss their findings in the context of the RESET pathway reported previously (PMID: 25083867). The ER stress-dependent trafficking of tagged Emerin to the PM and lysosomes appears to follow a similar trafficking pattern as RESET, although the authors did not demonstrate that Emerin traffics to lysosomes via the PM. In this regard, they should tone down their conclusion and discuss their findings in the context of the RESET pathway, which could serve as a model for their substrate.

      We agree that validating the topology of TMD mutants is important, and now include these experiments in the revised manuscript (please see our response to Reviewer 1 above).

      Please see our response to Reviewer 1’s public review; we previously determined that emerin-GFP undergoes ER-to-Golgi trafficking (see our 2019 study).

      We recognize the major parallels between our findings and the RESET pathway. In our 2019 study, we found that similarly to other RESET cargoes, emerin-GFP travels through the secretory pathway, is exposed at the PM, and is then internalized and delivered to lysosomes. We discussed these strong parallels to RESET in our 2019 study. In this revised manuscript, we now also point out the parallels between emerin trafficking and RESET and cite the 2014 study by Satpute-Krishnan and colleagues (PMID 25083867).

    1. Reviewer #2 (Public review):

      In 'Developmental constraints mediate the summer solstice reversal of climate effects on European beech bud set', Rebindaine and co-authors report on two experiments on Fagus sylvatica where they manipulated temperatures of saplings between day and night and at different times of year. I enjoyed reading this paper and found it well written. I think the experiments are interesting, but I found the exact methods somewhat extreme compared to how the authors present them. Further, given that much of the experiment happened outside, I am not sure how much we can generalize from one year for each experiment, especially when conducted on one population of one species. I next expand briefly on these concerns and a few others.

      Concerns:

      (1) As I read the Results, I was surprised the authors did not give more information on the methods here. For example, they refer to the 'effect of July cooling' but never say what the cooling was. Once I read the methods, I feared they were burying this as the methods feel quite extreme given the framing of the paper. The paper is framed as explaining observational results of natural systems, but the treatments are not natural for any system in Europe that I have worked in. For example, a low of 2 °C at night and 7 °C during the day through the end of May and then 7/13 °C in July is extreme. I think these methods need to be clearly laid out for the reader so they can judge what to make of the experiment before they see the results.

      (2) I also think the control is confounded with the growth chamber experience in Experiment 1. That is, the control plants never experience any time in a chamber, but all the treatments include significant time in a chamber. The authors mention how detrimental chamber time can be to saplings (indeed, they mention an aphid problem in experiment 2), so I think they need to be more upfront about this. The study is still very valuable, but again, we may need to be more cautious in how much we infer from the results.

      (3) I suggest the authors add a figure to explain their experiments, as they are very hard to follow. Perhaps this could be added to Figure 1?

      (4) Given how much the authors extrapolate to carbon and forests, I would have liked to see some metrics related to carbon assimilation, versus just information on timing.

      (5) Fagus sylvatica is an extremely important tree to European forests, but it also has outlier responses to photoperiod and other cues (and leafs out very late), so using just this species to then state 'our results likely are generalisable across temperate tree species' seems questionable at best.

      (6) Another concern relates to measuring the end of season (EOS). It is well known that different parts of plants shut down at different times, and each metric of end of season - budset, end of radial expansion, leaf coloring, etc - relates to different things. Thus, I was surprised that the authors ignore all this complexity and seem to equate leaf coloring with budset (which can happen MONTHS before leaf coloring often) and with other metrics. The paper needs a much better connection to the physiology of end of season and a better explanation for the focus on budset. Relatedly, I was surprised that the authors cite almost none of the literature on budset, which generally suggests it is heavily controlled by photoperiod and population-level differences in photoperiod cues, meaning results may be different with a different population of plants.

      (7) I didn't fully see how the authors' results support the Solstice as Switch hypothesis, since what timing mattered seemed to depend on the timing of treatment and was not clearly related to the solstice. Could it be that these results suggest the Solstice as Switch hypothesis is actually not well supported (e.g., line 135) and instead suggest that the pattern of climate in the summer months affects end-of-season timing?

    1. Author response:

      The following is the authors’ response to the original reviews.

      We would like to thank you and the reviewers for valuable feedback on the first version of the manuscript. We have now addressed all of the issues raised by reviewers, mostly by implementing the suggested changes and clarifying important details in the revised version of the manuscript. A detailed response to each comment is provided in the rebuttal letter. Briefly, the main changes were as follows:

      - We changed homeostatic balance to network balance, especially when describing the main finding, as the response changes induced by the stimulation occurred on a fast timescale. We speculate that the sustained changes observed in the post-stimulation condition are the result of homeostatic mechanisms.

      - We added additional verification of the target stimulation effect by adding a supplementary result showing its effect between the target and off-target z-planes, as well as demonstrating the minimal impact of the imaging laser on rsChRmine.

      - We added a simple toy model illustrating suppression specifically applied to co-tuned cells that yields the response amplitude decrease, to further support our findings.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Kang et al. provide the first experimental insights from holographic stimulation of auditory cortex. Using stimulation of functionally-defined ensembles, they test whether overactivation of a specific subpopulation biases simultaneous and subsequent sensory-evoked network activations.

      Strengths:

      The investigators use a novel technique to investigate the sensory response properties in functionally defined cell assemblies in auditory cortex. These data provide the first evidence of how acutely perturbing specific frequency-tuned neurons impacts the tuning across a broader population.

      Weaknesses:

      I have several main concerns about the interpretation of these data:

      (1) The premise of the paper suggests that sensory responses are noisy at the level of neurons, but that population activity is reliable and that different neurons may participate in sensory coding on different trials. However, no analysis related to single trial variance or overall stability of population coding is provided. Specifically, showing that population activity is stable across trials in terms of total activity level or in some latent low dimensional representation would be required to support the concept of "homeostatic balancing".

      Thank you for raising an important point. We agree that ‘homeostatic balancing’ may not be the best term to explain the main results. We have now toned down the homeostatic plasticity aspect in explaining the main result and changed the term to a simple ‘network balance’, potentially arising from various factors including rapid synaptic plasticity. We speculate that the persistent activity of co-tuned cells in the post-stimulation session, rather than a rapid return of their responses to baseline, is a result of homeostatic balance. Relevant changes are implemented throughout the manuscript, including the Introduction (e.g., lines 76-78) and Discussion sections (e.g., lines 453-456).

      (2) Rebalancing would predict either that the responses of stimulated neurons would remain A) elevated after stimulation due to a hebbian mechanism or B) suppressed due to high activity levels on previous trials, a homeostatic mechanism. The authors report suppression in targeted neurons after stimulation blocks, but this appears similar to all other non-stimulated neurons. How do the authors interpret the post-stimulation effect in stimulated neurons?

      It is true that no post-stimulation response change was observed in either co-tuned or non co-tuned neurons, in either stimulation or control sessions. This could be because neuronal activity had already adapted and decreased sufficiently through the consecutive presentation of the acoustic stimuli themselves. However, we still think that if the response decrease of co-tuned, non-stimulated neurons were driven purely by stimulation without homeostasis, their responses should at least bounce back during the post-stimulation session. We agree that further investigation is required to confirm this effect. We elaborated on this as another discussion point in the Discussion section (lines 457-464).

      (3) The authors suggest that ACtx is different from visual cortex in that neurons with different tuning properties are intermingled. While that is true at the level of individual neurons, there is global order, as demonstrated by the authors own widefield imaging data and others at the single cell level (e.g. Tischbirek et al. 2019). Generally, distance is dismissed as a variable in the paper, but this is not convincing. Work across multiple sensory systems, including the authors own work, has demonstrated that cortical neuron connectivity is not random but varies as a function of distance (e.g. Watkins et al. 2014). Better justification is needed for the spatial pattern of neurons that were chosen for stimulation. Further, analyses that account for center of mass of stimulation, rather than just the distance from any stimulated neuron would be important to any negative result related to distance.

      Thank you for the further suggestion regarding distance. While Watkins et al. 2014 and Levy and Reyes (2012) showed stronger connectivity for nearby cells as well as for more distant patches, on a functional level Winkowski & Kanold 2013 showed high frequency heterogeneity, especially in L2/3, which we targeted for imaging in this study. Thus, connected cells can have varied tuning, consistent with spine imaging (Konnerth paper). As an additional verification, we now also calculated the distance effect based on the center of mass of the target cells, and still observed no distance-related stimulation effect. We have replaced Figure 4B with the result from the center of mass calculation.

      (4) Data curation and presentation: Broadly, the way the data were curated and plotted makes it difficult to determine how well-supported the authors claims are. In terms of curation, the removal of outliers 3 standard deviations above the mean in the analysis of stimulation effects is questionable. Given the single-cell stimulation data presented in Figure 1, the reader is led to believe that holographic stimulation is quite specific. However, the justification for removing these outliers is that there may be direct stimulation 20-30 um from the target. Without plotting and considering the outliers as well, it is difficult to understand if these outsized responses are due to strong synaptic connections with neighboring neurons or rather just direct off-target stimulation. Relatedly, data presentation is limited to the mean + SEM for almost all main effects and pre-post stimulation effects are only compared indirectly. Whether stimulation effects are driven by just a few neurons that are particularly suppressed or distinct populations which are suppressed or enhanced remains unclear.

      Thank you for pointing this out. We now specifically removed neighboring cells that are < 20 µm from the target point and observed similar results. We replaced all the relevant figures, text, and statistical results to ensure that the exclusion was specific to overlapping neighboring cells.

      Reviewer #2 (Public review):

      The goal of HiJee Kang et al. in this study is to explore the interaction between assemblies of neurons with similar pure-tone selectivity in mouse auditory cortex. Using holographic optogenetic stimulation in a small subset of target cells selective for a given pure tone (PTsel), while optically monitoring calcium activity in surrounding non-target cells, they discovered a subtle rebalancing process: co-tuned neurons that are not optogenetically stimulated tend to reduce their activity. The cortical network reacts as if an increased response to PTsel in some tuned assemblies is immediately offset by a reduction in activity in the rest of the PTsel-tuned assemblies, leaving the overall response to PTsel unchanged. The authors show that this rebalancing process affects only the responses of neurons to PTsel, not to other pure tones. They also show that assemblies of neurons that are not selective for PTsel don't participate in the rebalancing process. They conclude that assemblies of neurons with similar pure-tone selectivity must interact in some way to organize this rebalancing process, and they suggest that mechanisms based on homeostatic signaling may play a role.

      The conclusions of this paper are very interesting but some aspects of the study including methods for optogenetic stimulation, statistical analysis of the results and interpretation of the underlying mechanisms need to be clarified and extended.

      (1) This study uses an all-optical approach to excite a restricted group of neurons chosen for their functional characteristics (their frequency tuning), and simultaneously record from the entire network observable in the FOV. As stated by the authors, this approach is applied for the first time to the auditory cortex, which is a tour de force. However, such an approach is complex and requires precise controls to be convincing. In the manuscript, several methodological aspects are not sufficiently described to allow a proper understanding.

      (i) The use of ChRmine together with GCaMP8s has been reported as problematic as the 2Ph excitation of GCaMP8s also excites the opsin. Here, the authors use a red-shifted version of ChRmine to prevent such cross excitation by the imaging laser. To be convincing, they should explain how they controlled for the absence of rsChRmine activation by the 940nm light. Showing the fluorescence traces immediately after the onset of the imaging session would ensure that neurons are not excited as they are imaged.

      Thank you for pointing this out. We realized that an important reference was omitted. Kishi et al. 2022 validated the efficacy of rsChRmine relative to ChRmine: they compared ChRmine and rsChRmine activity at different wavelengths and settings and showed that rsChRmine is effective with reduced optical crosstalk. This reference is now included in the manuscript (line 98). We also checked the spontaneous baseline activity, which lasted about 10 s before any sound presentation, and observed relatively stable activity throughout rather than any activation related to imaging session onset. This is similar to what we see in another group of GCaMP6s transgenic animals.

      Author response image 1.

      Baseline fluorescence activity across cells within FOVs from AAV9-hSyn-GCaMP8s-T2A-rsChRmine injected mice (top) and CBA X Thy1-GCaMP6s F1 transgenic mice (bottom). Fluorescence levels and activity patterns remain similar, suggesting no evident imaging laser-induced activation of rsChRmine. Note that the GCaMP8s examples are smoothed with a 4-point moving average because GCaMP8s shows faster activity.

      (ii) Holographic patterns used to excite 5 cells simultaneously may be associated with out-of-focus laser hot spots. Cells located outside of the FOV could be activated, therefore engaging other cells than the targeted ones in the stimulation. This would be problematic in this study as their tuning may be unrelated to the tuning of the targeted cells. To control for such an effect, one could in principle decouple the imaging and the excitation planes, and check for the absence of out-of-focus unwanted excitation.

      We further verified whether the laser power at the targeted z-plane influences cells’ activity at nearby z-planes. As the Reviewer pointed out, the previous x- and y-axis shifts were tested by single-cell stimulation. This time, we stimulated five cells simultaneously to match the actual experimental setup and assess potential artifacts in other planes. We observed no stimulation-driven activity increase in cells at a z-plane shifted by 20 µm (Supplementary Figure 1). This confirms that the holographic stimulation accurately manipulates the pre-selected target cells and that the effects we observe are not likely due to out-of-focus stimulation artifacts. It is true that not all pre-selected cells that showed significant response changes prior to the main experiment are effectively activated on every trial during the experiments. We varied the target cell distances across FOVs, from nearby cells to those farther apart within the FOV, and did not observe a significant relationship between target cell distance and stimulation effect. Lastly, cells within 20 µm of the target were excluded to prevent potential excitation due to the holographic stimulation power. Given the spontaneous movements of the FOV during imaging sessions due to the animal’s movement, despite our efforts to minimize them, we believe that any excitation of these neighboring neurons would come directly from the stimulation rather than from the light pattern artifact itself.

      (iii) The control shown in Figure 1B is intended to demonstrate the precision of the optogenetic stimulation: when the stimulation spiral is played at a distance larger or equal to 20 µm from a cell, it does not activate it. However, in the rest of the study, the stimulation is applied with a holographic approach, targeting 5 cells simultaneously instead of just one. As the holographic pattern of light could produce out-of-focus hot spots (absent in the single cell control), we don't know what is the extent of the contamination from non-targeted cells in this case. This is important because it would determine an objective criterion to exclude non-targeted but excited cells (last paragraph of the Result section: "For the stimulation condition, we excluded non-target cells that were within 15 µm distance of the target cells...")

      Neurons that are highly sensitive to a certain frequency also show the greatest adaptation effect, which can be observed in the control condition. Therefore, the greater amplitude change in highly sensitive neurons is related first to neuronal adaptation to their preferred stimulus. However, when co-tuned target neurons are stimulated, other co-tuned non-target neurons show a significantly greater amplitude decrease compared to either stimulation of non co-tuned target neurons or control (the latter did not meet the significance level).

      We also tried a more rigorous criterion of 20 µm instead of 15 µm, as you pointed out, since the spiral size was 20 µm. This yielded a further significant stimulation-related response amplitude decrease only in co-tuned non-target neurons processing their preferred frequency.

      (2) A strength of this study comes from the design of the experimental protocol used to compare the activity in non-target co-tuned cells when the optogenetic stimulation is paired with their preferred tone versus a non-preferred pure tone. The difficulty lies in the co-occurrence of the rebalancing process and the adaptation to repeated auditory stimuli, especially when these auditory stimuli correspond to a cell's preferred pure tones. To distinguish between the two effects, the authors use a comparison with a control condition similar to the optogenetic stimulation conditions, except that the laser power is kept at 0 mW. The observed effect is shown as an extra reduction of activity in the condition with the optogenetic paired with the preferred tone, compared to the control condition. The specificity of this extra reduction when stimulation is synchronized with the preferred tone, but not with a non-preferred tone, is a potentially powerful result, as it points to an underlying mechanism that links the assemblies of cells that share the same preferred pure tones.

      The evidence for this specificity is shown in Figure 3A and 3D. However, the universality of this specificity is challenged by the fact that it is observed for 16kHz preferring cells, but not so clearly for 54kHz preferring cells: these 54kHz preferring cells also significantly (p = 0.044) reduce their response to 54kHz in the optogenetic stimulation condition applied to 16kHz preferring target cells compared to the control condition. The proposed explanation for this is the presence of many cells with a broad frequency tuning, meaning that these cells could have been categorized as 54kHz preferring cells, while they also responded significantly to a 16kHz pure tone. To account for this, the authors divide each category of pure tone cells into three subgroups with low, medium and high frequency preferences. Following the previous reasoning, one would expect at least the "high" subgroups to show a strong and significant specificity for an additional reduction only if the optogenetic stimulation is targeted to a group of cells with the same preferred frequency. Figure 3D fails to show this. The extra reduction for the "high" subgroups is significant only when the condition of opto-stimulation synchronized with the preferred frequency is compared to the control condition, but not when it is compared to the condition of opto-stimulation synchronized with the non-preferred frequency.

      Therefore, the claim that "these results indicate that the effect of holographic optogenetic stimulation depends not on the specific tuning of cells, but on the co-tuning between stimulated and non-stimulated neurons" (end of paragraph "Optogenetic holographic stimulation decreases activity in non-target co-tuned ensembles") seems somewhat exaggerated. Perhaps increasing the number of sessions in the 54kHz target cell optogenetic stimulation condition (12 FOV) to the number of sessions in the 16kHz target cell optogenetic stimulation condition (18 FOV) could help to reach significance levels consistent with this claim.

      We had previously tested this by randomly subselecting 12 FOVs from the 16kHz stimulation condition to match the number of FOVs between the two groups, and the results did not change. To further verify the results, we have now added three more datasets to the 54 kHz target cell stimulation condition (now 15 FOV), which yielded a similar outcome. We have updated the statistical values to include the added datasets.

      (3) To interpret the results of this study, the authors suggest that mechanisms based on homeostatic signaling could be important to allow the rebalancing of the activity of assemblies of co-tuned neurons. In particular, the authors try to rule out the possibility that inhibition plays a central role. Both mechanisms could produce effects on short timescales, making them potential candidates. The authors quantify the spatial distribution of the balanced non-targeted cells and show that they are not localized in the vicinity of the targeted cells. They conclude that local inhibition is unlikely to be responsible for the observed effect. This argument raises some questions. The method used to quantify spatial distribution calculates the minimum distance of a non-target cell to any target cell. If local inhibition is activated by the closest target cell, one would expect the decrease in activity to be stronger for non-target cells with a small minimum distance and to fade away for larger minimum distances. This is not what the authors observe (Figure 4B), so they reject inhibition as a plausible explanation. However, their quantification doesn't exclude the possibility that non-target cells in the minimum distance range could also be close and connected to the other 4 target cells, thus masking any inhibitory effect mediated by the closest target cell. In addition, the authors should provide a quantitative estimate of the range of local inhibition in layers 2/3 of the mouse auditory cortex to compare with the range of distances examined in this study (< 300 µm). Finally, the possibility that some target cells could be inhibitory cells themselves is considered unlikely by the authors, given the proportions of excitatory and inhibitory neurons in the upper cortical layers. On the other hand, it should be acknowledged that inhibitory cells are more electrically compact, making them easier to be activated optogenetically with low laser power.

      Minimum distance is defined as the smallest distance from a non-target cell to any of the target cells. Thus, if local inhibition were responsible, the closest target cell would likely have affected the non-target cells’ response changes. As an additional verification, based on both Reviewers’ comments, we also calculated the distance effect based on the center of mass of the target cells, and still observed no distance-related stimulation effect. The result is now updated in Figure 4B.

      Based on previous literature, such as Levy & Reyes 2012, excitatory and inhibitory connectivity is known to extend over a range of around 100 µm. Our results do not show any additional effect for cells at distances below 100 µm, which suggests that the effect is not limited to local inhibition. We also added further speculation on why our results are less likely due to increased inhibition, despite the biological characteristics of inhibitory neurons with respect to optogenetics.
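      The two distance measures discussed in this response (minimum distance to any target cell, and distance to the center of mass of the target cells) can be sketched as follows. This is only an illustrative sketch, not the authors' analysis code: the function names, the use of an unweighted centroid as the "center of mass", and the coordinate layout (one row per cell, x and y in µm) are assumptions made for the example. The 20 µm exclusion radius is the value quoted in the response.

```python
import numpy as np

def min_distance_to_targets(cells_xy, targets_xy):
    """Smallest Euclidean distance from each non-target cell to any target cell."""
    # Pairwise differences, shape (n_cells, n_targets, 2)
    diffs = cells_xy[:, None, :] - targets_xy[None, :, :]
    return np.linalg.norm(diffs, axis=2).min(axis=1)

def distance_to_target_centroid(cells_xy, targets_xy):
    """Distance from each non-target cell to the (unweighted) centroid of the targets.

    Treating the center of mass as an unweighted mean of target positions
    is an assumption of this sketch.
    """
    centroid = targets_xy.mean(axis=0)
    return np.linalg.norm(cells_xy - centroid, axis=1)

def keep_mask(cells_xy, targets_xy, exclusion_um=20.0):
    """True for non-target cells at least `exclusion_um` away from every target."""
    return min_distance_to_targets(cells_xy, targets_xy) >= exclusion_um

# Hypothetical coordinates in µm: two targets, three non-target cells
targets = np.array([[0.0, 0.0], [100.0, 0.0]])
cells = np.array([[10.0, 0.0], [50.0, 0.0], [50.0, 40.0]])
d_min = min_distance_to_targets(cells, targets)
d_com = distance_to_target_centroid(cells, targets)
mask = keep_mask(cells, targets)
```

      In this hypothetical layout the second cell sits exactly on the centroid yet is 50 µm from the nearest target, which shows why the two measures can rank cells differently.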

      Reviewer #3 (Public review):

      Summary:

      The authors optogenetically stimulate 5 neurons all preferring the same pure tone frequency (16 or 54 kHz) in the mouse auditory cortex using a holography-based single cell resolution optogenetics during sound presentation. They demonstrate that the response boosting of target neurons leads to a broad suppression of surrounding neurons, which is significantly more pronounced in neurons that have the same pure tone tuning as the target neurons. This effect is immediate and spans several hundred micrometers. This suggests that the auditory cortical network balances its activity in response to excess spikes, a phenomenon already seen in visual cortex.

      Strengths:

      The study is based on a technologically very solid approach based on single-cell resolution two-photon optogenetics. The authors demonstrate the potency and resolution of this approach. The inhibitory effects observed upon targeted stimulation are clear and the relative specificity to co-tuned neurons is statistically clear although the effect size is moderate.

      Weaknesses:

      The evaluation of the results is brief and some aspects of the observed homeostatic effect are not quantified. For example, it is unclear whether stimulation produces a net increase or decrease of population activity, or if the homeostatic phenomenon fully balances activity. A comparison of population activity for all imaged neurons with and without stimulation would be instructive. The selectivity for co-tuned neurons is significant but weak. Although it is difficult to evaluate this issue, this result may be trivial, as co-tuned neurons fire more strongly. Therefore, the net activity decrease is expected to be larger, in particular, for the number of non-co-tuned neurons which actually do not fire to the target sound. The net effect for the latter neurons will be zero just because they do not respond. The authors do not make a very strong case for a specific inhibition model in comparison to a broad and non-specific inhibitory effect. Complementary modeling work would be needed to fully establish this point.

      Thank you for raising important points. We agree that the term homeostatic balancing may have been an overstatement. We have toned down the homeostatic plasticity aspect and now attribute the result to rapid plasticity at the single-trial level. Regardless, the average activity level did not differ among stimulation conditions (control, 16kHz stim, and 54kHz stim), which suggests that the overall activity level was maintained regardless of the stimulation. We added a new figure of the global activity change as Fig. 4A.

      We also added a simple model in which a suppression term was applied either to all neurons or specifically to non-target co-tuned cells to test our results from the data.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) For the first holography paper in A1, more information is needed about how holographic stimulation was performed and how stimulation artifacts were avoided or removed from the data set, especially as the text states that the PMTs were left open for the duration of the experiment.

      We further clarified the rationale for leaving the shutter open: to avoid mechanical sounds that could activate neurons in the AC. We keep the uncaging shutter open because the Bruker default setting (software version 5.7) opens and closes the shutter on every iteration of the stimulation, which generates loud mechanical sounds and makes it unclear whether activation is due to the sound or the stimulation.

      (2) The choice of the dF/F as the primary tool for quantifying data should be better justified. Presumably, cells have very different variances in baseline activity levels and baseline fluorescence levels that create a highly skewed distribution of responses across the population. Further, a

      To take the baseline activity variances into account, we first calculate dF/F normalising to the baseline period (about 330 ms before the sound onset) right before each trial, per cell level. By doing so, we minimize any effect that could have been driven by variable baseline activity levels across neurons.
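      The per-trial, per-cell normalisation described in this response can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the 30 Hz frame rate in the example, and the choice of the window mean (rather than, say, the peak) as the evoked amplitude are assumptions. The roughly 330 ms baseline comes from the response above, and the 160 ms to 660 ms analysis window is the one Reviewer #2 cites from the Methods.

```python
import numpy as np

def dff_per_trial(trace, onset_idx, fs, baseline_s=0.33):
    """dF/F for one cell on one trial, normalised to its own pre-sound baseline."""
    n_base = max(1, int(round(baseline_s * fs)))
    if onset_idx < n_base:
        raise ValueError("not enough samples before sound onset for the baseline")
    f0 = trace[onset_idx - n_base:onset_idx].mean()  # baseline fluorescence F0
    return (trace - f0) / f0

def evoked_amplitude(dff, onset_idx, fs, window_s=(0.16, 0.66)):
    """Mean dF/F in a post-onset window (using the mean is an assumption here)."""
    start = onset_idx + int(round(window_s[0] * fs))
    stop = onset_idx + int(round(window_s[1] * fs))
    return dff[start:stop].mean()

# Hypothetical trial: 1 s of baseline at F = 100, then a step to F = 150
fs = 30.0  # assumed frame rate in Hz
trace = np.concatenate([np.full(30, 100.0), np.full(30, 150.0)])
dff = dff_per_trial(trace, onset_idx=30, fs=fs)
amp = evoked_amplitude(dff, onset_idx=30, fs=fs)
```

      Because F0 is recomputed for every trial and every cell, slow drifts in baseline fluorescence do not carry over into the evoked amplitude.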

      (3) More analysis should be performed to determine why 33% of stimulated cells are not activated, and instead are suppressed during stimulation. Is this related to a cells baseline fluorescence?

      Great point. Although we tried our best to pre-select stimulation-responsive neurons before starting the actual experiments and to head-fix the animals as stably as possible, these neurons do not remain the “best stimulation-responsive neurons” throughout the entire imaging session. There are several caveats here. First, their response to optogenetic stimulation seems to change after they are exposed to acoustic stimulation. Second, since the AC is on the temporal side, it is likely to be more affected by movements of the animal and its brain throughout the imaging session, which could be larger than in visual cortex or motor cortex. However, 33% of 5 cells is about 1.5 cells, so on average roughly one cell is missed, although in some sessions all 5 cells are stimulated while in other sessions the holographic stimulation is clearly less effective.

      We even manually visualised the fluorescence change due to the holographic stimulation before starting any imaging session. Nevertheless, the cells do not remain the ‘best stimulation responsive cells’ throughout, as we cannot control this natural biological aspect of neuronal activity. Based on the significant stimulation effects observed when presenting different pure tone frequencies, as well as when delivering different target stimulation and the no-stimulation control, we believe that the effect itself is valid. We added these caveats to the manuscript as a further discussion point and things to consider.

      (4) The linear mixed-effects model should include time as a variable as A) the authors hypothesize that responses should be reduced over time due to sensory adaptation and that B) stimulation induced suppression might be dynamic (though they find it is not).

      Since the stimulation effect seems to be independent from trial-by-trial changes among stimulation conditions (Fig. 4) and we now have toned down on the aspect of homeostasis, we kept the current mixed-effect model variables.

      (5) More speculation is needed on why stimulation suppresses responses from the first trial onwards.

      We further speculate that such rapid response changes arise from activity-dependent synaptic changes, driven by an overall shift in network energy caused by the optogenetic stimulation, that maintain the cortical circuit balance.

      (6) What does each dot represent in Figure 4a vs. Figure 4B? They are very different in number.

      In 4A, each dot is the average amplitude change for one trial. There are exactly the same number of dots across frequencies, cell groups and conditions, as each dot represents one trial (20 each). They may look different only because of some overlap between frequencies.

      In 4B, each dot is one cell. The panel for 16kHz preferring cells in the stimulation conditions is denser because it had more FOVs and thus more cells to plot. We further clarified these details in the figure legend.

      (7) How sensory responsive neurons were selected should be shown in the figures. Specifically, which fraction of the 30% of most responsive neurons were stimulated should be stated. Depending on the exact yield in the field of view, all or only a minority of strongly sensory responsive neurons are being stimulated, which in either case would color the interpretation of the data.

      We tried varying the FOV as much as possible across sessions to ensure that FOVs are directly in A1 and cover a range of frequencies. If we could not observe more than 80 sound responsive neurons in the processed suite2p data, we searched for another FOV.

      We now included an example FOV of the widefield imaging we first conducted to identify A1, and another example FOV of the 2-photon imaging where we conducted a short sound presentation session to identify the sensory responsive neurons, as an inset of the ‘Cell selection’ part in Figure 1.

      Reviewer #2 (Recommendations for the authors):

      Minor points:

      - p.4, last line: "of" probably missing "the processing the target..."

      Fixed.

      - p.5, top, end of the first paragraph of this page: Figure 3B and 3E don't show exemplar traces.

      Corrected as Figure 2A and 2D.

      - P.5, first sentence of the paragraph "Optogenetic holographic stimulation increases activity in targeted ensembles": reference to Figure 3A and 3D should rather be Figure 2A and 2D.

      Corrected.

      - P.9, 2nd paragraph: sentence with a strange syntax: "since their response amplitude..."

      Corrected.

      - Figure 2: panels C and F are missing.

      Corrected.

      - p.11, methods: "wasthen" should be "was then".

      Corrected.

      - p.12, analysis: it is not clearly explained why the sound evoked activity is computed based on the 160ms to 660ms after sound onset instead of 0ms to 660 ms. It is likely related to some potential contamination but it should be explicitly explained.

      We used this window because calcium transients are relatively slow, so it more accurately captures the sound-evoked responses. We added this detail.

      - Methods, analysis: the authors should better explain how they conducted the random permutation described in the Figures 1D, 2B and 2E. Which signals were permutated?

      We used random permutation to shuffle the target cell IDs.

      - References 55 and 56 don't explicitly state that excitatory neurons generally have stronger responses to sound than inhibitory neurons.

      Thank you for pointing out this error. We replaced those references with Maor et al. 2016 and Kerlin et al. 2010, showing excitatory neurons show more selective tuning, and also changed the wording more appropriately.

      - It is not explained whether the imaging sessions are performed on awake or anaesthetized animals. It is probably done on awake animals, but then it is not clear what procedure is used to get the animals used to the head restraint. It usually takes a few days for the mice to get used to it, and the stress level is often different at the beginning and end of an experiment. Given the experimental protocol used in the study, in which sessions are performed sequentially and compared to each other, this aspect could play a role. However, the main comparison made is probably safe as it compares a control condition (laser at 0mW) and conditions with optogenetic stimulation, all done with similar sequences of sessions.

      The experiment was conducted on awake animals. Although we did not have a control comparing their state at the beginning and the end of the experiment, they all had a widefield imaging session to identify the A1 region, which uses the same head-fixation setup, so they were more used to the setup when we conducted 2-photon imaging and stimulation. Regardless of the session, if animals show any sign of extra discomfort due to the unfamiliar setup, we keep them there for 10-15 minutes until they are accustomed to the setup with no movement. If they still show a sign of discomfort, we take them out and try again another day. We now included this detail in the manuscript.

      Reviewer #3 (Recommendations for the authors):

      - Evaluate the global effect of stimulation on the population activity averaged across all neurons (activated and non-activated).

      Thank you for your suggestions. We now included a new Figure 3A that presents the population activity across all responsive cells. The average activity level did not differ among stimulation conditions (control, 16kHz stim, and 54kHz stim).

      - Evaluate with a simple model if a population of neurons with different sound tuning receiving non-specific inhibition would not produce the observed effect.

      Thank you for the suggestion. We generated a simple model in which a suppression term was applied either to all neurons or specifically to non-target co-tuned cells to test our results from the data. We took a similar range of number of neurons and FOVs to closely simulate the model to the real dataset structure. On 50 simulated calcium traces of neurons (n),

      Trace<sub>n(t)</sub> = R<sub>n(t)</sub> – theta<sub>n</sub> + epsilon<sub>n(t)</sub>

      Where R<sub>n(t)</sub> is a response amplitude from either the baseline or the stimulation session, theta<sub>n</sub> is a suppression term applied either to all neurons or only to non-target co-tuned neurons, only during the stimulation session, and epsilon<sub>n(t)</sub> is additive noise. Theta was defined based on the average increase in activity amplitude generated in target neurons by the stimulation, taken from the real dataset with extra neuron-level jitter. As in the real data analyses, we compared the response change between the stimulation and baseline sessions’ trace amplitudes. Comparing the two model outcomes and the real data, we observed a significant effect of model type (F(2, 2535) = 34.943, p < 0.0001) and a significant interaction between model type and cell group (F(2, 2535) = 36.348, p < 0.0001). Applying suppression only to non-target co-tuned cells during the stimulation session yielded a significant response amplitude decrease for co-tuned cells compared to non co-tuned cells (F(1, 2535) = 45.62, p < 0.0001), which resembles the real data. In contrast, applying suppression to all non-target cells led to similar amplitude changes in both co-tuned and non co-tuned neurons (F(1, 2535) = 0.87, p = 0.35), which was not observed in either the real data or the simulated data restricted to co-tuned cell suppression. Therefore, the model correctly predicts that suppression applied specifically to co-tuned neurons drove the real data outcome. All of this information is now added to the Methods and Results sections, and the figure is added as Figure 3C.
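      A toy simulation in the spirit of the model above (Trace = R - theta + epsilon) might look like the following. It is a hedged sketch under stated assumptions, not the authors' code: the function name, all numeric values (response range, theta mean, jitter, noise level), the 50/50 split into co-tuned and non co-tuned cells, and the simplification of each trace to a single per-trial amplitude are hypothetical. Only the overall structure (50 neurons, 20 trials, suppression applied either to all cells or to co-tuned cells only, during the stimulation session) follows the description in the response.

```python
import numpy as np

def simulate_change(n_neurons=50, n_trials=20, theta_mean=0.2, jitter=0.05,
                    noise_sd=0.1, co_tuned_only=True, seed=0):
    """Per-neuron amplitude change (stimulation minus baseline) in the toy model."""
    rng = np.random.default_rng(seed)
    co_tuned = np.arange(n_neurons) < n_neurons // 2   # hypothetical 50/50 split
    r = rng.uniform(0.5, 1.5, size=n_neurons)           # hypothetical amplitudes R_n
    # Suppression term theta_n with neuron-level jitter
    theta = theta_mean + jitter * rng.standard_normal(n_neurons)
    if co_tuned_only:
        theta = np.where(co_tuned, theta, 0.0)          # suppress co-tuned cells only
    # Baseline session: R + noise. Stimulation session: R - theta + noise.
    baseline = r[:, None] + noise_sd * rng.standard_normal((n_neurons, n_trials))
    stim = (r - theta)[:, None] + noise_sd * rng.standard_normal((n_neurons, n_trials))
    return stim.mean(axis=1) - baseline.mean(axis=1), co_tuned

# Model A: suppression restricted to co-tuned cells
change_specific, co = simulate_change(co_tuned_only=True)
# Model B: the same suppression applied to every cell
change_global, _ = simulate_change(co_tuned_only=False)
```

      Under these assumptions, only the co-tuned-specific variant produces a difference between cell groups, while the global variant shifts both groups by a similar amount. A fuller version would also scale suppression with each cell's response, which bears on Reviewer #3's concern that weakly responding cells cannot decrease much.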

    1. Reviewer #3 (Public review):

      Summary:

      In this study, Hall and colleagues investigate how the coupling of activity from ACC to CA1 is altered by fear learning, showing that during sleep immediately before learning, there is evidence for increased coupling of ACC activity with neurons that will subsequently be inhibited during the learning process. They go on to show that this effect seems to be mediated most by a subpopulation of neurons in the superficial layer of CA1. This fits with previous reports suggesting that these superficial neurons are key for the flexible updating of memory. The authors then go on to show that artificial activation of ACC using optogenetics results in varied effects in CA1, including a subtle decrease in activity of superficial neurons that lasts longer than the stimulus itself. Finally, the authors present some preliminary data suggesting that different interneurons may be recruited by this optogenetic stimulation in different ways and at different times.

      Overall, this is an interesting paper, but much of the analysis is very preliminary, and much of the crucial data about the learning effects and alterations to cell firing are not presented clearly and fully. This is further confounded by a rather opaque description of the results and analysis in the text. Overall, there is something very interesting here, but there needs to be a substantial series of extra analyses to clearly say what this is. In many cases, more robust analysis may render the results underpowered, which could dramatically change the conclusions of the paper.

      Strengths:

      The authors performed difficult, dual-location recordings across a multi-day learning paradigm, which seems like it could be a really nice dataset. They delve into the circuit basis of an interesting finding regarding ACC to CA1 connectivity and how this changes before and after fear conditioning. They provide data to suggest this connectivity may be through specific and distinct subcircuits in CA1.

      Weaknesses:

      (1) There is essentially no information in the text or figures about what the actual learning was, how it was done, how individual animals performed, and how any of these metrics related to learning. Looking at the methods, the authors did a number of things never mentioned anywhere in the text or figures, including novel arena exposure, contextual reexposure in extinction after learning, etc. It seems that this is a very rich dataset that has not been presented at all. I would recommend at the very least:<br /> a) Plot all of the behavioural training data, and how each mouse relates to one another - did the mice learn? At this stage, we don't know!<br /> b) Explain in the text in detail exactly what was done and why, and what this tells us about the neuronal activity.<br /> c) If there is variance in learning and/or conditioning, does this relate to features in the analysis, such as the GLM result?

      (2) Along similar lines, a key metric for most of the paper is that neurons most coupled with ACC are more likely to be inhibited during training. However, there is nothing anywhere in the paper showing these data. How do neurons in general respond to contextual shocks? The methods describe this as the average firing rate during training, normalised to pre-sleep activity. This metric seems a bit coarse and may obscure really important task-relevant dynamics. Are the neurons active at specific times, are they tuned to relevant parts of the task, and do any of these features of the cell activity also relate to the coupling with ACC? Similarly, how did the authors mitigate the influence of electrical artefacts caused by the foot shock in their recordings? Again, there is a huge amount of data here that is not being described, and likely holds very valuable information about what is actually happening. The paper would really benefit from the inclusion of these data in an accessible form, such as heatmaps of spiking, how these patterns change over time, and around e.g., foot shock, etc. Also key is how these features are altered by the variability of learning across subjects.

      (3) A number of the effects are presented by comparing a statistically significant effect to a non-statistically significant effect (e.g. in Figure 2b, Figure 2d, Figure 4 b,c, and others). This isn't really valid - the key test that the two groups are different is either with a direct test of the difference or an interaction term in an e.g., ANOVA test. In some places, I am not sure the same conclusions will be drawn from the data with these tests.

      (4) To what extent is defining superficial and deep CA1 neurons solely by ripple waveform an accepted method? Of the two papers referenced for this approach, one is a 2-photon calcium imaging paper that does not do electrical recordings (as far as I am aware), and the second uses this as a descriptor after defining the positions of units on an array. It would be good to clarify how accepted this is, and also how robust this is. At the very least, some kind of metric or walkthrough in the supplement as to how this was done, and how well each cell was classified and with what confidence, or some metric of how distinct and separate the two populations were (or was it just a smudge).

      (5) In the optogenetic experiment in Figure 5, the effect on the CA1 sup neurons seems to be driven by changes in a small subpopulation of this group, with no change in the others. Related to point 2, is there anything else in the data that can pull out what these cells are? More detailed analysis of the firing of these neurons might pull out something really interesting.

      (6) Related to this - a number of comparisons simply pool neurons across mice and analyse them as if independent. This is done a lot in the past, but it would be better if an approach that included the interdependence of neurons recorded from the same mouse at the same time were used (such as a hierarchical model). While this is complex, a simpler approach would just be to plot the summary data also per mouse. For example, in Figure 5, how do the neurons inhibited by ACC activation spread across the different mice? Is the level of inhibition related to how well the mice learned the CS-US association?

      (7) Figure 6 is interesting, but very preliminary. None of the effects are quantified, and one of the cell types is not identified. I think some proper analysis needs to be done, again across mice, to be able to draw conclusions from these data.

      (8) Finally, in general, I felt that the way the paper was written was very hard to follow, often relying on very processed levels of analysis that were hard to relate back to the raw traces and their biological meaning. In general taking more words to really simply and fully explain each analysis, and taking the words and figures to walk through how each analysis was done and what it tells us about the neuronal data/biology would be really beneficial, especially to someone who is not an extracellular electrophysiologist or immersed in the immediate field.

      In summary, while this manuscript explores an intriguing hypothesis about pre-learning circuit dynamics, it is currently held back by insufficient clarity in behavioural analysis, data presentation, and statistical quantification. Addressing these core issues would greatly improve interpretability and confidence in the findings.

    1. Reviewer #1 (Public review):

      Summary:

      Zhang et al. addressed the question of whether hyperaltruistic preference is modulated by decision context and tested how oxytocin (OXT) may modulate this process. Using an adapted version of a previously well-established moral decision-making task, healthy human participants in this study undergo decisions that gain more (or lose less, termed as context) meanwhile inducing more painful shocks to either themselves or another person (recipient). The alternative choice is always less gain (or more loss) meanwhile less pain. Through a series of regression analyses, the authors reported that hyperaltruistic preference can only be found in the gain context but not in the loss context, however, OXT reestablished the hyperaltruistic preference in the loss context similar to that in the gain context.

      Strengths:

      This is a solid study that directly adapted a previously well-established task and the analytical pipeline to assess hyperaltruistic preference in separate decision contexts. Context-dependent decisions have gained more and more attention in literature in recent years, hence this study is timely. It also links individual traits (via questionnaires) with task performance, to test potential individual differences. The OXT study is done with great methodological rigor, including pre-registration. Both studies have proper power analysis to determine the sample size.

      Weaknesses:

      Despite the strengths, multiple analytical decisions have to be explained, justified, or clarified. Also, there is scope to enhance the clarity and coherence of the writing - as it stands, readers will have to go back and forth to search for information. Last, it would be helpful to add line numbers in the manuscript during the revision, as this will help all reviewers to locate the parts we are talking about.

      Introduction:<br /> (1) The introduction is somewhat unmotivated, with key terms/concepts left unexplained until relatively late in the manuscript. One of the main focuses in this work is "hyperaltruistic", but how is this defined? It seems that the authors take the meaning of "willing to pay more to reduce other's pain than their own pain", but is this what the task is measuring? Did participants ever need to PAY something to reduce the other's pain? Note that some previous studies indeed allow participants to pay something to reduce other's pain. And what makes it "HYPER-altruistic" rather than simply "altruistic"? Plus, in the intro, the authors mentioned that the "boundary conditions" remain unexplored, but this idea is never touched again. What do boundary conditions mean here in this task? How do the results/data help with finding out the boundary conditions? Can this be discussed within wider literature in the Discussion section? Last, what motivated the authors to examine decision context? It comes somewhat out of the blue that the opening paragraph states that "We set out to [...] decision context", but why? Are there other important factors? Why decision context is more important than studying those others?

      Experimental design:<br /> (2) The experiment per se is largely solid, as it followed a previously well-established protocol. But I am curious about how the participants got instructed? Did the experimenter ever mention the word "help" or "harm" to the participants? It would be helpful to include the exact instructions in the SI.

      (3) Relatedly, the experimental details were not quite comprehensive in the main text. Indeed, Methods come after the main text, but to be able to guide readers to understand what was going on, it would be very helpful if the authors could include some necessary experimental details at the beginning of the Results section.

      Statistical analysis<br /> (3) One of the main analyses uses the harm aversion model (Eq1) and the results section keeps referring to one of the key parameters of it (ie, k). However, it is difficult to understand the text without going to the Methods section below. Hence it would be very helpful to repeat the equation also in the main text. A similar idea goes to the delta_m and delta_s terms - it will be very helpful to give a clear meaning of them, as nearly all analyses rely on knowing what they mean.

      (4) There is one additional parameter gamma (choice consistency) in the model. Did the authors also examine the task-related difference of gamma? This might be important as some studies have shown that the other-oriented choice consistency may differ in different prosocial contexts.

      (5) I am not fully convinced that the authors included two types of models: the harm aversion model and logistic regression models. Indeed, the models look similar, and the authors have acknowledged that. But I wonder if there is a way to combine them? For example:<br /> Choice ~ delta_V * context * recipient (*Oxt_v._placebo)<br /> The calculation of delta_V follows Equation 1.<br /> Or the conceptual question is, if the authors were interested in the specific and independent contribution of delta_m and delta_s to behavior, as their logistic model did, why did the authors examine the harm aversion first, where a parameter k is controlling for the trade-off? One way to find it out is to properly run different models and run model comparison. In the end, it would be beneficial to only focus on the "winning" model to draw inferences.

      (6) The interpretation of the main OXT results needs to be more cautious. According to the operationalization, "hyperaltruistic" is the reduction of pain of others (higher % of choosing the less painful option) relative to the self. But relative to the placebo (as baseline), OXT did not increase the % of choosing the less painful option for others, rather, it decreased the % of choosing the less painful option for themselves. In other words, the degree of reducing other's pain is the same under OXT and placebo, but the degree of benefiting self-interest is reduced under OXT. I think this needs to be unpacked, and some of the wording needs to be changed. I am not very familiar with the OXT literature, but I believe it is very important to differentiate whether OXT is doing something on self-oriented actions vs other-oriented actions. Relatedly, for results such as that in Fig5A, it would be helpful to not only look at the difference, but also the actual magnitude of the sensitivity to the shocks, for self and others, under OXT and placebo.

      Comments on revisions:

      I did not change my original public review, as I think it can still be helpful for the field to see the reasoning and argument.

      For the revision, the authors have done a thorough job of addressing my previous comments and questions.

      The only aspect I would like to ask is that, it would still be great to have a clear definition of hyperaltruism. As it stands, hyperaltruism refers to "people's willingness to pay more to reduce other's pain than their own pain", ie, this means the "hyper" bit is considered with respect to "self". But shouldn't hyperaltruism be classified contrasting "normal" altruism?

      It is fine that it follows a previously published work (Crockett et al., 2014), but it would still be necessary to explain/define the construct being tested in a standalone fashion rather than letting readers to go back to the original work.

    2. Reviewer #3 (Public review):

      Summary:

      In this study, the authors aimed to index individual variation in decision-making when decisions pit the interests of the self (gains in money, potential for electric shock) against the interests of an unknown stranger in another room (potential for unknown shock). In addition, the authors conducted an additional study in which male participants were either administered intranasal oxytocin or placebo before completing the task to identify the role of oxytocin in moderating task responses. Participants' choice data was analyzed using a harm aversion model in which choices were driven by the subjective value difference between the less and more painful options.

      Strengths:

      Overall, I think this is a well-conducted, interesting, and novel set of research studies exploring decision-making that balances outcomes for the self versus a stranger, and the potential role of the hormone oxytocin (OT) in shaping these decisions. The pain component of the paradigm is well designed, as is the decision-making task, and overall the analyses were well suited to evaluating and interpreting the data. Advantages of the task design include the absence of deception, e.g., the use of a real study partner and real stakes, as a trial from the task was selected at random after the study and the choice the participant made was actually executed.

      Weaknesses:

      The primary weakness of the paper concerns its framing. Although it purports to be measuring "hyper-altruism," which is the same term used in prior similar (although not identical) designs, I do not believe the task constitutes altruism, but rather the decision to engage, or not engage, in instrumental aggression.

      I continue to believe that when in the "other" trials the only outcome possible for the study partner is pain, and the only outcome possible for the participant is monetary gain, these trials measure decisions about instrumental aggression. That is the exact definition of instrumental aggression: causing others harm for personal gain. Altruism is not equivalent to refraining from engaging in instrumental aggression, although some similar mechanisms may support both. True altruism would be to accept shocks to the self for the other's benefit (e.g., money). The interpretation of this task as assessing instrumental aggression is supported by the fact that only the Instrumental Harm subscale of the OUS was associated with outcomes in the task, but not the Impartial Benevolence subscale. By contrast, the IB subscale is the one more consistently associated with altruism (e.g., Kahane et al., 2018; Amormino et al., 2022). I believe it is important for scientific accuracy for the paper, including the title, to be rewritten to reflect what it is testing.

      Although I recognize similar tasks have been previously characterized as "hyper-altruism" I do not believe that is sufficient justification for continuing to promulgate this descriptor without any caveats. I hope the authors will engage more seriously with the idea that this is what the task is measuring.

      Relatedly, in the introduction, I believe it would be important to discuss the non-symmetry of moral obligations related to help/harm--we have obligations not to harm strangers but no obligation to help strangers. This is another reason I do not think the term "hyper altruism" is a good description for this task--given it is typically viewed as morally obligatory not to harm strangers, choosing not to harm them is not "hyper" altruistic (and again, I do not view it as obviously altruism at all).

    3. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      Despite the strengths, multiple analytical decisions have to be explained, justified, or clarified. Also, there is scope to enhance the clarity and coherence of the writing - as it stands, readers will have to go back and forth to search for information. Last, it would be helpful to add line numbers in the manuscript during the revision, as this will help all reviewers to locate the parts we are talking about.

      We thank the reviewer for the suggestions and have added line numbers to the revised manuscript.

      (1) Introduction:

      The introduction is somewhat unmotivated, with key terms/concepts left unexplained until relatively late in the manuscript. One of the main focuses in this work is "hyperaltruistic", but how is this defined? It seems that the authors take the meaning of "willing to pay more to reduce other's pain than their own pain", but is this what the task is measuring? Did participants ever need to PAY something to reduce the other's pain? Note that some previous studies indeed allow participants to pay something to reduce other's pain. And what makes it "HYPER-altruistic" rather than simply "altruistic"?

      As the reviewer noted, we adopted a well-established experimental paradigm to study the context-dependent effect on hyper-altruism. Altruism refers to the fact that people take others’ welfare into account when making decisions that concern both parties. Research paradigms investigating altruistic behavior typically use a social decision task that requires participants to choose between options where their own financial interests are pitted against the welfare of others (FeldmanHall et al., 2015; Hu et al., 2021; Hutcherson et al., 2015; Teoh et al., 2020; Xiong et al., 2020). On the other hand, the hyperaltruistic tendency emphasizes subjects’ higher valuation of others’ pain than of their own pain (Crockett et al., 2014, 2015, 2017; Volz et al., 2017). One example of hyperaltruism would be the following scenario: the subject is willing to forgo $2 to reduce others’ pain by 1 unit (social-decision task) and only willing to forgo $1 to reduce the same amount of his/her own pain (self-decision task) (Crockett et al., 2014). On the contrary, if the subjects are willing to forgo less money to reduce others’ suffering in the social decision task than in the self-decision task, then it can be claimed that no hyperaltruism is observed. Therefore, hyperaltruistic preference can only be measured by collecting subjects’ choices in both the self and social decision tasks and comparing the choices in both tasks.

      In our task, as in the studies before ours (Crockett et al., 2014, 2015, 2017; Volz et al., 2017), subjects in each trial were faced with two options with different levels of pain for others and monetary payoffs for themselves. Based on subjects’ choice data, we can infer through the regression analysis how much monetary payoff subjects were willing to trade in exchange for reducing others’ pain (see Figure 1 and methods for the experimental details). We have rewritten the introduction and methods sections to make this point clearer to the audience.

      Plus, in the intro, the authors mentioned that the "boundary conditions" remain unexplored, but this idea is never touched again. What do boundary conditions mean here in this task? How do the results/data help with finding out the boundary conditions? Can this be discussed within wider literature in the Discussion section?

      Boundary conditions here specifically refer to the variables or decision contexts that determine whether hyperaltruistic behavior can be elicited. Individual personality traits, motivation, and social relationships may all be boundary conditions affecting the emergence of hyperaltruistic behavior. In our task, we specifically focused on the valence of the decision context (gain vs. loss), since previous studies only tested the hyperaltruistic preference in the gain context and the introduction of the loss context might bias subjects’ hyperaltruistic behavior through implicit moral framing.

      We have explained the boundary conditions in the revised introduction (Lines 45 ~ 49).

      “However, moral norm is also context dependent: vandalism is clearly against social and moral norms yet vandalism for self-defense is more likely to be ethically and legally justified (the Doctrine of necessity). Therefore, a crucial step is to understand the boundary conditions for hyperaltruism.”

      Last, what motivated the authors to examine the decision context? It comes somewhat out of the blue that the opening paragraph states that "We set out to [...] decision context", but why? Are there other important factors? Why decision context is more important than studying those others?

      We thank the reviewer for the comment. The hyperaltruistic preference was originally demonstrated between conditions where subjects’ personal monetary gain was pitted against others’ pain (social-condition) or against subjects’ own suffering (self-condition) (Crockett et al., 2014). Follow-up studies found that subjects also exhibited strong egoistic tendencies if instead subjects needed to harm themselves for others’ benefit in the social condition (by flipping the recipients of monetary gain and electric shocks) (Volz et al., 2017). However, these studies have primarily focused on gain contexts, neglecting the fact that valence could also be an influential factor in biasing subjects’ behavior (difference between gain and loss processing in humans). It is likely that replacing monetary gains with losses in the money-pain trade-off task might bias subjects’ hyperaltruistic preference due to heightened vigilance or negative emotions in the face of potential loss (such as loss aversion) (Kahneman & Tversky, 1979; Liu et al., 2020; Pachur et al., 2018; Tom et al., 2007; Usher & McClelland, 2004; Yechiam & Hochman, 2013). Another possibility is that gain and loss contexts may elicit different subjective moral perceptions (or internal moral framings) in participants, affecting their hyperaltruistic preferences (Liu et al., 2017; Losecaat Vermeer et al., 2020; Markiewicz & Czupryna, 2018; Wu et al., 2018). In our manuscript, we did not strive to compare which factors might be more important in eliciting hyperaltruistic behavior, but rather to demonstrate the crucial role played by the decision context and to show that the internal moral framing could be the mediating factor in driving subjects’ hyperaltruistic behavior. In fact, we speculate that the egoistic tendencies found in the Volz et al. 2017 study were partly driven by the subjects’ failure to engage the proper internal moral framing in the social condition (harm for self, see Volz et al., 2017 for details).

      (2) Experimental Design:

      (2a) The experiment per se is largely solid, as it followed a previously well-established protocol. But I am curious about how the participants got instructed? Did the experimenter ever mention the word "help" or "harm" to the participants? It would be helpful to include the exact instructions in the SI.

      In the instructions, we avoided words such as “harm”, “help”, or other terms reminding subjects about the moral judgement of the decisions they were about to make. Instead, we presented the options in a neutral and descriptive manner, focusing only on the relevant components (shocks and money). The instructions for all four conditions are shown in supplementary Fig. 9.

      (2b) Relatedly, the experimental details were not quite comprehensive in the main text. Indeed, the Methods come after the main text, but to be able to guide readers to understand what was going on, it would be very helpful if the authors could include some necessary experimental details at the beginning of the Results section.

      We thank the reviewer for the suggestion. We have now provided a brief introduction of the experimental details in the revised results section (Lines 125 ~ 132).

      “Prior to the money-pain trade-off task, we individually calibrated each subject’s pain threshold using a standard procedure[4–6]. This allowed us to tailor a moderate electric stimulus that corresponded to each subject’s subjective pain intensity. Subjects then engaged in 240 decision trials (60 trials per condition), acting as the “decider” and trading off between monetary gains or losses for themselves and the pain experienced by either themselves or an anonymous “pain receiver” (gain-self, gain-other, loss-self and loss-other, see Supplementary Fig. 8 for the instructions and also see methods for details).”

      (3) Statistical Analysis<br /> (3a) One of the main analyses uses the harm aversion model (Eq1) and the results section keeps referring to one of the key parameters of it (ie, k). However, it is difficult to understand the text without going to the Methods section below. Hence it would be very helpful to repeat the equation also in the main text. A similar idea goes to the delta_m and delta_s terms - it will be very helpful to give a clear meaning of them, as nearly all analyses rely on knowing what they mean.

      We thank the reviewer for the suggestion. We have now added the equation of the harm aversion model and provided a more detailed description of the equation in the main text (Lines 150 ~ 155).

      “We also modeled subjects’ choices using an influential model where subjects’ behavior could be characterized by the harm (electric shock) aversion parameter κ, reflecting the relative weights subjects assigned to ∆m and ∆s, the objective difference in money and shocks between the more and less painful options, respectively (∆V=(1-κ)∆m - κ∆s Eq.1, See Methods for details)[4–6]. Higher κ indicates that higher sensitivity is assigned to ∆s than ∆m and vice versa.”
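      As a worked illustration of Eq. 1, the sketch below links κ to the amount of money a subject would forgo per unit of shock, using the $2 versus $1 scenario given earlier in this response. It relies only on Eq. 1 as quoted; the function names are hypothetical and the indifference-point reading (∆V = 0) is our own simplification rather than part of the authors' fitting procedure.

```python
def delta_v(kappa, delta_m, delta_s):
    """Eq. 1: value difference between the more and less painful options."""
    return (1 - kappa) * delta_m - kappa * delta_s

def indifference_price(kappa):
    """Money per unit of shock at which delta_v = 0: delta_m / delta_s = kappa / (1 - kappa)."""
    return kappa / (1 - kappa)

def kappa_from_price(price):
    """Inverse mapping: kappa implied by forgoing `price` per unit of shock."""
    return price / (1 + price)

kappa_other = kappa_from_price(2.0)  # forgo $2 per unit of the other's pain
kappa_self = kappa_from_price(1.0)   # forgo $1 per unit of own pain
hyperaltruism = kappa_other - kappa_self  # positive when others' pain is weighted more
```

      In this reading, a larger κ corresponds to a higher price the subject accepts per unit of shock avoided, and a positive κ<sub>other</sub>-κ<sub>self</sub> corresponds to the hyperaltruistic pattern.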

      (3b) There is one additional parameter gamma (choice consistency) in the model. Did the authors also examine the task-related difference of gamma? This might be important as some studies have shown that the other-oriented choice consistency may differ in different prosocial contexts.

      To examine the task-related difference of choice consistency (γ), we compared the performance of 4 candidate models:

      Model 1 (M1): The choice consistency parameter γ remains constant across shock recipients (self vs. other) and decision contexts (gain vs. loss).

      Model 2 (M2): γ differs between the self- and other-recipient conditions, with γ<sub>self</sub> and γ<sub>other</sub> representing the choice consistency when pain is inflicted on him/her-self or the other-recipient.

      Model 3 (M3): γ differs between the gain and loss conditions, with γ<sub>gain</sub> and γ<sub>loss</sub> representing the choice consistencies in the gain and loss contexts, respectively.

      Model 4 (M4): γ varies across four conditions, with γ<sub>self-gain</sub>, γ<sub>other-gain</sub>, γ<sub>self-loss</sub> and γ<sub>other-loss</sub> capturing the choice consistency in each condition.

      As Supplementary Fig. 10 shows, after fitting all the models to subjects’ choice data, model 1 (M1) performed best among the four candidate models in both studies (1 & 2), with the lowest Bayesian Information Criterion (BIC). Therefore, we conclude that factors such as the shock recipients (self vs. other) and decision contexts (gain vs. loss) did not significantly influence subjects’ choice consistency, and we report model results using the single choice consistency parameter.
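      The BIC comparison can be illustrated with a short sketch. The log-likelihoods and parameter counts below are made-up placeholders (including the assumption of four κ parameters shared by every model), not the fitted values behind Supplementary Fig. 10; only the formula BIC = k·ln(n) − 2·ln(L) is standard.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; lower values indicate a better trade-off."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

N_TRIALS = 240  # trials per subject, as stated in the task description

# Hypothetical fits: n_params = 4 kappas (assumed) + number of gamma parameters.
fits = {
    "M1": {"loglik": -120.0, "n_params": 5},  # single gamma
    "M2": {"loglik": -119.5, "n_params": 6},  # gamma_self, gamma_other
    "M3": {"loglik": -119.2, "n_params": 6},  # gamma_gain, gamma_loss
    "M4": {"loglik": -118.0, "n_params": 8},  # one gamma per condition
}

scores = {name: bic(f["loglik"], f["n_params"], N_TRIALS) for name, f in fits.items()}
best_model = min(scores, key=scores.get)
```

      With these placeholder numbers the small likelihood gains of M2 to M4 do not offset the penalty for their extra γ parameters, which is the kind of outcome the response reports.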

      (3c) I am not fully convinced that the authors included two types of models: the harm aversion model and the logistic regression models. Indeed, the models look similar, and the authors have acknowledged that. But I wonder if there is a way to combine them? For example:

      Choice ~ delta_V * context * recipient (*Oxt_v._placebo)

      The calculation of delta_V follows Equation 1.

      Or the conceptual question is, if the authors were interested in the specific and independent contribution of delta_m and delta_s to behavior, as their logistic model did, why did the authors examine the harm aversion first, where a parameter k is controlling for the trade-off? One way to find it out is to properly run different models and run model comparisons. In the end, it would be beneficial to only focus on the "winning" model to draw inferences.

      The reviewer raised an excellent point here. According to the logistic regression model, we have:

      Where P is the probability of selecting the less harmful option. Similarly, if we combine Eq.1 (∆V=(1-κ)∆m-κ∆s) and Eq.2 of the harm aversion model, we have:

      If we ignore the constant term β<sub>0</sub> from the logistic regression model, the harm aversion model is simply a reparameterization of the logistic regression model. The harm aversion model was implemented first to derive the harm aversion parameter (κ), which is a parameter in the range [0, 1] that quantifies how subjects weigh the relative contributions of Δm and Δs between options in their decision processes. Since previous studies used the term κ<sub>other</sub>-κ<sub>self</sub> to define the magnitude of hyperaltruistic preference, we adopted a similar approach to compare our results with previous research under the same theoretical framework. However, in order to investigate the independent contributions of Δm and Δs, we have to take γ into account. The coefficients β<sub>∆m</sub> and β<sub>∆s</sub> in the logistic regression model are not necessarily correlated by nature, whereas in the harm aversion model the coefficients (1-κ) and κ are always strictly negatively correlated (see Eq. 1). Only after multiplying by γ does the correlation between γ(1-κ) and γκ vary, depending on the specific distribution of γ and κ. In summary, we followed the approach of previous research in estimating the harm aversion parameter κ, both to compare our results with previous studies and to capture the relative influence of Δm and Δs. When we studied the contextual effects (gain vs. loss or placebo vs. control) on subjects’ behavior, we further investigated the contextual effect on how subjects evaluated Δm and Δs, respectively. The two models (logistic regression model and harm aversion model) in our study are mathematically equivalent and are not competing candidate models. Instead, they represent different perspectives from which our data can be examined.
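      The reparameterization argument can be made concrete with a small sketch. It assumes, since the equations themselves are not reproduced in this text, that the choice rule is a logistic function of γ·∆V, so that the log-odds equal γ(1-κ)∆m − γκ∆s; the sign convention and all function names are assumptions made for illustration.

```python
def to_logistic(gamma, kappa):
    """(gamma, kappa) -> (beta_m, beta_s), assuming log-odds = gamma * delta_v."""
    return gamma * (1 - kappa), -gamma * kappa

def to_harm_aversion(beta_m, beta_s):
    """(beta_m, beta_s) -> (gamma, kappa); requires beta_m - beta_s != 0."""
    gamma = beta_m - beta_s
    return gamma, -beta_s / gamma

def log_odds_harm_aversion(gamma, kappa, delta_m, delta_s):
    """Log-odds under the harm aversion form, using Eq. 1 for delta_v."""
    return gamma * ((1 - kappa) * delta_m - kappa * delta_s)

def log_odds_regression(beta_m, beta_s, delta_m, delta_s):
    """Log-odds under the regression form without the constant term beta_0."""
    return beta_m * delta_m + beta_s * delta_s

beta_m, beta_s = to_logistic(gamma=2.0, kappa=0.3)
gamma_back, kappa_back = to_harm_aversion(beta_m, beta_s)
```

      Under this assumption the two parameter pairs carry the same information: κ captures the ratio of the two regression weights, while γ captures their overall scale.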

      We also compared the harm aversion model with and without the constant term β<sub>0</sub> in the choice function. Adding a constant term β<sub>0</sub>, Equation 2 above becomes:

      As the following figure shows, the hyperaltruistic parameters (κ<sub>other</sub>-κ<sub>self</sub>) calculated from the harm aversion model with the constant term (panels A & B) show patterns almost identical to those from the model without the constant term (panels C & D, i.e. Figs. 2B & 4B in the original manuscript) in both studies.

      Author response image 1.


      (3d) The interpretation of the main OXT results needs to be more cautious. According to the operationalization, "hyperaltruistic" is the reduction of pain of others (higher % of choosing the less painful option) relative to the self. But relative to the placebo (as baseline), OXT did not increase the % of choosing the less painful option for others, rather, it decreased the % of choosing the less painful option for themselves. In other words, the degree of reducing others' pain is the same under OXT and placebo, but the degree of benefiting self-interest is reduced under OXT. I think this needs to be unpacked, and some of the wording needs to be changed. I am not very familiar with the OXT literature, but I believe it is very important to differentiate whether OXT is doing something on self-oriented actions vs other-oriented actions. Relatedly, for results such as that in Figure 5A, it would be helpful to not only look at the difference but also the actual magnitude of the sensitivity to the shocks, for self and others, under OXT and placebo.

      We thank the reviewer for this thoughtful comment. As the reviewer correctly pointed out, “hyperaltruism” can be defined as “higher % of choosing the less painful option to the others relative to the self”. Closer examination of the results showed that the degree of reducing others’ pain and the degree of reducing one’s own pain both decreased under OXT (Figure 4A). More specifically, our results do not support the claim that “In other words, the degree of reducing others’ pain is the same under OXT and placebo, but the degree of benefiting self-interest is reduced under OXT.” Instead, the results show a significant reduction in the choice of the less painful option under OXT treatment for both the self and other conditions (interaction effect of OXT vs. placebo and self vs. other: F<sub>1,45</sub>= 16.812, P < 0.001, η<sup>2</sup> = 0.272; simple effect of OXT vs. placebo in the self-condition: F<sub>1,45</sub>=59.332, P < 0.001, η<sup>2</sup> = 0.569; OXT vs. placebo in the other-condition: F<sub>1,45</sub>= 14.626, P < 0.001, η<sup>2</sup> = 0.245; repeated-measures ANOVA, see Figure 4A).

      We also performed mixed-effect logistic regression analyses where subjects’ choices were regressed against ∆m and ∆s in different valence (gain vs. loss) and recipient (self vs. other) conditions in both studies 1 & 2 (Supplementary Figs. 1 & 6). As shown in the figure above, where we replot Supplementary Fig. 6 together with panel B (included as Supplementary Fig. 8 in the supplementary materials), we found a significant treatment × ∆<sub>s</sub> (difference in shock magnitude between the more and less painful options) interaction effect (β=-0.136±0.029, P < 0.001, 95% CI=[-0.192, -0.079]), indicating that subjects’ sensitivities towards pain were indeed different between the placebo and OXT treatments for both the self and other conditions. Furthermore, the significant four-way ∆<sub>s</sub> × treatment (OXT vs. placebo) × context (gain vs. loss) × recipient (self vs. other) interaction effect (β=0.125±0.053, P=0.018, 95% CI=[0.022, 0.228]) in the regression analysis, followed by significant simple effects (in the OXT treatment: ∆<sub>s</sub> × recipient effect in the gain context: F<sub>1,45</sub>= 7.622, P = 0.008, η<sup>2</sup> = 0.145; ∆<sub>s</sub> × recipient effect in the loss context: F<sub>1,45</sub>= 7.966, P = 0.007, η<sup>2</sup> = 0.150), suggested that under OXT treatment, participants showed a greater sensitivity toward ∆<sub>s</sub> in the other-condition than in the self-condition (see asterisks in the OXT condition in panel B), thus restoring the hyperaltruistic behavior in the loss context.

      As the reviewer suggested, OXT’s effect on hyperaltruism does manifest separately in subjects’ harm sensitivities for self- and other-oriented actions. We followed the reviewer’s suggestions and examined the actual magnitude of the sensitivities to shocks for both the self and other conditions (panel B in the figure above). It is clear that the administration of OXT (compared to the placebo treatment, panel B in the figure above) significantly reduced participants’ pain sensitivity (treatment × ∆<sub>s</sub>: β=-0.136±0.029, P < 0.001, 95% CI=[-0.192,-0.079]), yet also restored the harm sensitivity patterns in both the gain and loss conditions. These results are included in the supplementary figures (6 & 8) as well as in the main text.

      Recommendations:

      (1) For Figures 2A-B, it would be great to calculate the correlation separately for gain and loss, as in other figures.

      We believe the reviewer is referring to Figures 3A & B. We did not present the correlations separately for the gain and loss contexts because the correlations between an individual’s IH (instrumental harm) or IB (impartial beneficence) scores and hyperaltruistic preferences were not significantly modulated by the contextual factors. The interaction effects in Figs. 3A & B and Supplementary Fig. 5 (also see Tables S1 & S2) are as follows. Study 1: valence × IH effect: β=0.016±0.022, t<sub>152</sub>=0.726, P=0.469; valence × IB effect: β=0.004±0.031, t<sub>152</sub>=0.115, P=0.908. Study 2, placebo condition: valence × IH effect: β=0.018±0.024, t<sub>84</sub>=0.030, P=0.463; valence × IB effect: β=0.051±0.030, t<sub>84</sub>=1.711, P=0.702. We have added these statistics to the main text following the reviewer’s suggestions.

      (2) "by randomly drawing a shock increment integer ∆s (from 1 to 19) such that [...] did not exceed 20 (S+ ≤ 20)." I am not sure if a random drawing following a uniform distribution can guarantee S is smaller than 20. More details are needed. Same for the monetary magnitude.

      We are sorry for the lack of clarity in the method description. For the task design, we adopted the original design from previous literature (Crockett et al., 2014, 2017). More specifically:

      “Specifically, each trial was determined by a combination of the differences of shocks (Δs, ranging from 1 to 19, with increment of 1) and money (Δm, ranging from ¥0.2 to ¥19.8, with increment of ¥0.2) between the two options, resulting in a total of 19×99=1881 pairs of [Δs, Δm]. To ensure the trials were suitable for most subjects, we evenly distributed the desired ratio Δm / (Δs + Δm) between 0.01 and 0.99 across 60 trials for each condition. For each trial, we selected the [Δs, Δm] pair from the pool whose Δm / (Δs + Δm) ratio was closest to the desired ratio, which was then used to determine the actual money and shock amounts of the two options. The shock amount (S<sub>less</sub>) for the less painful option was an integer drawn from the discrete uniform distribution [1-19], constrained by S<sub>less</sub> + ∆s < 20. Similarly, the money amount (M<sub>less</sub>) for the less painful option was drawn from a discrete uniform distribution [¥0.2 - ¥19.8], with the constraint of M<sub>less</sub> + ∆m < 20. Once S<sub>less</sub> and M<sub>less</sub> were selected, the shock (S<sub>more</sub>) and money (M<sub>more</sub>) magnitudes for the more painful option were calculated as: S<sub>more</sub> = S<sub>less</sub> + ∆s, M<sub>more</sub> = M<sub>less</sub> + ∆m”

      We have added these details to the methods section (Lines 520-533).
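      The trial-construction procedure quoted above can be sketched as follows. This is an illustrative reading of the description, not the authors' task code: the function names, the fixed seed, and the tie-breaking rule (first closest pair in the pool) are assumptions. The sketch uses an inclusive bound (S<sub>less</sub> + ∆s ≤ 20, M<sub>less</sub> + ∆m ≤ 20), matching the reviewer's quotation of the manuscript; with a strict inequality no valid draw would exist when ∆s = 19 or ∆m = ¥19.8. Money is stored as a count of ¥0.2 steps to avoid floating-point drift.

```python
import random

STEP = 0.2        # money increment in yuan
MAX_AMOUNT = 20   # assumed inclusive upper bound for shocks and money
MAX_STEPS = 100   # 20 yuan expressed in 0.2-yuan steps


def build_pool():
    """All [delta_s, delta_m] pairs; delta_m is a count of 0.2-yuan steps."""
    return [(ds, k) for ds in range(1, 20) for k in range(1, 100)]


def ratio(ds, k):
    """Ratio delta_m / (delta_s + delta_m) for one pair."""
    dm = k * STEP
    return dm / (ds + dm)


def make_trials(n_trials=60, lo=0.01, hi=0.99, seed=0):
    rng = random.Random(seed)
    pool = build_pool()
    trials = []
    for i in range(n_trials):
        # Desired ratios are evenly spaced between lo and hi.
        target = lo + (hi - lo) * i / (n_trials - 1)
        ds, k = min(pool, key=lambda pair: abs(ratio(*pair) - target))
        # Drawing uniformly from the admissible range is equivalent to
        # drawing uniformly and rejecting values that break the bound.
        s_less = rng.randint(1, MAX_AMOUNT - ds)
        m_less_k = rng.randint(1, MAX_STEPS - k)
        trials.append({
            "delta_s": ds,
            "delta_m": round(k * STEP, 1),
            "s_less": s_less,
            "s_more": s_less + ds,
            "m_less": round(m_less_k * STEP, 1),
            "m_more": round((m_less_k + k) * STEP, 1),
        })
    return trials
```

      Restricting the draw to the admissible range is what guarantees that the more painful option never exceeds the bound, which is the point the reviewer asked about.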

      Reviewer #2:

      (1) The theoretical hypothesis needs to be better justified. There are studies addressing the neurobiological mechanism of hyperaltruistic tendency, which the authors unfortunately skipped entirely.

      Also in recommendation #1:

      (1) In the Introduction, the authors claim that "the mechanistic account of the hyperaltruistic phenomenon remains unknown". I think this is too broad of a criticism and does not do justice to prior work that does provide some mechanistic account of this phenomenon. In particular, I was surprised that the authors did not mention at all a relevant fMRI study that investigates the neural mechanism underlying hyperaltruistic tendency (Crockett et al., 2017, Nature Neuroscience). There, the researchers found that individual differences in hyperaltruistic tendency in the same type of moral decision-making task is better explained by reduced neural responses to ill-gotten money (Δm in the Other condition) in the brain reward system, rather than heightened neural responses to others' harm. Moreover, such neural response pattern is related to how an immoral choice would be judged (i.e., blamed) by the community. Since the brain reward system is consistently involved in Oxytocin's role in social cognition and decision-making (e.g., Dolen & Malenka, 2014, Biological Psychiatry), it is important to discuss the hypothesis and results of the present research in the context of this literature.

      We fully agree with the reviewer that the expression “mechanistic account of the hyperaltruistic phenomenon remains unknown” in our original manuscript could mislead readers. Indeed, we were aware of the major findings in the field and cited the seminal work on hyperaltruism and its related neural mechanism (Crockett et al., 2014, 2015, 2017). We have changed the text in the introduction to better reflect this point and added further discussion as to how oxytocin might play a role:

      “For example, it was shown that the hyperaltruistic preference modulated neural representations of the profit gained from harming others via the functional connectivity between the lateral prefrontal cortex, a brain area involved in moral norm violation, and profit sensitive brain regions such as the dorsal striatum[6].” (Lines 41~45)

      “Oxytocin has been shown to play a critical role in social interactions such as maternal attachment, pair bonding, consociate attachment and aggression in a variety of animal models[42,43]. Humans are endowed with higher cognitive and affective capacities and exhibit far more complex social cognitive patterns[44].” (Lines 86~90)

      (2) There are some important inconsistencies between the preregistration and the actual data collection/analysis, which the authors did not justify.

      Also in recommendations:

      (4) It is laudable that the authors pre-registered the procedure and key analysis of the Oxytocin study and determined the sample size beforehand. However, in the preregistration, the authors claimed that they would recruit 30 participants for Experiment 1 and 60 for Experiment 2, without justification. In the paper, they described a "prior power analysis", which deviated from their preregistration. It is OK to deviate from preregistration, but this needs to be explicitly mentioned and addressed (why the deviation occurred, why the reported approach was justifiable, etc.).

      We sincerely appreciate the reviewer’s thorough assessment of our manuscript. In the more exploratory study 1, we found that the loss decision context effectively diminished subjects’ hyperaltruistic preference. Based on this finding, we pre-registered study 2 and hypothesized that: 1) The administration of OXT may restore subjects’ hyperaltruistic preference in the loss context; 2) The administration of OXT may reduce subjects’ sensitivities towards electric shocks (but not necessarily their moral preference), due to the well-established results relating OXT to enhanced empathy for others (Barchi-Ferreira & Osório, 2021; Radke et al., 2013) and the processing of negative stimuli (Evans et al., 2010; Kirsch et al., 2005; Wu et al., 2020); and 3) The OXT effect might be context specific, depending on the particular combination of valence (gain vs. loss) and shock recipient (self vs. other) (Abu-Akel et al., 2015; Kapetaniou et al., 2021; Ma et al., 2015).

      As our results suggested, the administration of OXT indeed restored subjects’ hyperaltruistic preference (confirming hypothesis 1, Figure 4A). Also, OXT decreased subjects’ sensitivities towards electric shocks in both the gain and loss conditions (supplementary Fig. 6 and supplementary Fig. 8), consistent with our second hypothesis. We must admit that our hypothesis 3 was rather vague, since a seminal study clearly demonstrated the context-dependent effect of OXT in human cooperation and conflict depending on the group membership of the subjects (De Dreu et al., 2010, 2020). Although our results partially validated our hypothesis 3 (supplementary Fig. 6), we did not make specific predictions as to the direction and the magnitude of the OXT effect.

      The main inconsistency is related to the sample size. When we carried out study 1, we recruited both male and female subjects. After we identified the context effect on the hyperaltruistic preference, we decided to pre-register and perform study 2 (the OXT study). We originally made a rough estimate of 60 male subjects for study 2. While conducting study 2, we also went through the literature on the effect of OXT on social behavior and realized that a sample of around 45 subjects might be enough to detect the main effect of OXT. Therefore, we settled on the number of 46 (study 2) reported in the manuscript. Correspondingly, we increased the number of subjects in study 1 to the final number of 80 (40 males) to make sure the sample was large enough to detect a small-to-medium effect, as well as to allow a fair comparison between studies 1 and 2 (roughly equal numbers of male subjects). It should be noted that although the manuscript reports the study 1 results only for all subjects combined (male and female), the main results remain very similar if we focus only on the male subjects in study 1 (see the figure below). We believe that these results, together with the placebo treatment group results in study 2 (male only), confirm the validity of our original finding.

      Author response image 2.

      Author response image 3.

      We have included additional text (Lines 447 ~ 452) in the Methods section of the revised manuscript to explain the discrepancy between the preregistered and actual sample sizes:

      “It should be noted that in the preregistration we originally planned to recruit 60 male subjects for Study 2 but ended up recruiting 46 male subjects (mean age =  years) based on the sample sizes reported in previous oxytocin studies[57,69]. Additionally, a power analysis suggested that a sample size > 44 should be enough to detect a small-to-medium effect size of oxytocin (Cohen’s d=0.24, α=0.05, power (1-β)=0.8) using a 2 × 2 × 2 within-subject design[76].”

      (3) Some of the exploratory analysis seems underpowered (e.g., large multiple regression models with only about 40 participants).

      We thank the reviewer for these comments and appreciate the concern that the sample size could affect the reliability of the results in the multiple regression analyses.

      In Fig. 2, the multiple regression analyses were conducted after we observed a valence-dependent effect on hyperaltruism (Fig. 2A) and the regression was constructed accordingly:

      Choice ~ ∆s*context*recipient + ∆m*context*recipient + (1 + ∆s*context*recipient + ∆m*context*recipient | subject)

      where ∆s and ∆m indicate the shock level and monetary reward differences between the more and less painful options, context is the monetary valence (gain vs. loss), and recipient is the identity of the shock recipient (self vs. other).

      Since we have 240 trials for each subject and a total of 80 subjects in Study 1, we believe that this is a reasonable regression analysis to perform.
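      To make the size of this model concrete, the sketch below expands the fixed-effects part of the formula above and counts the observations available. It assumes the usual Wilkinson-style reading of `*` (all main effects plus all interactions, as in R-style model formulas); the helper names and the short variable labels `ds` and `dm` are hypothetical.

```python
from itertools import combinations


def expand_star(factors):
    """Expand a*b*c into its main effects and all interactions."""
    terms = []
    for order in range(1, len(factors) + 1):
        for combo in combinations(factors, order):
            terms.append(":".join(combo))
    return terms


def fixed_effect_terms(*star_groups):
    """Union of the expanded groups, keeping first-seen order."""
    seen = []
    for group in star_groups:
        for term in expand_star(group):
            if term not in seen:
                seen.append(term)
    return seen


terms = fixed_effect_terms(
    ["ds", "context", "recipient"],
    ["dm", "context", "recipient"],
)
n_fixed = len(terms) + 1      # plus the intercept
n_obs = 240 * 80              # trials per subject x subjects in Study 1
```

      Under this reading the model has 11 fixed-effect slopes plus an intercept, estimated from 19,200 trial-level observations, although the 80 subjects remain the effective unit for the random effects.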

      In Fig. 3, the multiple regression analyses were indeed exploratory. More specifically, we ran 3 multiple linear regressions:

      hyperaltruism~EC*context+IH*context+IB*context

      Relative harm sensitivity~ EC*context+IH*context+IB*context

      Relative money sensitivity~ EC*context+IH*context+IB*context

      Where Hyperaltruism is defined as κ<sub>other</sub> - κ<sub>self</sub>, Relative harm sensitivity as other β<sub>∆s</sub> - self β<sub>∆s</sub>, and Relative money sensitivity as other β<sub>∆m</sub> - self β<sub>∆m</sub>. EC (empathic concern), IH (instrumental harm) and IB (impartial beneficence) were subjects’ scores from the corresponding questionnaires.

      For the first regression, we tested whether EC, IH and IB scores were related to hyperaltruism and it should be noted that this was tested on 80 subjects (Study 1). After we identified the effect of IH on hyperaltruism, we ran the following two regressions. The reason we still included IB and EC as predictors in these two regression analyses was to remove potential confounds caused by EC and IB since previous research indicated that IB, IH and EC could be correlated (Kahane et al., 2018).

      In study 2, we performed the following regression analyses again to validate our results (Placebo treatment in study 2 should have similar results as found in study 1).

      Relative harm sensitivity~ EC*context+IH*context+IB*context

      Relative money sensitivity~ EC*context+IH*context+IB*context

      Again, we added IB and EC only to control for nuisance effects of these covariates. As indicated in Fig. 5 C-D, the placebo condition in study 2 replicated our previous findings in study 1, and OXT administration effectively removed the interaction effect between IH and valence (gain vs. loss) on subjects’ relative harm sensitivity.

      To present our data and results more objectively, we have changed the text in the results section and pointed out that the regression analysis:

      hyperaltruism~EC*context+IH*context+IB*context

      was exploratory (Lines 186-192).

      “We tested how hyperaltruism was related to both IH and IB across decision contexts using an exploratory multiple regression analysis. Moral preference, defined as κ<sub>other</sub> - κ<sub>self</sub>, was negatively associated with IH (β=-0.031±0.011, t<sub>156</sub>=-2.784, P =0.006) but not with IB (β=0.008±0.016, t<sub>156</sub>=0.475, P=0.636) across gain and loss contexts, reflecting a general connection between moral preference and IH (Fig. 3A & B).”

      (4) Inaccurate conceptualization of utilitarian psychology and the questionnaire used to measure it.

      Also in recommendations:

      (2) Throughout the paper, the authors placed lots of weight on individual differences in utilitarian psychology and the Oxford Utilitarianism Scale (OUS). I am not sure this is the best individual difference measure in this context. I don't see a conceptual fit between the psychological construct that OUS reflects, and the key psychological processes underlying the behaviors in the present study. As far as I understand it, the conceptual core of utilitarian psychology that OUS captures is the maximization of greater goods. Neither the Instrumental Harm (IH) component nor the Impartial Beneficence (IB) component reflects a tradeoff between the personal interests of the decision-making agent and a moral principle. The IH component is about the endorsement of harming a smaller number of individuals for the benefit of a larger number of individuals. The IB component is about treating self, close others, and distant others equally. However, the behavioral task used in this study is neither about distributing harm between a smaller number of others and a larger number of others nor about benefiting close or distant others. The fact that IH showed some statistical association with the behavioral tendency in the present data set could be due to the conceptual overlap between IH and an individual's tendency to inflict harm (e.g., psychopathy; Table 7 in Kahane et al., 2018, which the authors cited). I urge the authors to justify more why they believe that conceptually OUS is an appropriate individual difference measure in the present study, and if so, interpret their results in a clearer and justifiable manner (taking into account the potential confound of harm tendency/psychopathy).

      We thank the reviewer for the thoughtful comment and agree that “IH component is about the endorsement of harming a smaller number of individuals for the benefit of a larger number of individuals. The IB component is about treating self, close others, and distant others equally”. As we mentioned in the previous response to the reviewer, we first ran an exploratory multiple linear regression analysis of hyperaltruistic preference (κ<sub>other</sub> - κ<sub>self</sub>) against IB and IH in study 1, based on the hypothesis that the reduction of hyperaltruistic preference in the loss condition might be due to 1) a change in the relationship between IB and hyperaltruistic preference between the gain and loss conditions, and/or 2) the loss condition changing how the moral norm was perceived and therefore affecting the correlation between IH and hyperaltruistic preference. As Fig. 3 shows, we did not find a significant IB effect on hyperaltruistic preference (κ<sub>other</sub> - κ<sub>self</sub>), nor on the relative harm or money sensitivity (supplementary Fig. 3). These results excluded the possibility that subjects with higher IB might treat self and others more equally and therefore show less hyperaltruistic preference. On the other hand, we found a strong correlation between hyperaltruistic preference and IH (Fig. 3A): subjects with higher IH scores showed less hyperaltruistic preference. Since the hyperaltruistic preference (κ<sub>other</sub> - κ<sub>self</sub>) is a compound variable, we further broke it down into subjects’ relative sensitivity to harm and money (other β<sub>∆s</sub> - self β<sub>∆s</sub> and other β<sub>∆m</sub> - self β<sub>∆m</sub>, respectively). The follow-up regression analyses revealed that the correlation between subjects’ relative harm sensitivity and IH was altered by the decision context (gain vs. loss, Fig. 3C-D). These results are consistent with our hypothesis that for subjects to engage in the utilitarian calculation, they should first realize that there is a moral dilemma (harming others to make monetary gain in the gain condition). When there is less perceived moral conflict (due to the framing of the decision context as avoiding loss in the loss condition), the correlation between subjects’ relative harm sensitivity and IH became insignificant (Fig. 3C). It is worth noting that these results were replicated in the placebo condition of study 2, further indicating that the role of OXT is to affect how the decision context is morally framed.

      The reviewer also raised an interesting possibility that the correlation between subjects’ behavioral tendency and IH may be confounded by the fact that IH is also correlated with other traits such as psychopathy. Indeed, in the Kahane et al., 2018 paper, the authors showed that IH was associated with subclinical psychopathy in a lay population. Although we only collected and included IB and empathic concern (EC) scores as control variables and in principle could not rule out the influence of psychopathy, we argue that this is unlikely to be the case. First, psychopaths by definition “only care about their own good” (Kahane et al., 2018). However, subjects in our studies, as well as in previous research, showed greater aversion to harming others (compared to harming themselves) in the gain conditions. This is opposite to the prediction of psychopathy. Even in the loss condition, subjects showed similar levels of aversion to harming others (vs. harming themselves), indicating that our subjects valued their own and others’ well-being similarly. Second, although there appears to be an association between utilitarian judgement and psychopathy (Glenn et al., 2010; Kahane et al., 2015), the fact that people also possess a form of universal or impartial beneficence in their utilitarian judgements suggests that psychopathy alone is not a sufficient variable for explaining subjects’ hyperaltruistic behavior.

      We have thus rewritten part of the results to clarify our rationale for using the Oxford Utilitarianism Scale (especially the IH and IB) to establish the relationship between moral traits and subjects’ decision preference (Lines 212-215):

      “Furthermore, our results are consistent with the claim that profiting from inflicting pain on another person (IH) is inherently deemed immoral[1]. Hyperaltruistic preference, therefore, is likely to be associated with subjects’ IH dispositions.”

      (3) Relatedly, in the Discussion, the authors mentioned "the money-pain trade-off task, similar to the well-known trolley dilemma". I am not sure if this statement is factually accurate because the "well-known trolley dilemma" is about a disinterested third-party weighing between two moral requirements - "greatest good for the greatest number" (utilitarianism) and "do no harm" (Kantian/deontology), not between a moral requirement and one's own monetary interest (which is the focus of the present study). The analogy would be more appropriate if the task required the participants to trade off between, for example, harming one person in exchange for a charitable donation, as a recent study employed (Siegel et al., 2022, A computational account of how individuals resolve the dilemma of dirty money. Scientific reports). I urge the authors to go through their use of "utilitarian/utilitarianism” in the paper and make sure their usage aligns with the definition of the concept and the philosophical implications.

      We thank the reviewer for prompting us to think over the difference between our task and the trolley dilemma. Indeed, the trolley dilemma refers to a disinterested third party’s decision between two moral requirements, namely utilitarianism and deontology. In our study, when the shock recipient was “other”, our task could be interpreted either as a decision between “moral norm of no harm (deontology) and one’s self-interest maximization (utilitarian)”, or as a decision between “greatest good for both parties (utilitarian) vs. do no harm (deontology)”, though the latter interpretation typically requires differential weighing of one’s own benefits versus the benefits of others (Fehr & Schmidt, 1999; Saez et al., 2015). In fact, it could be argued that the utilitarian account applies not only to the third party’s well-being, but also to our own well-being, or to “that of those near or dear to us” (Kahane et al., 2018).

      We acknowledge that a direct analogy between our task and the trolley dilemma may be lacking and have therefore deleted the trolley example from the discussion.

      (5) Related to the above point, the sample size of Study 2 was calculated based on the main effect of oxytocin. However, the authors also reported several regression models that seem to me more like exploratory analyses. Their sample size may not be sufficient for these analyses. The authors should: a) explicitly distinguish between their hypothesis-driven analysis and exploratory analysis; b) report achieved power of their analysis.

      We appreciate the reviewer’s thorough reading of our manuscript. Following the reviewer’s suggestions, we have explicitly stated in the revised manuscript which analyses were exploratory and which were hypothesis driven. As requested, we also added the achieved power to the main text (Lines 274-279):

      “The effect size (Cohen’s f<sup>2</sup>) for this exploratory analysis was calculated to be 0.491 and 0.379 for the placebo and oxytocin conditions, respectively. The post hoc power analysis with a significance level of α = 0.05, 7 regressors (IH, IB, EC, decision context, IH×context, IB×context, and EC×context), and sample size of N = 46 yielded achieved power of 0.910 (placebo treatment) and 0.808 (oxytocin treatment).”
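      For orientation, a post hoc power calculation of this kind is conventionally set up as below. This is a sketch of the standard noncentral-F formulation for multiple regression, with u regressors and v error degrees of freedom; the response does not state which software or formula was used, so treating this as the authors' exact procedure is an assumption.

```latex
\begin{aligned}
f^2 &= \frac{R^2}{1 - R^2}, \qquad u = 7, \qquad v = N - u - 1 = 46 - 7 - 1 = 38,\\
\lambda &= f^2\,(u + v + 1) = f^2 N,\\
\text{power} &= \Pr\!\left[\, F'_{u,\,v}(\lambda) > F_{1-\alpha;\,u,\,v} \,\right], \qquad \alpha = 0.05,\\
\lambda_{\text{placebo}} &= 0.491 \times 46 \approx 22.6, \qquad
\lambda_{\text{oxytocin}} = 0.379 \times 46 \approx 17.4 .
\end{aligned}
```

      Here F′ denotes a noncentral F variable with noncentrality λ, and the critical value is the (1-α) quantile of the central F distribution with the same degrees of freedom.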

      (6) Do the authors collect reaction times (RT) information? Did the decision context and oxytocin modulate RT? Based on their procedure, it seems that the authors adopted a speeded response task, therefore the RT may reflect some psychological processes independent of choice. It is also possible (and recommended) that the authors use the drift-diffusion model to quantify latent psychological processes underlying moral decision-making. It would be interesting to see if their manipulations have any impact on those latent psychological processes, in addition to explicit choice, which is the endpoint product of the latent psychological processes. There are some examples of applying DDM to this task, which the authors could refer to if they decide to go down this route (Yu et al, 2021, How peer influence shapes value computation in moral decision-making. Cognition.)

      We did collect the RT information for this experiment. As demonstrated in the figure below, participants exhibited significantly longer RTs in the loss context than in the gain context (Study 1, main effect of decision context: F<sub>1,79</sub>=20.043, P < 0.001, η<sup>2</sup> =0.202; Study 2, placebo: F<sub>1,45</sub>=17.177, P < 0.001, η<sup>2</sup> =0.276). In addition to this effect of context, decisions were significantly slower in the other-condition than in the self-condition (Study 1, main effect of recipient: F<sub>1,79</sub>=4.352, P = 0.040, η<sup>2</sup> =0.052; Study 2, placebo: F<sub>1,45</sub>=5.601, P = 0.022, η<sup>2</sup> =0.111), which replicates previous research findings (Crockett et al., 2014). However, the difference in response time between recipients was not modulated by decision context (Study 1, context × recipient interaction: F<sub>1,79</sub>=1.538, P = 0.219, η<sup>2</sup> =0.019; Study 2, placebo: F<sub>1,45</sub>=2.631, P = 0.112, η<sup>2</sup> =0.055). Additionally, the results of the oxytocin study (study 2) revealed no evidence supporting any effect of oxytocin on reaction time. Neither the main effect (treatment: placebo vs. oxytocin) nor any interaction effect of oxytocin on response time was statistically significant (main effect of OXT treatment: F<sub>1,45</sub>=2.380, P = 0.230, η<sup>2</sup> =0.050; treatment × context: F<sub>1,45</sub>=2.075, P = 0.157, η<sup>2</sup> =0.044; treatment × recipient: F<sub>1,45</sub>=0.266, P = 0.609, η<sup>2</sup> =0.006; treatment × context × recipient: F<sub>1,45</sub>=2.909, P = 0.095, η<sup>2</sup> =0.061).

      Author response image 4.

      We also agree that it would be interesting to investigate how OXT might impact the dynamics of the decision process using a drift-diffusion model (DDM). However, we have already shown in the original manuscript that OXT increased subjects’ relative harm sensitivities. If a canonical DDM is adopted here, then such an OXT effect is most likely to correspond to an increased drift rate for the relative harm sensitivity, which we feel still aligns with the current framework in general. In future studies, including further manipulations such as time pressure might be a more comprehensive approach to investigating the effect of OXT on DDM-related decision variables such as attribute drift rate, initial bias, decision threshold and attribute synchrony.

      (7) This is just a personal preference, but I would avoid metaphoric language in a scientific paper (e.g., rescue, salvage, obliterate). Plain, neutral English terms can express the same meaning clearly (e.g., restore, vanish, eliminate).

      Again, we thank the reviewer for the suggestion and have since modified the terms.

      Reviewer #3:

      The primary weakness of the paper concerns its framing. Although it purports to be measuring "hyper-altruism" it does not provide evidence to support why any of the behavior being measured is extreme enough to warrant the modifier "hyper" (and indeed throughout I believe the writing tends toward hyperbole, using, e.g., verbs like "obliterate" rather than "reduce"). More seriously, I do not believe that the task constitutes altruism, but rather the decision to engage, or not engage, in instrumental aggression.

      We agree with the reviewer (and Reviewer #2) that plain and clear English should be used to describe our results and have since modified those terms.

      However, the term “hyperaltruism”, which is the main theme of our study, was originally proposed by a seminal paper (Crockett et al., 2014) and has since been widely adopted in related studies (Crockett et al., 2014, 2015, 2017; Volz et al., 2017; Zhan et al., 2020). The term “hyperaltruism” was introduced to emphasize the difference from altruism (Chen et al., 2024; FeldmanHall et al., 2015; Hu et al., 2021; Hutcherson et al., 2015; Lockwood et al., 2017; Xiong et al., 2020). Hyperaltruism does not indicate extreme altruism. Instead, it simply reflects the fact that “we are more willing to sacrifice gains to spare others from harm than to spare ourselves from harm” (Volz et al., 2017). In other words, altruism refers to people’s unselfish regard for or devotion to the welfare of others, whereas hyperaltruism takes the subject’s own cost-benefit preference as the reference point and highlights the “additional” altruistic preference when considering others’ welfare. For example, in the altruistic experimental design, altruism is characterized by the degree to which subjects take other people’s welfare into account (left panel). However, in a typical hyperaltruism task design (right panel), hyperaltruistic preference is operationally defined as the difference (κ<sub>other</sub> - κ<sub>self</sub>) between the degrees to which subjects value others’ harm (κ<sub>other</sub>) and their own harm (κ<sub>self</sub>).

      Author response image 5.
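
As a worked illustration of the operational definition above, the sketch below assumes the harm-aversion value function of Crockett et al. (2014), ΔV = (1 − κ)·Δm − κ·Δs, where Δm and Δs are the differences in money and shocks between the more harmful and the less harmful option. Whether the present study uses exactly this form is an assumption; the κ values, the trial amounts and the function names are hypothetical.

```python
def value_difference(kappa, delta_money, delta_shocks):
    """Subjective value of the more harmful option relative to the less harmful one."""
    return (1 - kappa) * delta_money - kappa * delta_shocks


def hyperaltruism_index(kappa_other, kappa_self):
    """Operational definition quoted above: kappa_other minus kappa_self."""
    return kappa_other - kappa_self


# Hypothetical subject who weighs harm to others more heavily than harm to self.
kappa_self, kappa_other = 0.4, 0.6
index = hyperaltruism_index(kappa_other, kappa_self)

# The same hypothetical trial: 10 extra money units for 5 extra shocks.
dv_self = value_difference(kappa_self, delta_money=10, delta_shocks=5)
dv_other = value_difference(kappa_other, delta_money=10, delta_shocks=5)

print(f"hyperaltruism index: {index:.2f}")
print(f"value of harmful option, self as recipient: {dv_self:.2f}")
print(f"value of harmful option, other as recipient: {dv_other:.2f}")
```

A positive index means that the same trade of money for shocks is less attractive when the shocks fall on someone else, which is the sense of "additional" altruistic preference described above.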

      I found it surprising that a paradigm that entails deciding to hurt or not hurt someone else for personal benefit (whether acquiring a financial gain or avoiding a loss) would be described as measuring "altruism." Deciding to hurt someone for personal benefit is the definition of instrumental aggression. I did not see that in any of the studies was there a possibility of acting to benefit the other participant in any condition. Altruism is not equivalent to refraining from engaging in instrumental aggression. True altruism would be to accept shocks to the self for the other's benefit (e.g., money). The interpretation of this task as assessing instrumental aggression is supported by the fact that only the Instrumental Harm subscale of the OUS was associated with outcomes in the task, but not the Impartial Benevolence subscale. By contrast, the IB subscale is the one more consistently associated with altruism (e.g., Kahane et al., 2018; Amormino et al., 2022). I believe it is important for scientific accuracy for the paper, including the title, to be re-written to reflect what it is testing.

      Again, as we mentioned in the previous response, hyperaltruism is a term coined almost a decade ago and has since been widely adopted in the research field. We are afraid that switching such a term would be more likely to cause confusion (instead of clarity) among readers.

      Also, from the utilitarian perspective, a gain or loss (or harm) that occurs to someone else is aligned on the same dimension and there is no discontinuity between gains and losses. Therefore, taking actions to avoid someone else’s loss can also be viewed as altruistic behavior, similar to choices increasing others’ welfare (Liu et al., 2020).

      Relatedly: in the introduction I believe it would be important to discuss the non-symmetry of moral obligations related to help/harm--we have obligations not to harm strangers but no obligation to help strangers. This is another reason I do not think the term "hyper altruism" is a good description for this task--given it is typically viewed as morally obligatory not to harm strangers, choosing not to harm them is not "hyper" altruistic (and again, I do not view it as obviously altruism at all).

      We agree with the reviewer’s point that we have a moral obligation not to harm others but no obligation to help strangers (Liu et al., 2020). In fact, this is exactly what we argued in our manuscript: by switching the decision context from gains to losses, subjects were less likely to perceive the decisions as “harming others”. Furthermore, after the administration of OXT, decisions in both the gain and loss contexts were more strongly perceived by subjects as harming others (Fig. 6A).

      The framing of the role of OT also felt incomplete. In introducing the potential relevance of OT to behavior in this task, it is important to pull in evidence from non-human animals on origins of OT as a hormone selected for its role in maternal care and defense (including defensive aggression). The non-human animal literature regarding the effects of OT is on the whole much more robust and definitive than the human literature. The evidence is abundant that OT motivates the defensive care of offspring of all kinds. My read of the present OT findings is that they increase participants' willingness to refrain from shocking strangers even when incurring a loss (that is, in a context where the participant is weighing harm to themselves versus harm to the other). It will be important to explain why OT would be relevant to refraining from instrumental aggression, again, drawing on the non-human animal literature.

      We thank the reviewer for these comments and agree that the current understanding of the link between our OT results and the animal literature can at best be described as vague and intriguing. Current literature on OT in animal research suggests that nucleus accumbens (NAc) oxytocin might play a critical role in social cognition and reinforcing social interactions (Dölen et al., 2013; Dölen & Malenka, 2014; Insel, 2010). Though much insight has already been gained from animal studies, in humans, social interactions can take a variety of different forms, and consociate recognition can also be rather dynamic. For example, male human participants with self-administered OT showed higher trust and cooperation towards in-group members but more defensive aggression towards out-group members (De Dreu et al., 2010). In another human study, participants administered OT showed more coordinated out-group attack behavior, suggesting that OT might increase in-group efficiency at the cost of harming out-group members (Zhang et al., 2019). It is worth pointing out that in both experiments, the participant’s group membership was artificially assigned, thus highlighting the context-dependent nature of OT effects in humans.

      In our experiment, more complex and higher-level social cognitive processes such as moral framing and moral perception are involved, and OT seems to play an important role in affecting these processes. Therefore, we admit that for this study, like the ones mentioned above, it is unfortunately rather hard to find a non-human animal counterpart. Instead of relating OT to instrumental aggression, we aimed to provide a parsimonious framework to explain why the “hyperaltruism” disappeared in the loss condition, and, with the OT administration, reappeared in both the gain and loss conditions while also considering the effects of other relevant variables.

      We concur with the reviewer’s comments about the importance of animal research and have since added the following paragraph into the revised manuscript (Line 86~90) as well as in the discussion:

      “Oxytocin has been shown to play a critical role in social interactions such as maternal attachment, pair bonding, consociate attachment and aggression in a variety of animal models[42,43]. Humans are endowed with higher cognitive and affective capacities and exhibit far more complex social cognitive patterns[44].”

      Another important limitation is the use of only male participants in Study 2. This was not an essential exclusion. It should be clear throughout sections of the manuscript that this study's effects can be generalized only to male participants.

      We thank the reviewer for this comment. Prior research has shown sex differences in oxytocin’s effects (Fischer-Shofty et al., 2013; Hoge et al., 2014; Lynn et al., 2014; Ma et al., 2016; MacDonald, 2013). Furthermore, given the potential confounds of the OT effect due to menstrual cycles and potential pregnancy in female subjects, most human OT studies have only recruited male subjects (Berends et al., 2019; De Dreu et al., 2010; Fischer-Shofty et al., 2010; Ma et al., 2016; Zhang et al., 2019). We have modified our manuscript to emphasize that study 2 only recruited male subjects.

      Recommendations:

      I believe the authors have provided an interesting and valuable dataset related to the willingness to engage in instrumental aggression - this is not the authors' aim, although also an important aim. Future researchers aiming to build on this paper would benefit from it being framed more accurately.

      Thus, I believe the paper must be reframed to accurately describe the nature of the task as assessing instrumental aggression. This is also an important goal, as well-designed laboratory models of instrumental aggression are somewhat lacking.

      Please see our response above: to connect better with previous research, we believe that the term hyperaltruism aligns better with the main theme of this study.

      The research literature on other aggression tasks should also be brought in, as I believe these are more relevant to the present study than research studies on altruism that are primarily donation-type tasks. It should be added to the limitations of how different aggression in a laboratory task such as this one is from real-world immoral forms of aggression. Arguably, aggression in a laboratory task in which all participants are taking part voluntarily under a defined set of rules, and in which aggression constrained by rules is mutual, is similar to aggression in sports, which is not considered immoral. Whether responses in this task would generalize to immoral forms of aggression cannot be determined without linking responses in the task to some real-world outcome.

      We agree with the reviewer that “aggression in a lab task …. is similar to aggression in sports”. Our starting point was to investigate the boundary conditions for hyperaltruism (though we don’t deny that there is an aggression component in hyperaltruism, given the experimental design we used). In other words, the dependent variable we were interested in was the difference between “other” and “self” aggression, not the aggression itself. Our results showed that by switching the decision context from the monetary gain environment to the loss condition, human participants were willing to bear similar amounts of monetary loss to spare others and themselves from harm. That is, hyperaltruism disappeared in the loss condition. We interpreted this result as indicating that the loss condition prompted subjects to adopt a different moral framework (help vs. harm, Fig. 6A) and that subjects were less influenced by their instrumental harm personality trait due to the change of moral framework (Fig. 3C). In the follow-up study (study 2), we further tested this hypothesis and verified that the administration of OT indeed increased subjects’ perception of the task as harming others for both gain and loss conditions (Fig. 6A), and such moral perception mediated the relationship between subjects’ personality traits (instrumental harm) and their relative harm sensitivities (the difference of aggression between the other- and self-conditions). We believe that the moral perception framework, in which OT directly modulates moral perception, accounts for subjects’ context-dependent choices better than hypothesizing context-dependent modulation effects of OT on aggression.

      The language should also be toned down--the use of phrases like "hyper altruism" (without independent evidence to support that designation) and "obliterate" rather than "reduce" or "eliminate" are overly hyperbolic.

      We have changed terms such as “obliterate” and “eliminate” to plain English, as the reviewer suggested.

      References

      Abu-Akel, A., Palgi, S., Klein, E., Decety, J., & Shamay-Tsoory, S. (2015). Oxytocin increases empathy to pain when adopting the other- but not the self-perspective. Social Neuroscience, 10(1), 7–15.

      Barchi-Ferreira, A., & Osório, F. (2021). Associations between oxytocin and empathy in humans: A systematic literature review. Psychoneuroendocrinology, 129, 105268.

      Berends, Y. R., Tulen, J. H. M., Wierdsma, A. I., van Pelt, J., Feldman, R., Zagoory-Sharon, O., de Rijke, Y. B., Kushner, S. A., & van Marle, H. J. C. (2019). Intranasal administration of oxytocin decreases task-related aggressive responses in healthy young males. Psychoneuroendocrinology, 106, 147–154.

      Chen, J., Putkinen, V., Seppälä, K., Hirvonen, J., Ioumpa, K., Gazzola, V., Keysers, C., & Nummenmaa, L. (2024). Endogenous opioid receptor system mediates costly altruism in the human brain. Communications Biology, 7(1), 1–11.

      Crockett, M. J., Kurth-Nelson, Z., Siegel, J. Z., Dayan, P., & Dolan, R. J. (2014). Harm to others outweighs harm to self in moral decision making. Proceedings of the National Academy of Sciences of the United States of America, 111(48), 17320–17325.

      Crockett, M. J., Siegel, J. Z., Kurth-Nelson, Z., Dayan, P., & Dolan, R. J. (2017). Moral transgressions corrupt neural representations of value. Nature Neuroscience, 20(6), 879–885.

      Crockett, M. J., Siegel, J. Z., Kurth-Nelson, Z., Ousdal, O. T., Story, G., Frieband, C., Grosse-Rueskamp, J. M., Dayan, P., & Dolan, R. J. (2015). Dissociable Effects of Serotonin and Dopamine on the Valuation of Harm in Moral Decision Making. Current Biology, 25(14), 1852–1859.

      De Dreu, C. K. W., Greer, L. L., Handgraaf, M. J. J., Shalvi, S., Van Kleef, G. A., Baas, M., Ten Velden, F. S., Van Dijk, E., & Feith, S. W. W. (2010). The Neuropeptide Oxytocin Regulates Parochial Altruism in Intergroup Conflict Among Humans. Science, 328(5984), 1408–1411.

      De Dreu, C. K. W., Gross, J., Fariña, A., & Ma, Y. (2020). Group Cooperation, Carrying-Capacity Stress, and Intergroup Conflict. Trends in Cognitive Sciences, 24(9), 760–776.

      Dölen, G., Darvishzadeh, A., Huang, K. W., & Malenka, R. C. (2013). Social reward requires coordinated activity of nucleus accumbens oxytocin and serotonin. Nature, 501(7466), 179–184.

      Dölen, G., & Malenka, R. C. (2014). The Emerging Role of Nucleus Accumbens Oxytocin in Social Cognition. Biological Psychiatry, 76(5), 354–355.

      Evans, S., Shergill, S. S., & Averbeck, B. B. (2010). Oxytocin Decreases Aversion to Angry Faces in an Associative Learning Task. Neuropsychopharmacology, 35(13), 2502–2509.

      Fehr, E., & Schmidt, K. M. (1999). A Theory of Fairness, Competition, and Cooperation. The Quarterly Journal of Economics, 114(3), 817–868.

      FeldmanHall, O., Dalgleish, T., Evans, D., & Mobbs, D. (2015). Empathic concern drives costly altruism. Neuroimage, 105, 347–356.

      Fischer-Shofty, M., Levkovitz, Y., & Shamay-Tsoory, S. G. (2013). Oxytocin facilitates accurate perception of competition in men and kinship in women. Social Cognitive and Affective Neuroscience, 8(3), 313–317.

      Fischer-Shofty, M., Shamay-Tsoory, S. G., Harari, H., & Levkovitz, Y. (2010). The effect of intranasal administration of oxytocin on fear recognition. Neuropsychologia, 48(1), 179–184.

      Glenn, A. L., Koleva, S., Iyer, R., Graham, J., & Ditto, P. H. (2010). Moral identity in psychopathy. Judgment and Decision Making, 5(7), 497–505.

      Hoge, E. A., Anderson, E., Lawson, E. A., Bui, E., Fischer, L. E., Khadge, S. D., Barrett, L. F., & Simon, N. M. (2014). Gender moderates the effect of oxytocin on social judgments. Human Psychopharmacology: Clinical and Experimental, 29(3), 299–304.

      Hu, J., Hu, Y., Li, Y., & Zhou, X. (2021). Computational and Neurobiological Substrates of Cost-Benefit Integration in Altruistic Helping Decision. Journal of Neuroscience, 41(15), 3545–3561.

      Hutcherson, C. A., Bushong, B., & Rangel, A. (2015). A Neurocomputational Model of Altruistic Choice and Its Implications. Neuron, 87(2), 451–462.

      Insel, T. R. (2010). The Challenge of Translation in Social Neuroscience: A Review of Oxytocin, Vasopressin, and Affiliative Behavior. Neuron, 65(6), 768–779.

      Kahane, G., Everett, J. A. C., Earp, B. D., Caviola, L., Faber, N. S., Crockett, M. J., & Savulescu, J. (2018). Beyond sacrificial harm: A two-dimensional model of utilitarian psychology. Psychological Review, 125(2), 131–164.

      Kahane, G., Everett, J. A. C., Earp, B. D., Farias, M., & Savulescu, J. (2015). ‘Utilitarian’ judgments in sacrificial moral dilemmas do not reflect impartial concern for the greater good. Cognition, 134, 193–209.

      Kahneman, D., & Tversky, A. (1979). Prospect Theory: An Analysis of Decision under Risk. Econometrica, 47(2), 263.

      Kapetaniou, G. E., Reinhard, M. A., Christian, P., Jobst, A., Tobler, P. N., Padberg, F., & Soutschek, A. (2021). The role of oxytocin in delay of gratification and flexibility in non-social decision making. eLife, 10, e61844.

      Kirsch, P., Esslinger, C., Chen, Q., Mier, D., Lis, S., Siddhanti, S., Gruppe, H., Mattay, V. S., Gallhofer, B., & Meyer-Lindenberg, A. (2005). Oxytocin Modulates Neural Circuitry for Social Cognition and Fear in Humans. The Journal of Neuroscience, 25(49), 11489–11493.

      Liu, J., Gu, R., Liao, C., Lu, J., Fang, Y., Xu, P., Luo, Y., & Cui, F. (2020). The Neural Mechanism of the Social Framing Effect: Evidence from fMRI and tDCS Studies. The Journal of Neuroscience, 40(18), 3646–3656.

      Liu, Y., Li, L., Zheng, L., & Guo, X. (2017). Punish the Perpetrator or Compensate the Victim? Gain vs. Loss Context Modulate Third-Party Altruistic Behaviors. Frontiers in Psychology, 8, 2066.

      Lockwood, P. L., Hamonet, M., Zhang, S. H., Ratnavel, A., Salmony, F. U., Husain, M., & Apps, M. A. J. (2017). Prosocial apathy for helping others when effort is required. Nature Human Behaviour, 1(7), 131.

      Losecaat Vermeer, A. B., Boksem, M. A. S., & Sanfey, A. G. (2020). Third-party decision-making under risk as a function of prior gains and losses. Journal of Economic Psychology, 77, 102206.

      Lynn, S. K., Hoge, E. A., Fischer, L. E., Barrett, L. F., & Simon, N. M. (2014). Gender differences in oxytocin-associated disruption of decision bias during emotion perception. Psychiatry Research, 219(1), 198–203.

      Ma, Y., Liu, Y., Rand, D. G., Heatherton, T. F., & Han, S. (2015). Opposing Oxytocin Effects on Intergroup Cooperative Behavior in Intuitive and Reflective Minds. Neuropsychopharmacology, 40(10), 2379–2387.

      Ma, Y., Shamay-Tsoory, S., Han, S., & Zink, C. F. (2016). Oxytocin and Social Adaptation: Insights from Neuroimaging Studies of Healthy and Clinical Populations. Trends in Cognitive Sciences, 20(2), 133–145.

      MacDonald, K. S. (2013). Sex, Receptors, and Attachment: A Review of Individual Factors Influencing Response to Oxytocin. Frontiers in Neuroscience, 6, 194.

      Markiewicz, Ł., & Czupryna, M. (2018). Cheating: One Common Morality for Gain and Losses, but Two Components of Morality Itself. Journal of Behavioral Decision Making, 33(2), 166–179.

      Pachur, T., Schulte-Mecklenbeck, M., Murphy, R. O., & Hertwig, R. (2018). Prospect theory reflects selective allocation of attention. Journal of Experimental Psychology: General, 147(2), 147–169.

      Radke, S., Roelofs, K., & De Bruijn, E. R. A. (2013). Acting on Anger: Social Anxiety Modulates Approach-Avoidance Tendencies After Oxytocin Administration. Psychological Science, 24(8), 1573–1578.

      Saez, I., Zhu, L., Set, E., Kayser, A., & Hsu, M. (2015). Dopamine modulates egalitarian behavior in humans. Current Biology, 25(7), 912–919.

      Teoh, Y. Y., Yao, Z., Cunningham, W. A., & Hutcherson, C. A. (2020). Attentional priorities drive effects of time pressure on altruistic choice. Nature Communications, 11(1), 3534.

      Tom, S. M., Fox, C. R., Trepel, C., & Poldrack, R. A. (2007). The neural basis of loss aversion in decision-making under risk. Science, 315(5811), 515–518.

      Usher, M., & McClelland, J. L. (2004). Loss Aversion and Inhibition in Dynamical Models of Multialternative Choice. Psychological Review, 111(3), 757–769.

      Volz, L. J., Welborn, B. L., Gobel, M. S., Gazzaniga, M. S., & Grafton, S. T. (2017). Harm to self outweighs benefit to others in moral decision making. Proceedings of the National Academy of Sciences of the United States of America, 114(30), 7963–7968.

      Wu, Q., Mao, J., & Li, J. (2020). Oxytocin alters the effect of payoff but not base rate in emotion perception. Psychoneuroendocrinology, 114, 104608.

      Wu, S., Cai, W., & Jin, S. (2018). Gain or non-loss: The message matching effect of regulatory focus on moral judgements of other-orientation lies. International Journal of Psychology, 53(3), 223–227.

      Xiong, W., Gao, X., He, Z., Yu, H., Liu, H., & Zhou, X. (2020). Affective evaluation of others’ altruistic decisions under risk and ambiguity. Neuroimage, 218, 116996.

      Yechiam, E., & Hochman, G. (2013). Losses as modulators of attention: Review and analysis of the unique effects of losses over gains. Psychological Bulletin, 139(2), 497–518.

      Zhan, Y., Xiao, X., Tan, Q., Li, J., Fan, W., Chen, J., & Zhong, Y. (2020). Neural correlations of the influence of self-relevance on moral decision-making involving a trade-off between harm and reward. Psychophysiology, 57(9), e13590.

      Zhang, H., Gross, J., De Dreu, C., & Ma, Y. (2019). Oxytocin promotes coordinated out-group attack during intergroup conflict in humans. eLife, 8, e40698.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Reply to the reviewers

      General Statement

      *Our lab was totally destroyed on June 15th by an Iranian missile. All stocks, equipment and reagents were lost. While we performed many of the experiments requested by the reviewers, unfortunately some were never completed. We thank you for your understanding.*

      We thank the three reviewers for their thoughtful comments and useful suggestions on how to improve our paper. Some of the reviewers claimed that the paper is “preliminary”. We would like to highlight that in our opinion “preliminary” has two possible meanings in this context: 1) the data does not yet support the claims that the authors wrote; 2) the story is short and should be extended. While we totally agree that type 1 “preliminary” should be addressed (and we have addressed that to the best of our abilities), type 2 “preliminary” is a matter of scope, the length of the paper/project and the publication home. We believe that this story, which has been led by an outstanding master’s student (and as such has had a limited timespan), is worthy of publication in its current scope.

      2. Point-by-point description of the revisions

      Reviewers’ comments are in BLUE while our responses are in BLACK.

      Reviewer 1 Summary: This study reports a role for matrix metalloproteinases (MMPs) in the developmental pruning of gamma Kenyon cells (KCs) in the fruit fly Mushroom Body during larval-pupal metamorphosis. The authors show through gene expression studies that MMP genes are upregulated in late larval stages as part of the early program for this type of neuronal pruning. They show through cell-targeted RNAi studies of both secreted MMP-1 and membrane-anchored MMP-2, that both genes are required in glial cells and to a lesser extent within KCs.

      Both MMPs have secreted and membrane-anchored isoforms and we did not assess whether the secreted/anchored isoforms are involved; e.g. see LaFever et al. 2017.

      The authors show that MMP secreted from glia is required for normal levels of Mushroom Body developmental neuronal pruning. They mention that MMP genes have been identified in screens of schizophrenic patients, and that perhaps a comparable pruning mechanism could be involved in the loss of grey matter (loss of synapses) in patients. The authors propose that MMP levels may be a potential therapeutic marker in the future.

      We thank the reviewer for these comments. We find it important to clarify that we do not think our work suggests that MMP levels may be a potential therapeutic marker without much additional work in the future. In the original text we included a claim from another paper suggesting MMPs as a therapeutic target. However, due to the resulting confusion, we decided to delete this statement from the text (original line 198). We also added a general disclaimer towards the end of the discussion regarding the genetic power of Drosophila but its limited implications for human health (new lines 276-278).

      Major Comments: Overall, the work is of a reasonable standard, but very preliminary

      Please see general note on two types of “preliminary” – we thank the reviewer for helping us substantiate our claims and strengthen our paper but we do not plan to significantly increase its scope.

      The study lacks the substance to completely convince me of any of the results. There is SUBSTANTIAL work that needs to be done to make this publishable. There are a lot of writing mistakes; so many that I do not list them in detail here

      We are not absolutely sure that we understand which mistakes this reviewer is alluding to. However, we carefully rewrote the manuscript, streamlined many of our claims and added many new and more recent references.

      The reference citations are fairly old, but I do not list updated replacements here

      Thanks – we added many newer and relevant citations.

      The text is very brief, and the overall writing needs to include significantly more description and detail

      We have included more descriptions and details, as will be elaborated later on, but, again, this is a short report and will remain as such.

      This is evident in all aspects of the manuscript, but especially notable in the Methods and Figure Legends

      Thanks for raising this comment, which was echoed by other reviewers. We have now included more details, with a particular focus on the genotypes (Table 2), which were erroneously omitted from the original submission, as well as more detailed figure legends.

      None of the Figure Legends include full genotypes of any of the fly lines, and these full fly lines are also not included in the Methods. This is vital to compare the experimental lines to the controls

      True – our apologies for this mistake; we have now added the full genotypes in Table 2.

      Major points are listed below:

      1. Figure 2: It is important to note the specific age of animals in these images when talking about the loss of genes in development. Are all the animals age-matched? High levels of synaptic pruning occur post-eclosion, and it is important to understand when these pruning defects occur. It is mentioned that the overlap for the gene expression data is upregulated during 6-18h APF; is this when these images are taken? This is very important in the context of pruning as SCZ symptom presentation is very late relative to these early events.

      We thank the reviewer for this comment, which suggests we were not clear enough in our description. We do not claim to have generated an SCZ model and have clarified this better in the text (lines 275-278). Furthermore, axon pruning happens during pupal development, but in all the main figures in this manuscript we dissected young adult flies (3-5 days post eclosion) and show the remnants of unpruned axons (as we have done in numerous studies). To make sure that initial development occurred normally, we also include larval brains in Figure S7. We have now clarified the fact that we are imaging adult brains as a readout to investigate whether pruning occurred during metamorphosis or not (lines 124-126).

      2. Figure 2: In the figure legend, it is indicated that the arrows are unpruned axons, however in the controls these areas appear to be highly innervated. Further explanation is needed about the context of the arrows, as there are clear visual differences between these images and the controls, but they appear to have a more expansive phenotype than "unpruned axons". The data does not match the visual representation in comparison to the control.

      We apologize for this confusion. Unfortunately, the driver which we use to label the γ-axons, R71G10-QF2, is not absolutely specific to the γ type KCs but is also expressed (sometimes) in the ɑ/β KCs. As the ɑ/β axons are very stereotypic in shape and also express high levels of FasII (which we stain for), we can easily distinguish between the ɑ lobe and unpruned γ axons. To clarify this point, we now clearly demarcate all lobes in the control images and specifically the ɑ lobe in all panels. Additionally, we added new schemes in Figure 2A and 2O to better clarify the anatomy and experimental design.

      3. Figure 2: There need to be more descriptive definitions and clarifications of the defects labeled in panel K. This could be done in the figure legend, but it would be more useful to label the images provided. For example, if Mmp2 is a "mild pruning defect", put that in the pie chart somewhere, to help guide the description of the phenotype to what those confocal images look like.

      We understand that the pie chart in Figure 2 was confusing and therefore simplified it in the current version (Fig. 2B and 2P). Also, thanks to this great point, we now include a new Figure S3 that includes examples for the ranking categories, which were now ranked by two independent investigators in a blinded manner.

      Figure 3: The time points of the images of the Mushroom Body (MB) are vital to understanding the process and regulation of these genes.

      Please see our comment to point #1 – unless specifically stated otherwise, all images are MBs of adult flies, as now clearly mentioned in the figure legends, in the text and in the Material and Methods section.

      1. Figure 3D: Significant description of this graph needs to be added for clarity. What parameters separate each phenotypic defect? Labeling the images and showing images that belong in different groups would be very helpful and improve the paper significantly.

      We now included a new Figure S3 (also see our response to comment #3).

      1. Figure S1: Additional experiments would help answer the strength of the phenotype for the ALG-Gal 4 driver. The authors need to perform the rescue experiment. Use a MMP-2 null and then drive it back in the ALG-GAL4 to see if this is sufficient to rescue the neuron pruning. This also isolates the mechanisms to one subtype of glia.

      These are excellent suggestions that are, unfortunately, not doable. To perform a rescue experiment, one would need a viable loss-of-function phenotype of an Mmp2 mutant. There is one published Mmp2 loss-of-function null allele which is lethal during pupal development (Page-McCaw et al, 2003). Our previous data, using tissue specific (ts)CRISPR, suggested the involvement of Mmp2 in neurons for their remodeling (Meltzer et al, 2019). We therefore independently generated an Mmp2 germline mutant using CRISPR (harboring an indel resulting in a premature stop codon and predicted to encode a truncated, 77 amino-acid long protein), now described in Fig. S5A (and in the Materials and Methods). This allele is, as expected, unfortunately also lethal. We attempted to overcome lethality by generating MARCM (mosaic) clones in neurons, but as expected, because Mmp2 is largely secreted, there was no pruning defect phenotype (Fig. S5B-C). Unfortunately, it is not yet possible to generate glial clones.

      Figure 3 and 4: The other glial subtypes need to be analyzed to make any conclusion about their involvement, as well as the involvement of the astrocytes. Running these exact same experiments on the cortex glia and ensheathing glia will provide essential insight into what glial subtype is involved. The presumed lack of phenotypes in these other glial subtypes will also strengthen the argument that the astrocytes are specifically involved in this process. These are vital experiments.

      We currently limited our analysis (and conclusions) to astrocytes. Despite the fact that this experiment is beyond our initial scope, we obtained reagents and performed preliminary experiments (using the R77A03-Gal4 driver for cortex glia, and the R83E12-Gal4 for ensheathing glia). In both cases, we observed extremely mild pruning defects, not comparable to those with Repo- or Alrm-Gal4. In these preliminary experiments we lacked a proper control, and now, unfortunately, due to the loss of our lab, we are unable to complete these experiments in a reasonable amount of time.

      1. Figure 4: Again, description of the phenotypes and examples of these would improve the quality of this figure substantially.

      Absolutely agree – see our response to comment #3 (and Fig. S3).

      1. Figure 5: An improvement on the quantifications of these phenotypes would strengthen the paper substantially. More detailed description of the phenotypes and how they related to the control would significantly improve the overall quality of the work.

      Thanks again for highlighting that we neglected to include the full genotypes that are now added (Table 2). We also thank the reviewer for raising the point regarding quantification. First, we generated a new Fig. S3A-E to show examples of the ranking by two independent rankers. Second, ranking was performed by looking at TdTomato positive vertical axons that are outside of the ɑ lobe (high FasII) – this is now better explained in the materials and methods. Additionally, while we would love to have a better, automatic scoring system – and even published a semi-automated scoring algorithm in Alyagor et al. 2018 (Figure 3O in the Alyagor paper) – this quantification method does not always work, because the driver also labels vertical axons (ɑ/β) and because unpruned γ axons often express FasII. What we have done in previous cases, as we have also done here, is to provide independent ranking by two investigators and compare their ranking (Fig. S3F-G). Finally, we are working with our AI hub to develop automatic scoring systems that will not require human ranking – however, this is beyond the scope of this manuscript.

      Minor Comments: 1. Figure 1A: I would suggest labeling the KC (gamma) and potentially one of the others (a/B, a'/B') to orient the reader to the differences between these two subsets of the KCs, and to emphasize which neurons are undergoing pruning and where the cell bodies are and where the axons project.

      Thanks for the suggestions – we now better annotated the scheme in Figure 1A as well as additional schematics in Figure 2 and, finally, better annotations in selected panels. Specifically, the ɑ lobe is outlined in magenta throughout all relevant panels.

      1. Figure 1C: This panel needs further labeling to explain the findings in the heat map. Labeling some of the genes that were found and where they were would be helpful. This could also be done in the figure legend, however without any further labeling or context the heatmap is confusing.

      We apologize for the incomplete figure. We did not want to overload the figure with data, which is why we are showing only the important clusters and did not include gene names. To keep the figure simple, but at the same time provide the complete information, we now include the full data in Fig. S1 (that includes the original heatmap with all the dynamic clusters I-IX, and including all the gene names). For the full raw data, including non-dynamic clusters, the reader is referred to look in Supplemental excel file 1. We hope this provides the clarity that this reviewer rightfully asks for.

      1. Figure 3B,C: The full genotypes need to be labeled. What is the exact genotype used for the control?

      The full genotypes of all figure panels are now included in Table 2 in the Materials and Methods.

      1. Figure S1: The stock number for the ALG-GAL4 is missing, there are multiple different drivers, therefore this could be helpful in understanding this phenotype, as some are better than others.

      Indeed, Alrm-Gal4 comes on two chromosomes – we used BDSC #67032, which is on chromosome III, and this is now clearly mentioned in the Materials and Methods section.

      1. Figures 3 and 4: Labeling needs to remain consistent; Figure 3 "Glia-Gal4", Figure 4 "glia-gal4".

      Thanks, done.

      Reviewer #1 (Significance (Required)):

      General Assessment: An interesting study on MMP function during an unusual type of neural development (axon pruning). Most of the MMP function appears to be in glia, although the MMP role in this context is unclear. The MMP function in the neurons being pruned is unexpected and even less clear. The study is somewhat poorly described in terse language lacking essential information, which gives the overall impression of a preliminary report.

      Advance: Glial MMP function has been described for neuronal clearance mechanisms following injury. The main advance here is to describe a similar function during normal development. Audience: Developmental neuroscientists, MMP biologists, possibly schizophrenia clinician researchers

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Neuropsychiatric conditions are often influenced by genetic factors. Schizophrenia is a complex mental disorder characterised by a mixture of hallucinations, delusions and disorganised thinking that causes lifelong problems in daily life. GWAS have identified a number of genes associated with the risk of developing schizophrenia, although genetic predisposition alone is not sufficient and additional environmental factors are required. In the current manuscript, the authors aim to exploit the strength of the Drosophila system to explore a link between schizophrenia-associated genes and neuronal remodelling during development. They focus on the mushroom body in the adult brain, where pronounced neuronal remodelling occurs during metamorphosis. To assess the potential role of the genes identified by the GWAS, they performed a targeted RNAi-based screen. They focus on the role of metalloproteases and find that they are required in neurons and in glia for the pruning of mushroom body axons. The study starts with a selection of 32 genes, 29 of which are listed (a bit hidden) in materials and methods and the identification of the Drosophila orthologs. The expression patterns of these genes in Kenyon cells are presented in Figure 1 - but unfortunately no information is given on who is expressed when

      We apologize for the confusion. We attempted to keep Figure 1 simple but this resulted in the absence of critical information, as the reviewer suggests. We now include a Figure S1 that includes the entire heatmap of the dynamically expressed clusters I-IX with all the gene names. Additionally, we now augmented the information in Table 1 to include the screen phenotypes. Finally, Supplemental excel file 1, also included in our original submission, includes all the data, and is now better referred to throughout the text.

      In a next step, Kenyon cell specific RNAi knockdown experiments are shown that identify a pruning phenotype for several genes. They demonstrate that Mmp2 (and similarly Mmp1) is also required in glia. Although Mmp2 was identified by neuronal RNAi-based knockdown, double knockdown experiments led the authors to conclude that its primary function is in glia. The study emphasises the use of the advanced genetic model to understand complex human diseases. However, the paper does not go far enough in making use of the excellent genetics available. Basically, the report is about the identification of a few hits in a small RNAi screen, which is fine in itself, but leaves many questions unanswered. Do mmp1/2 mutants have a phenotype?

      This is a very important question that cannot be answered, unfortunately. There is one published Mmp2 loss-of-function null allele which is lethal during pupal development (Page-McCaw et al, 2003). Our previous data, using tissue specific (ts)CRISPR, suggested the involvement of Mmp2 in neurons for their remodeling (Meltzer et al, 2019). We therefore independently generated an Mmp2 germline mutant using CRISPR (harboring an indel resulting in a premature stop codon and predicted to encode a truncated, 77 amino-acid long protein), now described in Fig. S5A (and in the Materials and Methods). This allele is, as expected, unfortunately also lethal. We attempted to overcome lethality by generating MARCM (mosaic) clones in neurons, but as expected, because Mmp2 is largely secreted, there was no pruning defect phenotype (Fig. S5B-C). Unfortunately, it is not yet possible to generate glial clones. Additionally, available Mmp1 mutants are, sadly, also homozygous lethal. That said, in our revised manuscript we now include data demonstrating that expression of a dominant negative variant of Mmp1 inhibits pruning (Fig. 3J-K). We strengthened the evidence regarding the reliability of Mmp1 RNAi using an antibody mix (Fig. S4), and for Mmp2 – we refer to a manuscript that tested its efficiency (Harmansa et al., 2023). Lastly, we added new data using an additional RNAi line targeting Mmp2 from the VDRC collection (Fig. 3L).

      Can the phenotype be rescued?

      Unfortunately, without a viable mutant LOF phenotype, a rescue experiment is impossible. Regardless, in an attempt to rescue the RNAi phenotype, we designed and generated an RNAi-resistant Mmp2 overexpression transgene. Unfortunately, due to the destruction of our lab – several days after we received this transgenic line from Bestgene – this experiment is not included in the revision.

      Does TIMP expression lead to similar phenotypes?

      This is an interesting question which we addressed in our experiments but did not include in the text. Unfortunately, overexpression of TIMP did not have any effect on MB development. We are adding this figure here as Reviewer Figure 1, but we think that adding this information to the paper will not improve it for several reasons. The lack of phenotype by overexpression of Timp can result from a technical issue such as low expression or mislocalization of the protein, or a biological issue such as more complicated involvement of TIMP or other MMP inhibitors.

      What is the temporal requirement for Mmp1/2?

      This is an excellent suggestion and not an easy experiment, but one that we initiated, using a temperature sensitive Gal80 to control the expression of the RNAi only during metamorphosis. However, due to the unfortunate destruction of our lab, this experiment was never completed.

      What are the target proteins of Mmp2?

      This is the million-dollar question – but unfortunately is beyond the scope of this short report.

      Is Mmp2 still required when astrocyte motility is blocked? What is the morphology of glia after Mmp1/2 knockdown?

      Thank you for this wonderful suggestion. We initiated two types of experiments using sparse labeling techniques (both MARCM and SPARC) to identify the morphology of single astrocytes in WT vs. MMP KD. However, these are complicated crosses that were not completed prior to the destruction of our lab.

      Reviewer #2 (Significance (Required)):

      The strength of the study is to identify a pruning phenotype after RNAi-based knockdown. The limitation is that this study is very superficial; it is the beginning of a paper. The initial claim to use Drosophila because of its advanced genetics is not met. The results section is shorter than the discussion.

      While we agree with much of the reviewer’s statement this also relates to our general comment about “preliminary” type 1 and type 2 – True, this could be the beginning of a big paper and it would definitely be a more comprehensive and deep story. Most of the papers from my lab are indeed a 5 year endeavor. However, this short report (which is now longer, more detailed, and includes additional experiments) is a result of the work of an outstanding master’s student who came up with the idea for the project entirely by herself. Thus – given the data that she has acquired, and the fact that my lab will not continue to study MMPs or schizophrenia, the question needs to be whether the data supports the claims and whether this is an advance of science worthwhile of publication in a respectable journal. Our clear and decisive opinion is that the answer to that question is yes.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      In this work, Schuldiner and colleagues explore the role of Mmp1 and Mmp2 in neuronal remodeling in the mushroom body of Drosophila. Overall, this work is very interesting, but in its current form seems quite preliminary. The biggest limitation of the study is that single RNAi lines are used with no validation that the lines are working, despite the fact that Mmp antibodies are available as are endogenously tagged Mmp lines that could have been used to validate the genetic manipulations. Specific concerns are listed below.

      We thank reviewer 3 for his generally positive assessment of our work and we now performed additional experiments to strengthen and validate the original RNAi findings – for specifics see our reply to the points below.

      Major concerns 1) The scoring system for pruning of mushroom body neurons seems very variable, even in controls (where scoring can range from very mild to moderate), and it is very hard to assess from the images what one is looking at (rather than using our own judgment, we rely on the authors' words). It would be necessary to have better labeling and examples of what phenotypes are considered "mild", "severe", "wild type-like". It would also help to understand how phenotype assessment is guided by the overlap between the signals from TdTomato fluorescence and FasII stain.

      We thank the reviewer for raising this point, that has also been highlighted by other reviewers in some form. First, we have generated Figure S3A-E to show examples of the ranking, which was now performed by two independent investigators. Second, ranking was performed by looking at TdTomato positive vertical axons that are outside of the α lobe (high FasII) – this is now better explained in the materials and methods. Additionally, while we would love to have a better, automatic scoring system – and even published a semi-automated scoring algorithm in Alyagor et al. 2018 (Figure 3O in the Alyagor paper) – this quantification method does not always work, because the driver also labels vertical axons (ɑ/β) and because unpruned γ axons often express FasII. What we have done in previous cases, as we have also done here, is to provide independent ranking by two investigators and compare their ranking (Fig. S3F-G). Finally, we are working with our AI hub to develop automatic scoring systems that will not require human ranking – however, this is beyond the scope of this manuscript.

      2) The biggest limitations of the approach are that single RNAi lines are used to screen, with no accompanying validation of the tool (see above)

      We agree. Unfortunately, not all RNAis are “equal” and thus not all of them work. To support the RNAi data, we have better clarified previous experiments that demonstrate the importance of neuronal Mmp2 via tissue specific (ts) CRISPR (Meltzer, et al, 2019). Unfortunately, the Mmp2 null mutant that is available is lethal during pupal development (Page-McCaw et al, 2003). We therefore independently generated an Mmp2 germline mutant using CRISPR (harboring an indel resulting in a premature stop codon and predicted to encode a truncated, 77 amino-acid long protein), now described in Fig. S5A (and in the Materials and Methods). This allele is, as expected, unfortunately also lethal. We attempted to overcome lethality by generating MARCM (mosaic) clones in neurons, but as expected, because Mmp2 is largely secreted, there was no pruning defect phenotype (Fig. S5B-C). Unfortunately, it is not yet possible to generate glial clones. Additionally, available Mmp1 mutants are, sadly, also homozygous lethal. That said, in our revised manuscript we now include data demonstrating that expression of a dominant negative variant of Mmp1 inhibits pruning (Fig. 3J-K). We strengthened the evidence regarding the reliability of Mmp1 RNAi using an antibody mix (Fig. S4), and for Mmp2 – we refer to a manuscript that tested its efficiency (Harmansa et al., 2023). Lastly, we added new data using an additional RNAi line targeting Mmp2 from the VDRC collection (Fig. 3L).

      3) RNAi-based knockdown is used to infer epistatic information; this is not appropriate, as epistasis experiments need to be done with null alleles to make firm conclusions. Additional concerns:
      ● Even with the same driver, knockdown efficiency for 2 different genes could be variable and dependent on the specific RNAi used.
      ● The comparison between drivers is even harder, as driver strength varies greatly.
      ● The knockdown efficiency drops with increasing numbers of RNAi used.
      ● The specific genotypes used for this experiment should be clarified, as it would be very important to ensure that the UAS dosage is equal across conditions.

      We agree that RNAi is not optimal to assess epistasis. Indeed, we did not mean to claim an epistatic relationship between Mmp1 and Mmp2, nor between neurons and glia. We now use more precise language to clarify this. Defining epistatic relationships would require mutants; unfortunately, the use of nulls is not possible because they are lethal and because the proteins are secreted (thus not enabling mosaic analyses). We agree that increasing the number of RNAi lines is expected to reduce their efficiency – this is why it is even more significant when we see an increased defective phenotype in the double knockdown experiments. Finally, we totally agree about the genotype comment and apologize that the genotypes were erroneously omitted in the original submission – all of them have now been added (Table 2 in materials and methods).

      4) To further deepen the rigor of this work, a few simple yet important things could have been done. First, it would be important to rule out that knocking down Mmps does not affect astrocyte numbers and health (could be assessed by counting numbers and observing their morphology). Also, the authors previously showed that astrocytes actively infiltrate the axon bundle prior to pruning to facilitate axon defasciculation and pruning (Marmor-Kollet et al., 2023). It would have provided an important insight to examine if astrocytes can infiltrate the axon bundle if Mmp2 and/or Mmp1 are knocked down.

      Thank you for these wonderful suggestions. We embarked on a few experiments as detailed below, unfortunately these are complicated crosses that were not completed prior to the destruction of our lab. 1) We initiated two types of experiments using sparse labeling techniques (both MARCM and SPARC) to identify the morphology of single astrocytes in WT vs. MMP KD. 2) Testing astrocytic infiltrations requires three binary systems, we obtained and generated stocks required for these experiments, but these were prematurely terminated. 3) We initiated experiments to count the number of glial nuclei in the vicinity of the degenerating axonal lobe (at the onset of pruning). Preliminary experiments with a small n (3 controls, 4 Mmp1 RNAi, and 5 Mmp2 RNAi) suggest that the number of glial nuclei is not significantly different between these conditions.

      Minor The introduction puts big emphasis on the role of glia, but then, to narrow down candidate genes for the screen, a γ-KCs transcriptional data set is used, and the initial screen is done via knockdown of those candidates in neurons (there is a disconnect between rationale and approach).

      We totally agree with this reviewer which is why we now changed the paper to include both neuronal and glial loss-of-function screens. Figure 1 is now augmented with the glial data.

      Rationale for looking into axon pruning and how that translates into insights about synaptic pruning defects in schizophrenia should be more clearly stated.

      Indeed, our belief that synapse pruning and axon pruning share molecular mechanisms remains yet unproven. However, both are steps during neuronal remodeling, which has been previously implicated in schizophrenia. That said, we now added an additional disclaimer to acknowledge the limitation of our findings in the context of human disease and synapse elimination (lines 275-279).

      Figure 1C: data visualization for this heat map should be improved. Parts of the data are faded, and the differences between gene clusters are unclear.

      We apologize for the incomplete figure. We did not want to overload the figure with data, which is why we are showing only the important clusters and did not include gene names. To keep the figure simple, but at the same time provide the complete information, we now include the full data in Fig. S1 (that includes the original heatmap with all the dynamic clusters I-IX, and including all the gene names). For the full raw data, including non-dynamic clusters, the reader is referred to look in Supplemental excel file 1. We hope this provides the clarity that this reviewer rightfully asks for.

      Reviewer #3 (Significance (Required)):

      In this work, Schuldiner and colleagues explore the role of Mmp1 and Mmp2 in neuronal remodeling in the mushroom body of Drosophila. Overall, this work is very interesting, but in its current form seems quite preliminary. The biggest limitation of the study is that single RNAi lines are used with no validation that the lines are working, despite the fact that Mmp antibodies are available as are endogenously tagged Mmp lines that could have been used to validate the genetic manipulations.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      We thank the reviewers for providing us the opportunity to revise our manuscript titled “Identifying regulators of associative learning using a protein-labelling approach in C. elegans.” We appreciate the insightful feedback that we received to improve this work. In response, we have extensively revised the manuscript with the following changes: we have (1) clarified the criteria used for selecting candidate genes for behavioural testing, presenting additional data from ‘strong’ hits identified in multiple biological replicates (now testing 26 candidates, previously 17), (2) expanded our discussion of the functional relevance of validated hits, including providing new tissue-specific and neuron class-specific analyses, and (3) improved the presentation of our data, including visualising networks identified in the ‘learning proteome’, to better highlight the significance of our findings. We also substantially revised the text to indicate our attempts to address limitations related to background noise in the proteomic data and outlined potential refinements for future studies. All revisions are clearly marked in the manuscript in red font. A detailed, point-by-point response to each comment is provided below.

      1. Point-by-point description of the revisions

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      Summary:

      Rahmani et al., utilize the TurboID method to characterize the global proteome changes in the worm's nervous system induced by a salt-based associative learning paradigm. Altogether, Rahmani et al., uncover 706 proteins that are tagged by the TurboID method specifically in samples extracted from worms that underwent the memory inducing protocol. Next, the authors conduct a gene enrichment analysis that implicates specific molecular pathways in salt-associative learning, such as MAP-kinase and cAMP-mediated pathways. The authors then screen a representative group of the hits from the proteome analysis. The authors find that mutants of candidate genes from the MAP-kinase pathway, namely dlk-1 and uev-3, do not affect the performance in the learning paradigm. Instead multiple acetylcholine signaling mutants significantly affected the performance in the associative memory assay, e.g., acc-1, acc-3, gar-1, and lgc-46. Finally, the authors demonstrate that the acetylcholine signaling mutants did not exhibit a phenotype in similar but different conditioning paradigms, such as aversive salt-conditioning or appetitive odor conditioning, suggesting their effect is specific to appetitive salt conditioning.

      Major comments:

      1. The statistical approach and analysis of the behavior assay: The authors use a 2-way ANOVA test which assumes normal distribution of the data. However, the chemotaxis index used in the study is bounded between -1 and 1, which prevents values near the boundaries from being normally distributed.

      Since most of the control data in this assay in this study is very close to 1, it strongly suggests that the CI data is not normally distributed and therefore 2-way ANOVA is expected to give skewed results.

      I am aware this is a common mistake and I also anticipate that most conclusions will still hold also under a more fitting statistical test.

      We appreciate the point raised by Reviewer 1 and understand the importance of performing the correct statistical tests.

      The statistical tests used in this study were chosen since parametric tests, particularly ANOVA tests to assess differences between multiple groups, are commonly used to assess behaviour in the C. elegans learning and memory field. Below is a summary of the tests used by studies that perform similar behavioural tests cited in this work, as examples:

      Table 1 | A summary of the statistical tests performed by similar studies for chemotaxis assay data. References (listed in the leftmost column) were observed to (A) use parametric tests only or (B) perform either a parametric or non-parametric test on each chemotaxis assay dataset depending on whether the data passed a normality test. Listings for ANOVA tests are in bold to demonstrate their common use in the C. elegans learning and memory field.

      | Reference | Parametric test/s used in the reference | Non-parametric test/s used in the reference |
      |---|---|---|
      | Beets et al., 2020 | Two-way ANOVA | None |
      | Hiroki & Iino 2022 | One-way ANOVA | None |
      | Hiroki et al., 2022 | One-way ANOVA | None |
      | Hukema et al., 2006 | T-tests | None |
      | Hukema et al., Learn. Mem. 2008 | T-tests | None |
      | Jang et al., 2019 | ANOVA | None |
      | Kitazono et al., 2017 | Two-way ANOVA and t-tests | None |
      | Lans et al., 2004 | One-way ANOVA | None |
      | Lim et al., 2018 | Two-way ANOVA | Wilcoxon rank sum test adjusted with the Benjamini–Hochberg method |
      | Lin et al., 2010 | Two-way or three-way ANOVA | None |
      | Nagashima et al., 2019 | One-way ANOVA | None |
      | Ohno et al., 2014 | | None |
      | Sakai et al., 2017 | One-way ANOVA or t-tests | None |
      | Stein & Murphy 2014 | Two-way ANOVA and t-tests | None |
      | Tang et al., 2023 | One-way ANOVA or t-tests | None |
      | Tomioka et al., 2006 | T-tests | None |
      | Watteyne et al., 2020 | One-way ANOVA | Two-sided Kruskal–Wallis |

      We note Reviewer 1's concern that this may stem from a common mistake. As stated, two-way ANOVA generally relies on normally distributed data. We used GraphPad Prism to perform the Shapiro-Wilk normality test on our chemotaxis assay data as it is generally appropriate for sample sizes

      Table 2 | Shapiro-Wilk normality test results for chemotaxis assay data in Figure S8C. Chemotaxis assay data was generated to assess salt associative learning capacity for wild-type (WT) versus lgc-46(-) mutant C. elegans. Three experimental groups were prepared for each C. elegans strain (naïve, high-salt control, and trained). From top to bottom, the data below displays the ‘W’ value, ‘P value’, a binary yes/no for whether the data passes the Shapiro-Wilk normality test, and a ‘P value summary’ (ns = non-significant). W values measure the similarity between a normal distribution and the chemotaxis assay data. Data is considered normal in the Shapiro-Wilk normality test when a W value is near 1.0 and the null hypothesis is not rejected (i.e., P value > 0.05).

      | | WT naïve | WT high-salt control | WT trained | lgc-46 naïve | lgc-46 high-salt control | lgc-46 trained |
      |---|---|---|---|---|---|---|
      | W | 0.9196 | 0.9114 | 0.8926 | 0.8334 | 0.8151 | 0.8769 |
      | P value | 0.5272 | 0.4758 | 0.3705 | 0.1475 | 0.1070 | 0.2954 |
      | Passed normality test (alpha=0.05)? | Yes | Yes | Yes | Yes | Yes | Yes |
      | P value summary | ns | ns | ns | ns | ns | ns |

      The manuscript now includes the use of the Shapiro-Wilk normality test to assess chemotaxis assay data before using two-way ANOVA on page 51.
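
      For reference, the W statistic reported in Table 2 has the following standard textbook definition, which is not stated in the response itself. Here the x_(i) are the observations sorted in ascending order, x̄ is the sample mean, n is the number of observations in a group, and the a_i are fixed coefficients derived from the expected order statistics of a standard normal sample:

      ```latex
      W = \frac{\left( \sum_{i=1}^{n} a_i \, x_{(i)} \right)^{2}}{\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^{2}}
      ```

      W lies between 0 and 1, and values close to 1 indicate that the ordered data track the pattern expected under normality, which matches the reading of the W row given in the Table 2 legend.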

      Nevertheless an appropriate statistical analysis should be performed. Since I assume the authors would wish to take into consideration both the different conditions and biological repeats, I can suggest two options:

      • Using a Generalized linear mixed model, one can do with R software.
      • Using a custom bootstrapping approach.

      We thank Reviewer 1 for suggesting these two options. We carefully considered both approaches and consulted with the in-house statistician at our institution (Dr Pawel Skuza, Flinders University) for expert advice to guide our decision. In summary:

      • Generalised linear mixed models: Generalised linear mixed models (GLMMs) are generally most appropriate for nested/hierarchical data. However, our chemotaxis assay data does not exhibit such nesting. Each biological replicate (N) consists of three technical replicates, which are averaged to yield a single chemotaxis index per N. Our statistical comparisons are based solely on these averaged values across experimental groups, making GLMMs less applicable in this context.

      • Bootstrapping: Based on advice from our statistician, while bootstrapping can be a powerful tool, its effectiveness is limited when applied to datasets with a low number of biological replicates (N). Bootstrapping relies on resampling existing data to simulate additional observations, which may artificially inflate statistical power and potentially suggest significance where the biological effect size is minimal or not meaningful. Increasing the number of biological replicates to accommodate bootstrapping could introduce additional variability and compromise the interpretability of the results.

      The total number of assays, especially controls, varies quite a bit between the tested mutants. For example compare the acc-1 experiment in Figure 4.A., and gap-1 or rho-1 in Figure S4.A and D. It is hard to know the exact N of the controls, but I assume that for example, lowering the wild type control of acc-1 to equivalent to gap-1 would have made it non significant. Perhaps the best approach would be to conduct a power analysis, to know what N should be acquired for all samples.

      We thoroughly evaluated performing a power analysis; however, this is typically performed with the assumption that N = 1 represents a single individual. In this study, N = 1 is one biological replicate that includes hundreds of worms, which is why power analysis is not typically employed in our field for this type of behavioural test.

      Considering these factors, we have opted to continue using a two-way ANOVA for our statistical analysis. This choice aligns with recent publications that employ similar experimental designs and data structures. Crucially, we have verified that our data meet the assumptions of normality, addressing key concerns regarding the suitability of parametric testing. We believe this approach is sufficiently rigorous to support our main conclusions. This rationale is now outlined on page 51.

      To be fully transparent, our aim is to present differences between wild-type and mutant strains that are clearly visible in the graphical data, such that the choice of statistical test does not become a limiting factor in interpreting biological relevance. We hope this rationale is understandable, and we sincerely appreciate the reviewer’s comment and the opportunity to clarify our analytical approach.

      We hope that Reviewer 1 will appreciate these considerations as sufficient justification to retain the statistical tests used in the original manuscript. Nevertheless, to constructively address this comment, we have performed the following revisions:

      1. __Consistent number of biological replicates:__ We performed additional biological replicates of the learning assay to confirm the behavioural phenotypes for the key candidates described (KIN-2, F46H5.3, ACC-1, ACC-3, LGC-46). We chose N = 5 since most studies cited in this paper that perform similar behavioural tests do the same (see the table below).

      Table 3 | A summary of sample sizes used by similar studies for chemotaxis assay data. The references in the left column used the sample sizes (N) in the right column, where N corresponds to biological replicates of chemotaxis assay data. N values are in bold when the study uses N ≤ 5.

      | Reference | N used in the study for chemotaxis assay data |
      | --- | --- |
      | Beets et al., 2020 | 8 |
      | Hiroki & Iino 2022 | 5-8 |
      | Hiroki et al., 2022 | 6-7 |
      | Hukema et al., 2006 | ≥ 4 |
      | Hukema et al., Learn. Mem. 2008 | ≥ 4 |
      | Jang et al., 2019 | ≥ 4 |
      | Kitazono et al., 2017 | ≥ 4 |
      | Kauffman et al., 2010 | ≥ 3 |
      | Kauffman et al., J. Vis. Exp. 2011 | ≥ 3 |
      | Lans et al., 2004 | 2 |
      | Lim et al., 2018 | 2-4 |
      | Lin et al., 2010 | ≥ 4 |
      | Nagashima et al., 2019 | ≥ 7 |
      | Ohno et al., 2014 | ≥ 11 |
      | Sakai et al., 2017 | ≥ 4 |
      | Stein & Murphy 2014 | 3-5 |
      | Tang et al., 2023 | ≥ 9 |
      | Watteyne et al., 2020 | ≥ 10 |

      2. __Grouped presentation of behavioural data:__ We now present all behavioural data by grouping genotypes tested within the same biological replicate, including wild-type controls, rather than combining genotypes tested separately. This ensures that each graph displays data from genotypes sharing the same N, also an important consideration for performing parametric tests. Accordingly, we re-performed statistical analyses using this reduced N for relevant graphs. As anticipated, this rendered some comparisons non-significant. All statistical comparisons are clearly indicated on each graph.

      3. __Improved clarity of figure legends:__ We revised figure legends for Figures 5, 6, S7, S8, & S9 to make clear how many biological replicates have been performed for each genotype by adding N numbers for each genotype in all figures.

      The authors use the phrasing "a non-significant trend", I find such claims uninterpretable and should be avoided. Examples: Page 16. Line 7 and Page 18, line 16.

      This is an important point. While we could not find the exact phrase "a non-significant trend" in the original manuscript, we acknowledge that describing a phenotype as both a trend and non-significant, as the original manuscript did in two locations, may confuse readers.

      The main text has been revised on pages 27 & 28 when describing comparisons between trained groups between two C. elegans lines, by removing mentions of trends and retaining descriptions of non-significance.

      Neuron-specific analysis and rescue of mutants:

      Throughout the study the authors avoid focusing on specific neurons. This is understandable as the authors aim at a systems biology approach, however, in my view this limits the impact of the study. I am aware that the proteome changes analyzed in this study were extracted from a pan neuronally expressed TurboID. Yet, neuron-specific changes may nevertheless be found. For example, running the protein lists from Table S2, in the Gene enrichment tool of wormbase, I found, across several biological replicates, enrichment for the NSM, CAN and RIG neurons. A more careful analysis may uncover specific neurons that take part in this associative memory paradigm. In addition, analysis of the overlap in expression of the final gene list in different neurons, comparing them, looking for overlap and connectivity, would also help to direct towards specific circuits.

      This is an important and useful suggestion. We appreciate the benefit in exploring the data from this study from a neuron class-specific lens, in addition to the systems-level analyses already presented.

      The WormBase gene enrichment tool is indeed valuable for broad transcriptomic analyses (the findings from utilising this tool are now on page 16); however, the Anatomy Ontology (AO) terms it uses also include annotations from more abundant non-neuronal tissues in the worm. To strengthen our analysis and complement the WormBase tool, we also used the CeNGEN database as suggested by Reviewer 3 Major Comment 1 (Taylor et al., 2021), which uses single cell RNA-Seq data to profile gene expression across the C. elegans nervous system. We input our learning proteome data into CeNGEN as a systemic analysis, identifying neurons highly represented by the learning proteome (on pages 16-20). To do this, we specifically compared genes/proteins from high-salt control worms and trained worms to identify potential neurons that may be involved in this learning paradigm. Briefly, we found:

      • WormBase gene enrichment tool: Enrichment for anatomy terms corresponding to specific interneurons (ADA, RIS, RIG), ventral nerve cord neurons, pharyngeal neurons (M1, M2, M5, I4), PVD sensory neurons, DD motor neurons, serotonergic NSM neurons, and CAN.
      • CeNGEN analysis: Representation of neurons previously implicated in associative learning (e.g., AVK interneurons, RIS interneurons, salt-sensing neuron ASEL, CEP & ADE dopaminergic neurons, and AIB interneurons), as well as neurons not previously studied in this context (pharyngeal neurons I3 & I6, polymodal neuron IL1, motor neuron DA9, and interneuron DVC).

      Methods are detailed on pages 50 & 51. These data are summarised in the revised manuscript as Table S7 & Figure 4.

      To further address the reviewer’s suggestion, we examined the overlap in expression patterns of the validated learning-associated genes acc-1, acc-3, lgc-46, kin-2, and F46H5.3 across the neuron classes above, using the CeNGEN database. This was done to explore the neuron classes in which these regulators may act to regulate learning. This analysis revealed both shared and distinct expression profiles, suggesting potential functional connectivity or co-regulation among subsets of neurons. To summarise, we found:

      • All five learning regulators are expressed in RIM interneurons and DB motor neurons.
      • KIN-2 and F46H5.3 share the same neuron expression profile and are present in many neurons, so they may play a general function within the nervous system to facilitate learning.
      • ACC-3 is expressed in three sensory neuron classes (ASE, CEP, & IL1).
      • In contrast, ACC-1 and LGC-46 are expressed in neuron classes (listed in brackets) implicated in gustatory or olfactory learning paradigms (AIB, AVK, NSM, RIG, & RIS) (Beets et al., 2012, Fadda et al., 2020, Wang et al., 2025, Zhou et al., 2023, Sato et al., 2021), neurons important for backward or forward locomotion (AVE, DA, DB, & VB) (Chalfie et al., 1985), and neuron classes whose function is yet to be detailed in the literature (ADA, I4, M1, M2, & M5).

      These neurons form a potential neural circuit that may underlie this form of behavioural plasticity, which we now describe in the main text on pages 16-20 & 34-35 and summarise in Figure 4.

      OPTIONAL: A rescue of the phenotype of the mutants by re-expression of the gene is missing, this makes sure to avoid false-positive results coming from background mutations. For example, a pan neuronal or endogenous promoter rescue would help the authors to substantiate their claims, this can be done for the most promising genes. The ideal experiment would be a neuron-specific rescue but this can be saved for future works.

      We appreciate this suggestion and recognise its potential to strengthen our manuscript. In response, we made many attempts to generate pan-neuronal and endogenous promoter re-expression lines. However, we faced several technical issues in transgenic line generation, including poor survival following microinjection likely due to protein overexpression toxicity (e.g., C30G12.6, F46H5.3), and reduced animal viability for chemotaxis assays, potentially linked to transgene-related reproductive defects (e.g., ACC-1). As we have previously successfully generated dozens of transgenic lines in past work (e.g. Chew et al., Neuron 2018; Chew et al., Phil Trans B 2018; Gadenne/Chew et al., Life Science Alliance 2022), we believe the failure to produce most of these lines is not likely due to technical limitations. For transparency, these observations have been included in the discussion section of the manuscript on pages 39 & 40 as considerations for future troubleshooting.

      Fortunately, we were able to generate a pan-neuronal promoter line for KIN-2 that has been tested and included in the revised manuscript. This new data is shown in __Figure 5B__ and described on __pages 23 & 24__. Briefly, this shows that pan-neuronal expression of KIN-2 from the ce179 mutant allele is sufficient to reproduce the enhanced learning phenotype observed in kin-2(ce179) animals, confirming the role of KIN-2 in gustatory learning.

      To address the potential involvement of background mutations (also indicated by Reviewer 4 under ‘cross-commenting’), we have also performed experiments with backcrossed versions of several mutants. These experiments aimed to confirm that salt associative learning phenotypes are due to the expected mutation. Namely, we assessed kin-2(ce179) mutants that had been backcrossed previously by another laboratory, as well as C30G12.6(-) and F46H5.3(-) animals backcrossed in this study. Although not all backcrossed mutants retained their original phenotype (i.e., C30G12.6) (Figure 6D, a newly added figure), we found that backcrossed versions of KIN-2 and F46H5.3 both robustly showed enhanced learning (Figures 5A & 6B). This is described in the text on pages 23-26.

      __Minor comments:__

      1. Lack of clarity regarding the validation of the biotin tagging of the proteome. The authors show in Figure 1 that they validated that the combination of the transgene and biotin allows them to find more biotin-tagged proteins. However there is significant biotin background also in control samples as is common for this method. The authors mention they validated biotin tagging of all their experiments, but it was unclear in the text whether they validated it in comparison to no-biotin controls, and checked for the fold change difference.

      This is an important point. We validated our biotin tagging method prior to mass spectrometry by comparing ‘no biotin’ and ‘biotin’ groups. This is shown in Figure S1 in the revised manuscript, which includes a western blot comparing untreated and biotin treated animals that are non-transgenic or expressing TurboID. As expected, by comparing biotinylated protein signal for untreated and treated lanes within each line, biotin treatment increased the signal 1.30-fold for non-transgenic and 1.70-fold for TurboID C. elegans. This is described on __page 8__ of the revised manuscript.

      To clarify, for mass spectrometry experiments, we tested a no-TurboID (non-transgenic) control, but did not perform a no-biotin control. We included the following four groups: (1) No-TurboID ‘control’, (2) No-TurboID ‘trained’, (3) pan-neuronal TurboID ‘control’ and (4) pan-neuronal TurboID ‘trained’, where trained versus control refers to whether ‘no salt’ was used as the conditioned stimulus or not, respectively (illustrated in Figure 1A). Due to the complexity of the learning assay (which involves multiple washes and handling steps, including a critical step where biotin is added during the conditioning period), and the need to collect sufficient numbers of worms for protein extraction (>3,000 worms per experimental group), adding ‘no-biotin’ controls would have doubled the number of experimental groups, which we considered unfeasible for practical reasons. This is explained on __pages 8 & 9__ of the revised manuscript.

      Also, it was unclear which exact samples were tested per replicate. In Page 9, Lines 17-18: "For all replicates, we determined that biotinylated proteins could be observed ...", But in Page 8, Line 24 : "We then isolated proteins from ... worms per group for both 'control' and 'trained' groups,... some of which were probed via western blotting to confirm the presence of biotinylated proteins".

      • Could the authors specify which samples were verified and clarify how?

      Thank you for pointing out these unclear statements. We have clarified the experimental groups used for mass spectrometry experiments, as detailed in the response above, on __pages 8 & 9__. In addition, western blots corresponding to each biological replicate of mass spectrometry data are described in the main text on page 10 and have been added to the revised manuscript (as Figure S3). These western blots compare biotinylation signal for proteins extracted from (1) No-TurboID ‘control’, (2) No-TurboID ‘trained’, (3) pan-neuronal TurboID ‘control’ and (4) pan-neuronal TurboID ‘trained’. These blots confirm that there were biotinylated proteins in TurboID samples before enrichment by streptavidin-mediated pull-down for mass spectrometry.

      OPTIONAL: include the fold changes of biotinylated proteins of all the ones that were tested. Similar to Figure 1.C.

      This is an excellent suggestion. As recommended by the reviewer, we have included fold-changes for biotinylated protein levels between high-salt control and trained groups (on pages 9 & 10 for replicate #1 and in __Table S2__ for replicates #2-5). This was done by measuring protein levels in whole lanes for each experimental group per biological replicate within western blots (__Figure 1C__ for replicate #1 and __Figure S3__ for replicates #2-5) of protein samples generated for mass spectrometry (N = 5).

      Figure 2 does not add much to the reader, it can be summarized in the text, as the fraction of proteins enriched for specific cellular compartments.

      • I would suggest moving the content of Figure 2 (originally written as Figure 3) into the text, or transferring the figure to the supplementary material.

      As noted in the cross-comment response to Reviewer 4, there were typos in the original figure references; we have corrected them above. Essentially, this comment is referring to Figure 2.

      We appreciate this feedback from Reviewer 1. We agree that the original __Figure 2__ functions as a visual summary from analysis of the learning proteome at the subcellular compartment level. However, it also serves to highlight the following:

      • Representation for neuron-specific GO terms is relatively low, but even this small percentage represents entire protein-protein networks that are biologically meaningful, but that are difficult to adequately describe in the main text.
      • TurboID was expressed in neurons so this figure supports the relevance of the identified proteome to biological learning mechanisms.
      • Many of these candidates could not be assessed by learning assay using single mutants since related mutations are lethal or substantially affect locomotion. These networks therefore highlight the benefit in using strategies like TurboID to study learning.

      We have chosen to retain this figure, moving it to the supplementary material as Figure S4 in the revised manuscript, as suggested.

      • OPTIONAL- I would suggest the authors to mark in a pathway summary figure similar to Figure 3 (originally written as Figure 4) the results from the behavior assay of the genetic screen. This would allow the reader to better get the bigger picture and to connect to the systemic approach taken in Figures 2 and 3.

      We think this is a fantastic suggestion and thank Reviewer 1 for this input. In the revised manuscript, we have added Figure 7, which summarises the tested candidates that displayed an effect on learning, mapped onto potential molecular pathways derived from networks in the learning proteome. This figure provides a visual framework linking the behavioural outcomes to the network context. This is described in the main text on pages 32-33.

      Typo in Figure 3: the circle of PPM1: The blue right circle half is bigger than the left one.

      We thank the Reviewer for noticing this; the node size for PPM-1.A has been corrected in what is now Figure 2 in the revised work.

      Unclarity in the discussions. In the discussion Page 24, Line 14, the authors raise this question: "why are the proteins we identified not general learning regulators?". The phrasing and logic of the argumentation of the possible answers was hard to follow. - Can you clarify?

      We appreciate this feedback in terms of unclarity, as we strive to explain the data as clearly and transparently as possible. Our goal in this paragraph was to discuss why some candidates were seen to only affect salt associative learning, as opposed to showing effects in multiple learning paradigms (i.e., which we were defining as a ‘general learning regulator’). We have adjusted the wording in several places in this paragraph now on pages 36 & 37 to address this comment. We hope the rephrased paragraph provides sufficient rationalisation for the discussion regarding our selection strategy used to isolate our protein list of potential learning regulators, and its potential limitations.

      ***Cross-Commenting***

      Firstly, we would like to express our appreciation for the opportunity for reviewers to cross-comment on feedback from other reviewers. We believe this is an excellent feature of the peer review process, and we are grateful to the reviewers for their thoughtful engagement and collaborative input.

      I would like to thank Reviewer #4 for the great cross comment summary, I find it accurate and helpful.

      I also would like to thank Reviewer #4 for spotting the typos in my minor comments, their page and figure numbers are the correct ones.

      We have corrected these typos in the relevant comments, and have responded to them accordingly.

      Small comment on common point 1 - My feeling is that it is challenging to do quantitative mass spectrometry, especially with TurboID. In general, the nature of MS data is that it hints towards a direction but a follow-up validation work is required in order to assess it. For example, I am not surprised that the fraction of repeats a hit appeared in does not predict well whether this hit would be validated behaviorally. Given these limitations, I find the authors' approach reasonable.

      We thank Reviewer 1 for this positive and thoughtful feedback. We also appreciate Reviewer 4’s comment regarding quantitative mass spectrometry and have addressed this in detail below (see response to Reviewer 4). However, we agree with Reviewer 1 that there are practical challenges to performing quantitative mass spectrometry with TurboID, primarily due to the enrichment for biotinylated proteins that is a key feature of the sample preparation process.

      Importantly, we whole-heartedly agree with Reviewer 1’s statement that “In general, the nature of MS data is that it hints towards a direction but a follow-up validation work is required in order to assess it”. This is the core of our approach: however, we appreciate that there are limitations to a qualitative ‘absent/present’ approach. We have addressed some of these limitations by clarifying the criteria used for selecting candidate genes, based additionally on the presence of the candidate in multiple biological replicates (categorised as ‘strong’ hits). Based on this method, we were able to validate the role of several novel learning regulators (Figures 5, 6, & S7). We sincerely hope that this manuscript can function as a direction for future research, as suggested by this Reviewer.

      I also would like to highlight this major comment from reviewer 4:

      "In Experimental Procedures, authors state that they excluded data in which naive or control groups showed average CI 0.5499 for N2 (page 36, lines 5-7). "

      This threshold seems arbitrary to me too, and it requires the clarifications requested by reviewer 4.

      As detailed in our response to Reviewer 4, Major Comment 2, data were excluded only in rare cases, specifically when N2 worms failed to show strong salt attraction prior to training, or when trained N2 worms did not exhibit the expected behavioural difference compared to untrained controls – this can largely be attributed to clear contamination or over-population issues, which are visible prior to assessing CTX plates and counting chemotaxis indices.

      These criteria were initially established to provide an objective threshold for excluding biological replicates, particularly when planning to assay a large number of genetic mutants. However, after extensive testing across many replicates, we found that N2 worms (that were not starved, or not contaminated) consistently displayed the expected phenotype, rendering these thresholds unnecessary. We acknowledge that emphasizing these criteria may have been misleading, and have therefore removed them from page 50 in the revised manuscript to avoid confusion and ensure clarity.

      Reviewer #1 (Significance (Required)):

      This study does a great job of effectively utilizing the TurboID technique to identify new pathways implicated in salt-associative learning in C. elegans. This technique was used in C. elegans before, but not in this context. The salt-associative memory induced proteome list is a valuable resource that will help future studies on associative memory in worms. Some of the implicated molecular pathways, like cAMP, were found before to be involved in memory in worms, as correctly referenced in the manuscript. The implication of the acetylcholine pathway is novel for C. elegans, to the best of my knowledge. The finding that the uncovered genes are specifically required for salt associative memory and not for other memory assays is also interesting.

      However overall I find the impact of this study limited. The premise of this work is to use the Turbo-ID method to conduct a systems analysis of the proteomic changes. The work starts by conducting network analysis and gene enrichment which fit a systemic approach. However, since the authors find that ~30% of the tested hits affect the phenotype, and since only 17/706 proteins were assessed, it is challenging to draw conclusive broad systemic claims. Alternatively, the authors could have focused on the positive hits, and understand them better, find the specific circuits where these genes act. This could have increased the impact of the work. Since neither of these two options are satisfied, I view this work as solid, but not wide in its impact and therefore estimate the audience of this study would be more specialized.

      My expertise is in C. elegans behavior, genetics, and neuronal activity, programming and machine learning.

      We thank the Reviewer for these comments and appreciate the recognition of the value of the proteomic dataset and the identification of novel molecular pathways, including the acetylcholine pathway, as well as the specificity of the uncovered genes to salt-associative memory.

      Regarding the reviewer’s concern about the overall impact and scope of the study, we respectfully offer the following clarification. Our aim was to establish a systems-level approach for investigating learning-related proteomic changes using TurboID, and we acknowledge that only a subset of the identified proteins was experimentally tested (now 26/706 proteins in the revised manuscript). Although only five of the tested single gene mutants showed a robust learning phenotype in the revised work (after backcrossing, more stringent candidate selection, improved statistical analysis in addressing reviewer comments), our proteomic data provides us a unique opportunity to define these candidates within protein-protein networks (as illustrated in Figure 7). Importantly, our functional testing focused on single-gene mutants, which may not reveal phenotypes for genes that act redundantly (now mentioned on pages 28-30). This limitation is inherent to many genetic screens and highlights the value of our proteomic dataset, which enables the identification of broader protein-protein interaction networks and molecular pathways potentially involved in learning.

      To support this systems-level perspective, we have added Figure 7, which visually integrates the tested candidates into molecular pathways derived from the learning proteome for learning regulators KIN-2 and F46H5.3. We also emphasise more explicitly in the text (on pages 32-33) the value of our approach by highlighting the functional protein networks that can be derived from our proteomics dataset.

      We fully acknowledge that the use of TurboID across all neurons limits the resolution needed to pinpoint individual neuron contributions, and understand the benefit in further experiments to explore specific circuits. Many circuits required for salt sensing and salt-based learning are highly explored in the literature and defined explicitly (see Rahmani & Chew, 2021), so our intention was to complement the existing literature by exploring the protein-protein networks involved in learning, rather than on neuron-neuron connectivity. However, we recognise the benefit in integrating circuit-level analyses, given that our proteomic data suggests hundreds of candidates potentially involved in learning. While validating each of these candidates is beyond the scope of the current study, we have taken steps to suggest candidate neurons/circuits by incorporating tissue enrichment analyses and single-cell transcriptomic data (Table S7 & Figure 4). These additions highlight neuron classes of interest and suggest possible circuits relevant to learning.

      We hope this clarification helps convey the intended scope and contribution of our study. We also believe that the revisions made in response to Reviewer 1’s feedback have strengthened the manuscript and enhanced its significance within the field.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      __Summary:__

      In this study by Rahmani and colleagues, the authors sought to define the "learning proteome" for a gustatory associative learning paradigm in C. elegans. Using a cytoplasmic TurboID expressed under the control of a pan-neuronal promoter, the authors labeled proteins during the training portion of the paradigm, followed by proteomics analysis. This approach revealed hundreds of proteins potentially involved in learning, which the authors describe using gene ontology and pathways analysis. The authors performed functional characterization of some of these genes for their requirement in learning using the same paradigm. They also compared the requirement for these genes across various learning paradigms, and found that most hits they characterized appear to be specifically required for the training paradigm used for generating the "learning proteome".

      Major Comments:

      1. The definition of a "hit" from the TurboID approach does not appear stringent enough. According to the manuscript, a hit was defined as one unique peptide detected in a single biological replicate (out of 5), which could give rise to false positives. In Figure S2, it is clear that there is relatively little overlap between replicates with regard to the proteins detected, and while perhaps unintentional, presenting a single unique peptide appears to be an attempt to inflate the number of hits. Defining hits as present in more than one sample would be more rigorous. Changing the definition of hits would only require the time to re-list genes and change data presented in the manuscript accordingly.

      We thank Reviewer 2 for this valuable comment, and the following related suggestion. We agree with the statement that “Defining hits as present in more than one sample would be more rigorous”. Therefore, to address this comment, we have now separated candidates into two categories in __Table 2__ in the revised manuscript: ‘strong’ (present in 3 or more biological replicates) and ‘weak’ candidates (present in 2 or fewer biological replicates). However, we think these weaker candidates should still be included in the manuscript, considering we did observe relationships between these proteins and learning. For example, ACC-1, which influences salt associative learning in C. elegans, was detected in one replicate of mass spectrometry as a potential learning regulator (Figure S8A). We describe this classification in the main text on pages 21-22.
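      The ‘strong’/‘weak’ rule described above can be written as a short sketch. The function name and the `strong_threshold` parameter are hypothetical and introduced only for illustration; the threshold of 3 and the replicate counts are the ones quoted in this response.

```python
def classify_hit(replicates_detected: int, strong_threshold: int = 3) -> str:
    """Label a candidate 'strong' or 'weak' from the number of replicates it was detected in."""
    if replicates_detected < 1:
        raise ValueError("a hit must be detected in at least one replicate")
    return "strong" if replicates_detected >= strong_threshold else "weak"


# Replicate counts quoted in this response (out of 5 mass spectrometry replicates).
replicate_counts = {"F46H5.3": 4, "IFT-139": 5, "ACC-1": 1}
labels = {name: classify_hit(count) for name, count in replicate_counts.items()}
```

      Under this rule F46H5.3 and IFT-139 are labelled ‘strong’ and ACC-1 ‘weak’, which matches how these candidates are described in the response.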

      We also agree with Reviewer 2 that the overlap between individual candidate hits is low between biological replicates; the inclusion of __Figure S2__ in the original manuscript serves to highlight this limitation. However, it is also important to consider that there is notable overlap for whole molecular pathways between biological replicates of mass spectrometry data, as shown in __Figure 2__ in the revised manuscript (this consideration is now mentioned on __pages 13-14__). As an additional visual aid for the overlap between replicates of mass spectrometry, we have included Figure 3, which illustrates the representation across several biological replicates of two metabolic processes normally indispensable to animal health. We provide this figure (described on pages 13 & 15) to demonstrate a strength of our approach: it can detect candidates not easily assessable by conventional forward or reverse genetic screens.

      We also appreciate the opportunity to explain our approach. The criterion of “at least one unique peptide” was chosen based on previous work that we adapted for this manuscript (Prikas et al., 2020). It was not intended to inflate the number of hits but rather to ensure sensitivity in detecting low-abundance neuronal proteins. We have clarified this in our Methods (page 46).

      The "hits" that the authors chose to functionally characterize do not seem like strong candidate hits based on the proteomics data that they generated. Indeed, most of the hits are present in a single, or at most 2, biological replicate. It is unclear as to why the strongest hits were not characterized, which if mutant strains are publicly available, would not be a difficult experiment to perform.

      We thank the reviewer for this important suggestion. To address this, we have described two molecular pathways with multiple components that appear in more than one biological replicate of mass spectrometry data in Figure 3 (main text on page 13). In addition, we have included Figures 6 & S7, where 9 additional single mutants corresponding to candidates in three or more biological replicates of mass spectrometry were tested for salt associative learning. Briefly, we found the following (the number of replicates in which a protein was unique to TurboID trained animals is in brackets):

      • Novel arginine kinase F46H5.3 (4 replicates) displays an effect in both salt associative learning and salt aversive learning in the same direction (Figures 6A, 6B, & S9A, pages 31-32 & 37-38).
      • Worms with a mutation for armadillo-domain protein C30G12.6 (3 replicates) only displayed an enhanced learning phenotype when non-backcrossed, not backcrossed. This suggests the enhanced learning phenotype was caused by a background mutation (Figure 6, pages 24-25).
      • We did not observe an effect on salt associative learning when assessing mutations for the ciliogenesis protein IFT-139 (5 replicates), guanyl nucleotide factors AEX-3 or TAG-52 (3 replicates), p38/MAPK pathway interactor FSN-1 (3 replicates), IGCAM/RIG-4 (3 replicates), and acetylcholine components ACR-2 (4 replicates) and ELP-1 (3 replicates) (Figure S7, on pages 27-30). However, we note throughout the section in which these candidates are described that only single gene mutants were tested, meaning that genes that function in redundant or compensatory pathways may not exhibit a detectable phenotype.

      Because of the lack of strong evidence that these are indeed proteins regulated in the context of learning based on proteomics, including evidence of changes in the proteins (by imaging expression changes of fluorescent reporters or a biochemical approach) would increase confidence that these hits are genuine.

      We thank Reviewer 2 for this suggestion – we agree that it would have been ideal to have additional evidence suggesting that changes in candidate protein levels are associated directly with learning. Ideally, we would have explored this aspect further; however, as outlined in response to Reviewer 1 Major Comment 2 (OPTIONAL), this was not feasible within the scope of the current study due to several practical challenges. Specifically, we attempted to generate pan-neuronal and endogenous promoter rescue lines for several candidates, but encountered significant challenges, including poor survival post-microinjection (likely due to protein overexpression toxicity) and reduced viability for behavioural assays, potentially linked to transgene-related reproductive defects. This information is now described on pages 39 & 40 of the revised work.

      To address these limitations, we performed additional behavioural experiments where possible. We successfully generated a pan-neuronal promoter line for kin-2, which was tested and included in the revised manuscript (Figure 5B, pages 30 & 31). In addition, to confirm that observed learning phenotypes were due to the expected mutations and not background effects, we conducted experiments using backcrossed versions of several mutant lines as suggested by Reviewer 4 Cross Comment 3 (Figure 6, pages 23-24 & 24-26). Briefly, this shows that pan-neuronal expression of KIN-2 from the ce179 mutant allele is sufficient to repeat the enhanced learning phenotype observed in backcrossed kin-2(ce179) animals, providing additional evidence that the identified hits are required for learning. We also confirmed that F46H5.3 modulates salt associative learning, given both non-backcrossed and backcrossed F46H5.3(-) mutants display a learning enhancement phenotype. The revised text now describes this data on the page numbers mentioned above.

      Minor Comments:

      1. The authors highlight that the proteins they discover seem to function uniquely in their gustatory associative paradigm, but this is not completely accurate. kin-2, which they characterize in figure 4, is required for positive butanone association (the authors even say as much in the manuscript) in Stein and Murphy, 2014.

      We appreciate this correction and thank the Reviewer for pointing this out. We have amended the wording appropriately on page 31 to clarify our meaning.

      2. “Although kin-2(ce179) mutants were not shown to impact salt aversive learning, they have been reported previously to display impaired intermediate-term memory (but intact learning and short-term memory) for butanone appetitive learning (Stein and Murphy, 2014).”

      Reviewer #2 (Significance (Required)):

      • General Assessment: The approach used in this study is interesting and has the potential to further our knowledge about the molecular mechanisms of associative behaviors. Strengths of the study include the design with carefully thought out controls, and the premise of combining their proteomics with behavioral analysis to better understand the biological significance of their proteomics findings. However, the criteria for defining hits and prioritization of hits for behavioral characterizations were major weaknesses of the paper.
      • Advance: There have been multiple transcriptomic studies in the worm looking at gene expression changes in the context of behavioral training (Lakhina et al., 2015, Freytag 2017). This study complements and extends those studies by examining how the proteome changes in a different training paradigm. The approach could be employed for multiple different training paradigms, presenting a new technical advance for the field.
      • Audience: This paper would be of interest to the broader field of behavioral and molecular neuroscience. Though it uses an invertebrate system, many findings in the worm regarding learning and memory translate to higher organisms.
      • I am an expert in molecular and behavioral neuroscience in both vertebrate and invertebrate models, with experience in genetics and genomics approaches.

      We appreciate Reviewer 2’s thoughtful assessment and constructive feedback. In response to concerns regarding definition and prioritisation of hits, we have revised our approach as detailed above to place more consideration on ‘strong’ hits present in multiple biological replicates. We have also added new behavioural data for additional mutants that fall into this category (Figures 6 & S7). We hope these revisions strengthen our study and enhance its relevance to the behavioural/molecular neuroscience community.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      Summary:

      In the manuscript titled "Identifying regulators of associative learning using a protein-labelling approach in C. elegans" the authors attempted to generate a snapshot of the proteomic changes that happen in the C. elegans nervous system during learning and memory formation. They employed the TurboID-based protein labeling method to identify the proteins that are uniquely found in samples that underwent training to associate no-salt with food, and consequently exhibited lower attraction to high salt in a chemotaxis assay. Using this system they obtained a list of target proteins that included proteins represented in molecular pathways previously implicated in associative learning. The authors then further validated some of the hits from the assay by testing single gene mutants for effects on learning and memory formation.

      Major Comments:

      In the discussion section, the authors comment on the sources of "background noise" in their data and ways to improve the specificity. They provide some analysis on this aspect in Supplementary figure S2. However, a better visualization of non-specificity in the sample could be a GO analysis of tissue-specificity, and presented as a pie chart as in Figure 2A. Non-neuronal proteins such as MYO-2 or MYO-3 repeatedly show up on the "TurboID trained" lists in several biological replicates (Tables S2 and S3). If a major fraction of the proteins after subtraction of control lists are non-specific, that increases the likelihood that the "hits" observed are by chance. This analysis should be presented in one of the main figures as it is essential for the reader to gauge the reliability of the experiment.

      We agree with this assessment and thank Reviewer 3 for this constructive suggestion. In response, we have now incorporated a comprehensive tissue-specific analysis of the learning proteome in the revised manuscript. Using the single neuron RNA-Seq database CeNGEN, we identified the proportion of neuronal vs non-neuronal proteins from each biological replicate of mass spectrometry data. Specifically, we present Table 1 on page 17 (which we originally intended to include in the manuscript, but inadvertently left out), which shows that 87-95% (i.e. a large majority) of proteins identified across replicates corresponded to genes detected in neurons, supporting that the TurboID enzyme was able to target the neuronal proteome as expected. Table 1 is now described in the main text of the revised work on page 16.

      In addition, we performed neuron-specific analyses using both the WormBase gene enrichment tool and the CeNGEN single-cell transcriptomic database, which we describe in detail on our response to Reviewer 1 Major Comment 2. To summarise, these analyses revealed enrichment of several neuron classes, including those previously implicated in associative learning (e.g., ASEL, AIB, RIS, AVK) as well as neurons not previously studied in this context (e.g., IL1, DA9, DVC) (summarised in Table S7). By examining expression overlap across neuron types, we identified shared and distinct profiles that suggest potential functional connectivity and candidate circuits underlying behavioural plasticity (Figure 4). Taken together, these data show that the proteins identified in our dataset are (1) neuronal and (2) expressed in neurons that are known to be required for learning. Methods are detailed on pages 50-51.

      Other than the above, the authors have provided sufficient details in their experimental and analysis procedures. They have performed appropriate controls, and their data has sufficient biological and technical replicates for statistical analysis.

      We appreciate this positive feedback and thank the Reviewer for acknowledging the clarity of our experimental and analysis procedures.

      Minor Comments:

      There is an error in the first paragraph of the discussion, in the sentences discussing the learning effects in gar-1 mutant worms. The sentences in lines 12-16 on page 22 say that gar-1 mutants have improved salt-associative learning and defective salt-aversive learning, while in fact the data and figures state the opposite.

      We appreciate the Reviewer noting this discrepancy. As clarified in our response to Reviewer 1, Major Comment 1 above, we reanalysed the behavioural data to ensure consistency across genotypes by comparing only those tested within the same biological replicates (thus having the same N for all genotypes). Upon this reanalysis, we found that the previously reported phenotype for gar-1 mutants in salt-associative learning was not statistically different from wild-type controls. Therefore, we have removed references to GAR-1 from the manuscript.

      Reviewer #3 (Significance (Required)):

      Strengths and limitations: This study used neuron-specific TurboID expression with transient biotin exposure to capture a temporally restricted snapshot of the C. elegans nervous system proteome during salt-associative learning. This is an elegant method to identify proteins temporally specific to a certain condition. However, there are several limitations in the way the experiments and analyses were performed which affect the reliability of the data. As the authors themselves have noted in the discussion, background noise is a major issue and several steps could be taken to reduce the noise at the experimental or analysis steps (use of integrated C. elegans lines to ensure uniformity of samples, flow cytometry to isolate neurons, quantitative mass spec to detect fold change vs. strict presence/absence).

      Advance: Several studies have demonstrated the use of proximity labeling to map the interactome by using a bait protein fusion. In fact, expressing TurboID not fused to a bait protein is often used as a negative control in proximity labeling experiments. However, this study demonstrates the use of free TurboID molecules to acquire a global snapshot of the proteome under a given condition.

      Audience: Even with the significant limitations, this study is specifically of interest to researchers interested in understanding learning and memory formation. Broadly, the methods used in this study could be modified to gain insights into the proteomic profiles at other transient developmental stages.

      The reviewer's field of expertise: Cell biology of C. elegans neurons.

      We thank the reviewer for their thoughtful evaluation of our work. We appreciate the recognition of the novelty and potential of using neuron-specific TurboID to capture a temporally restricted snapshot of the C. elegans nervous system proteome during learning. We agree that this approach offers a unique opportunity to identify proteins associated with specific behavioural states in future studies.

      We also appreciate the reviewer’s comments regarding limitations in experimental and analytical design. In revising the manuscript, we have taken several steps to address these concerns and improve the clarity, rigour, and interpretability of our data. Specifically:

      • We now provide a frequency-based representation of proteomic hits (Table 2), which helps clarify how candidate proteins were selected and highlights differences between trained and control groups.
      • We have added neuron-specific enrichment analyses using both WormBase and CeNGEN databases (Table S7 & Figure 4), which help identify candidate neurons and potential circuits involved in learning (methods on pages 50-51).
      • We have clarified the rationale for using qualitative proteomics in the context of TurboID, in addition to acknowledging the challenges of integrating quantitative mass spectrometry with biotin-based enrichment (page 39).

      Additional methods for improving sample purity, such as using integrated lines or FACS-enrichment of neurons, could further refine this approach in future studies. For transparency, we did attempt to integrate the TurboID transgenic line to improve the strength and consistency of biotinylation signals. However, despite four rounds of backcrossing, this line exhibited unexpected phenotypes, including a failure to respond reliably to the established training protocol. As a result, we were unable to include it in the current study. Nonetheless, we believe our current approach provides a valuable proof-of-concept and lays the groundwork for future refinement.

      By addressing the major concerns of peer reviewers, we believe our study makes a significant and impactful contribution by demonstrating the feasibility of using TurboID to capture learning-induced proteomic changes in the nervous system. The identification of novel learning-related mutants, including those involved in acetylcholine signalling and cAMP pathways, provides new directions for future research into the molecular and circuit-level mechanisms of behavioural plasticity.

      Reviewer #4 (Evidence, reproducibility and clarity (Required)):

      Summary:

      In this manuscript, authors used a learning paradigm in C. elegans; when worms are fed on a saltless plate, their chemotaxis to salt is greatly reduced. To identify learning-related proteins, authors employed nervous system-specific transcriptome analysis to compare whole proteins in neurons between high-salt-fed animals and saltless-fed animals. Authors identified "learning-specific genes" which are observed only after saltless feeding. They categorized these proteins by GO analyses and pathway analyses, and further stepped forward to test mutants in selected genes identified by the proteome analysis. They find several mutants that are defective or hyper-proficient for learning, including acc-1/3 and lgc-46 acetylcholine receptors, gar-1 acetylcholine receptor GPCR, glna-3 glutaminase involved in glutamate biosynthesis, and kin-2, a cAMP pathway gene. These mutants were not previously reported to have abnormality in the learning paradigm.

      Major comments:

      1) There are problems in the data processing and presentation of the proteomics data in the current manuscript which deteriorates the utility of the data. First, as the authors discuss (page 24, lines 5-12), the current approach does not consider amount of the peptides. Authors state that their current approach is "conservative", because some of the proteins may be present in both control and learned samples but in different amounts. This reviewer has a concern in the opposite way: some of the identified proteins may be pseudo-positive artifacts caused by the analytical noise. The problem is that authors included peptides that are "present" in "TurboID, trained" sample but "absent" in the "Non-Tg, trained" and "TurboID, control" samples in any one of the biological replicates, to identify "learning proteome" (706 proteins, page 8, last line - page 9, line 8; page 32, line 21-22). The word "present" implies that they included even peptides whose amounts are just above the detection threshold, which is subject to random noise caused by the detector or during sample collection and preparation processes. This consideration is partly supported by the fact that only a small fraction of the proteins are common between biological replicates (honestly and respectably shown in Figure S2). Because of this problem, there is no statistical estimate of the identity in "learning proteome" in the current manuscript. Therefore, the presentation style in Tables S2 and S3 are not very useful for readers, especially because authors already subtracted proteins identified in Non-Tg samples, which must also suffer from stochastic noise. I suggest either quantifying the MS/MS signal, or if authors need to stick to the "present"/"absent" description of the MS/MS data, use the number of appearances in biological replicates of each protein as estimate of the quantity of each protein. For example, found in 2 replicates in "TurboID, learned" and in 0 replicates in "Non-Tg, trained". 
One can apply statistics to these counts. This said, I would like to stress that proteins related to acquisition of memory may be very rare, especially because learning-related changes likely occur in a small subset of neurons. Therefore, 1 time vs 0 time may be still important, as well as something like 5 times vs 1 time. In summary, quantitative description of the proteomics results is desired.
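
As a rough illustration of what applying statistics to such replicate counts could look like, the sketch below runs a one-sided Fisher's exact test on "detected in k of n replicates" counts for two groups. This is not an analysis from the manuscript: the function name, the choice of Fisher's exact test, and the assumption of five replicates in both groups are illustrative only.

```python
from math import comb

def fisher_one_sided(hits_a, n_a, hits_b, n_b):
    """One-sided Fisher's exact test on replicate detection counts.

    Returns the probability of seeing at least `hits_a` detections in
    group A (n_a replicates) if the hits_a + hits_b detections were spread
    at random over all n_a + n_b replicates (hypergeometric null).
    """
    total_hits = hits_a + hits_b
    denom = comb(n_a + n_b, total_hits)
    p = 0.0
    for k in range(hits_a, min(n_a, total_hits) + 1):
        # comb() returns 0 when total_hits - k exceeds n_b
        p += comb(n_a, k) * comb(n_b, total_hits - k) / denom
    return p

# Hypothetical counts, assuming 5 replicates per group.
p_2_vs_0 = fisher_one_sided(2, 5, 0, 5)  # 2 of 5 trained vs 0 of 5 control
p_5_vs_1 = fisher_one_sided(5, 5, 1, 5)  # 5 of 5 trained vs 1 of 5 control
p_1_vs_0 = fisher_one_sided(1, 5, 0, 5)  # 1 of 5 trained vs 0 of 5 control
print(round(p_2_vs_0, 3), round(p_5_vs_1, 3), round(p_1_vs_0, 3))
```

Under these assumptions, a "1 time vs 0 time" protein can never reach a small p-value with five replicates per group, which is consistent with the caveat above that such proteins may still be biologically important even though the counts alone cannot establish it.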

      We thank the reviewer for these valuable comments and suggestions.

      We acknowledge that quantitative proteomics would provide beneficial information; however, as also indicated by Reviewer 1 (in cross-comment), it is practically challenging to perform with TurboID. We have included discussion of potential future experiments involving quantitative mass spectrometry, as well as a comprehensive discussion of some of the limitations of our approach as summarised by this Reviewer, in the Discussion section (page 39). However, we note that our qualitative approach also provides beneficial knowledge, such as the identification of functional protein networks acting within biological pathways previously implicated in learning (Figure 2), and novel learning regulators ACC-1/3, LGC-46, and F46H5.3.

      We agree with the assessment that the frequency of occurrence for each candidate we test per biological replicate is useful to disclose in the manuscript as a proxy for quantification. This was also highlighted by Reviewer 2 (Major Comment 1). As detailed above in response to R2, we have now separated candidates into two categories: ‘strong’ (present in 3 or more biological replicates) and ‘weak’ candidates (present in 2 or fewer biological replicates). We have also added behavioural data after testing 9 of these strong candidates in Figures 6 & S7.
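
The ‘strong’/‘weak’ classification described above can be sketched as a simple count over replicates. The snippet below is a minimal illustration, not the authors' pipeline: the function name and data layout are assumptions, and while the replicate totals for F46H5.3 (4), IFT-139 (5) and ACC-1 (1) follow the numbers quoted in this response, the assignment of proteins to particular replicates is invented for the example.

```python
from collections import Counter

def classify_candidates(unique_hits_per_replicate, strong_min=3):
    """Label each protein 'strong' or 'weak' by how many replicates contain it.

    `unique_hits_per_replicate` is a list with one collection of protein names
    per biological replicate (proteins unique to TurboID trained animals).
    """
    counts = Counter()
    for hits in unique_hits_per_replicate:
        counts.update(set(hits))  # count each protein once per replicate
    return {
        protein: ("strong" if n >= strong_min else "weak")
        for protein, n in counts.items()
    }

# Illustrative layout only: which replicate each protein appears in is made up.
replicates = [
    {"F46H5.3", "IFT-139", "ACC-1"},
    {"F46H5.3", "IFT-139"},
    {"F46H5.3", "IFT-139"},
    {"F46H5.3", "IFT-139"},
    {"IFT-139"},
]
labels = classify_candidates(replicates)
for protein in sorted(labels):
    print(protein, labels[protein])
```

The threshold is exposed as `strong_min` so that the effect of a stricter or looser cut-off on the candidate list can be checked directly.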

      We have also added Table 2 to the revised manuscript, which summarises the frequency-based representation of the proteomics results, as suggested. This is described on pages 22-23. Briefly, this shows the range of candidates further explored using single mutant testing. Specifically, this data showed that many of the tested candidates were more frequently detected in trained worms compared to high-salt controls. This includes both strong and weak candidates, providing a clearer view of how proteomic frequency informed our selection for functional testing.

      2) There is another problem in the treatment of the behavioural data. In Experimental Procedures, authors state that they excluded data in which naive or control groups showed average CI 0.5499 for N2 (page 36, lines 5-7). How were these values determined? One common example for judging a data point as an outlier is > mean + 1.5, 2 or 3 SD, or

      Thank you for pointing this out. As mentioned by both Reviewer 1 and Reviewer 4, the original manuscript states the following: “Data was excluded for salt associative learning experiments when wild-type N2 displayed (1) an average CI ≤ 0.6499 for naïve or control groups and/or (2) an average CI either 0.5499 for trained groups.”

      To clarify, we only excluded experiments in rare cases where N2 worms did not display robust high salt attraction before training, or where trained N2 did not display the expected behavioural difference compared to untrained or high-salt control N2. These anomalies were typically attributable to clear contamination or starvation issues that could clearly be observed prior to counting chemotaxis indices on CTX plates.

      We established these exclusion criteria in advance of conducting multiple learning assays to ensure an objective threshold for identifying and excluding assays affected by these rare but observable issues. However, these criteria were later found to be unnecessary, as N2 worms robustly displayed the expected untrained and trained phenotypes for salt associative learning when not compromised by starvation or contamination.

      We understand that the original criteria may have appeared to introduce arbitrary bias in data selection. To address this concern, we have removed these criteria from the revised manuscript from page 50.

      Minor comments:

      1) Related to Major comment 1), the successful effect of the neuron-specific TurboID procedure was not evaluated. Authors obtained both TurboID and Non-Tg proteome data. Do they see enrichment of neuron-specific proteins? This can be easily tested, for example by using the list of neuron-specific genes by Kaletsky et al. (http://dx.doi.org/10.1038/nature16483 or http://dx.doi.org/10.1371/journal.pgen.1007559), or referring to the CeNGEN data.

      We thank this Reviewer for this helpful suggestion, which was echoed by Reviewer 3 (Major Comment 1). As indicated in the response to R3 above, the revised manuscript now includes Table 1 as a tissue-specific analysis of the learning proteome, using the single neuron RNA-Seq database CeNGEN to identify the proportion of neuronal proteins from each biological replicate of mass spectrometry data. Generally, we observed a range of 87-95% of proteins corresponded to genes from the CeNGEN database that had been detected in neurons, providing evidence that the TurboID enzyme was able to target the neuronal proteome as expected. Table 1 is now described in the main text of the revised work on pages 16 & 17.
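
The per-replicate proportion reported in Table 1 amounts to a set overlap between the proteins detected in a replicate and a reference list of genes detected in neurons. The sketch below shows that calculation under stated assumptions: the function name is hypothetical, the gene identifiers are placeholders rather than real CeNGEN entries, and how the reference list is exported from CeNGEN is not covered here.

```python
def neuronal_fraction(detected_proteins, neuronal_genes):
    """Fraction of detected proteins whose gene is in the neuronal reference set."""
    detected = set(detected_proteins)
    if not detected:
        return 0.0
    return len(detected & set(neuronal_genes)) / len(detected)

# Placeholder identifiers standing in for a neuronal reference list and for
# the proteins detected in each mass spectrometry replicate.
neuronal_reference = {"gene_a", "gene_b", "gene_c", "gene_d"}
replicate_hits = {
    "replicate_1": {"gene_a", "gene_b", "gene_c", "gene_x"},
    "replicate_2": {"gene_a", "gene_d"},
}
fractions = {
    name: neuronal_fraction(hits, neuronal_reference)
    for name, hits in replicate_hits.items()
}
for name, frac in fractions.items():
    print(f"{name}: {frac:.0%} of detected proteins map to neuronal genes")
```

Running the same function on the Non-Tg lists would give the baseline that the reviewer's enrichment question asks about.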

      2) The behavioural paradigm needs to be described accurately. Page 5, line 16-17, "C. elegans normally have a mild attraction towards higher salt concentration": in fact, C. elegans raised on NGM plates, which include approximately 50mM of NaCl, is attracted to around 50mM of NaCl (Kunitomo et al., Luo et al.) but not 100-200 mM.

      We thank the Reviewer for pointing this out. We agree that clarification is necessary. The revised text reads as follows on page 5: “C. elegans are typically grown in the presence of salt (usually ~ 50 mM) and display an attraction toward this concentration when assayed for chemotaxis behaviour on a salt gradient (Kunitomo et al., 2013, Luo et al., 2014). Training/conditioning with ‘no salt + food’ partially attenuates this attraction (group referred to ‘trained’).”

      Authors call this assay "salt associative learning", which refers to the fact that worms associate salt concentration (CS) and either presence or absence of food (appetitive or aversive US) during conditioning (Kunitomo et al., Luo et al., Nagashima et al.) but they are looking at only association with presence of food, and for proteome analysis they only change the CS (NaCl concentration, as discussed in Discussion, p24, lines 4-5). It is better to attempt to avoid confusion to the readers in general.

      Thank you Reviewer 4 for highlighting this clarity issue. We clarify our definition of “salt associative learning” for the purpose of this study in the revised manuscript on page 6 with the following text:

      “Similar behavioural paradigms involving pairings between salt/no salt and food/no food have been previously described in the literature (Nagashima et al. 2019). Here, learning experiments were performed by conditioning worms with either ‘no salt + food’ (referred to as ‘salt associative learning’) or ‘salt + no food’ (called ‘salt aversive learning’).”

      3) page 32, line 23: the wording "excluding" is obscure and misleading because the elo-6 gene was included in the analysis.

      We appreciate this Reviewer for pointing out this misleading comment, which was unintentional. We have now removed it from the text (on page 21).

      4) Typo at page 24, line 18: "that ACC-1" -> "than ACC-1".

      This has been corrected (on page 37).

      5) Reference. In "LEO, T. H. T. et al.", given names and surnames are flipped for all authors. Also, the paper has been formally published (http://dx.doi.org/10.1016/j.cub.2023.07.041).

      We appreciate the Reviewer drawing our attention to this – the reference has been corrected and updated.

      I would like to express my modest cross comments on the reviews:

      1) Many of the reviewers comment on the shortage in the quantitative nature of the proteome analysis, so it seems to be a consensus.

      Thank you Reviewer 4 for this feedback. We appreciate the benefit in performing quantitative mass spectrometry, in that it provides an additional way to parse molecular mechanisms in a biological process (e.g., fold-changes in protein expression induced by learning). However, we note that quantitative mass spectrometry is challenging to integrate with TurboID due to the requirement to enrich for biotinylated peptides during sample processing (we now mention this on page 39). Nevertheless, it would be exciting to see this approach performed in a future study.

      To address the limitations of our original qualitative approach and enhance the clarity and utility of our dataset, we have made the following revisions in the manuscript:

      • Candidate selection criteria: We now clearly define how candidates were selected for functional testing, based on their frequency across biological replicates. Specifically, “strong candidates” were detected in three or more replicates, while “weak candidates” appeared in two or fewer.
      • Frequency-based representation (Table 2): We appreciate the suggestion by Reviewer 4 (Major Comment 1) to quantify differences between high-salt control and trained groups. We now provide the frequency-based representation of the candidates tested in this study within our proteomics data in Table 2. This data showed that many of the tested candidates were more frequently detected in trained worms compared to high-salt controls. This includes both strong and weak candidates.

      We hope these additions help clarify our approach and demonstrate the value of the dataset, even within the constraints of qualitative proteomics.

      2) Also, tissue- or cell-specificity of the identified proteins was commonly discussed. In reviewer #3's first Major comment, the appearance of non-neuronal proteins in the list was pointed out, which corroborates my (#4 reviewer's) question on successful identification of neuronal proteins by this method. On the other hand, reviewer #1 pointed out subset neuron-specific proteins in the list. Obviously, these issues need to be systematically described by the authors.

      We agree with Reviewer 4 that these analyses provide a critical angle of analysis that is not explored in the original manuscript.

      Tissue analysis (Reviewer 3 Major Comment 1): We have used the single neuron RNA-Seq database CeNGEN to identify that 87-95% (i.e. a large majority) of proteins identified across replicates corresponded to genes detected in neurons. These findings support that the TurboID enzyme was able to target the neuronal proteome as expected. Table 1 provides this information and is now described in the main text of the revised work on page 16.

      Neuron class analyses (Reviewer 1 Major Comment 2): In response, we have used the suggested WormBase gene enrichment tool and CeNGEN. We specifically input proteins from the learning proteome into WormBase, after filtering for proteins unique to TurboID trained animals. For CeNGEN, we compared genes/proteins from control worms and trained worms to identify potential neurons that may be involved in this learning paradigm.

      Briefly, we highlight a range of neuron classes known to function in learning (e.g., RIS interneurons), cells that affect behaviour but have not been explored in learning (e.g., IL1 polymodal neurons), and neurons whose functions are unknown (e.g., pharyngeal neuron I3). Corresponding text for this new analysis has been added on pages 16-20, with a new table and figure added to illustrate these findings (Table S7 & Figure 4). Methods are detailed on pages 50-51.

      3) Given reviewer #1's OPTIONAL Major comment, as an expert of behavioral assays in C. elegans, I would like to comment based on my experience that mutants received from Caenorhabditis Genetics Center or other labs often lose the phenotype after outcrossing by the wild type, indicating that a side mutation was responsible for the observed behavioral phenotype. Therefore, outcrossing may be helpful and easier than rescue experiments, though the latter are of course more accurate.

      Thank you for this suggestion. To address the potential involvement of background mutations, we have done experiments with backcrossed versions of mutants tested where possible, as shown in Figure 6. We found that F46H5.3(-) mutants maintained enhanced learning capacity after backcrossing with wild type, compared to their non-backcrossed mutant line. This was in contrast to C30G12.6(-) animals which lost their enhanced learning phenotype following backcrossing using wild type worms. This is described in the text on pages 24-26.

      4) Just let me clarify the first Minor comment by reviewer #2. Authors described that the kin-2 mutant has abnormality in "salt associative learning" and "salt aversive learning", according to authors' terminology. In this comment by reviewer #2, "gustatory associative learning" probably refers to both of these assays.

      Reviewer 4 is correct. We have amended the wording appropriately on page 31 to clarify our meaning to address Reviewer 2’s comment.

      • “Although kin-2(ce179) mutants were not shown to impact salt aversive learning, they have been reported previously to display impaired intermediate-term memory (but intact learning and short-term memory) for butanone appetitive learning (Stein and Murphy, 2014).”

      5) There seem to be several typos in reviewer #1's Minor comments.

      "In Page 9, Lines 17-18" -> "Page 8, Lines 17-18".

      "Page 8, Line 24" -> "Page 7, Line 24".

      "I would suggest to remove figure 3" -> "I would suggest to remove figure 2"

      "summary figure similar to Figure 4" -> "summary figure similar to Figure 3"

      "In the discussion Page 24, Line 14" -> "In the discussion Page 23, Line 14"

      (I note that because a top page was inserted in the "merged" file but not in art file for review, there is a shift between authors' page numbers and pdf page numbers in the former.)

      It would be nice if reviewer #1 can confirm on these because I might be wrong.

      We appreciate Reviewer 4 noting this, and can confirm that these are the correct references (as indicated by Reviewer 1 in their cross-comments).

      Reviewer #4 (Significance (Required)):

      1) Total neural proteome analysis has not been conducted before for learning-induced changes, though transcriptome analysis has been performed for odor learning (Lakhina et al., http://dx.doi.org/10.1016/j.neuron.2014.12.029). This guarantees the novelty of this manuscript, because for some genes, protein levels may change even though mRNA levels remain the same. We note an example in which a proteome analysis utilizing TurboID, though not the comparison between trained/control, has led to finding of learning related proteins (Hiroki et al., http://dx.doi.org/10.1038/s41467-022-30279-7). As described in the Major comments 1) in the previous section, improvement of data presentation will be necessary to substantiate this novelty.

      We appreciate this thoughtful feedback. We agree that while the neuronal transcriptome has been explored in Lakhina et al., 2015 for C. elegans in the context of memory, our study represents the first to examine learning-induced changes in the total neuronal proteome. We particularly agree with the statement that “for some genes, protein levels may change even though mRNA levels remain the same”. This is essential rationale that we now discuss on page 42.

      Additionally, we acknowledge the relevance of the study by Hiroki et al., 2022, which used TurboID to identify learning-related proteins, though not in a trained versus control comparison. Our work builds on this by directly comparing trained and control conditions, thereby offering new insights into the proteomic landscape of learning. This is now clarified on page 36.

      To substantiate the novelty and significance of our approach, we have revised the data presentation throughout the manuscript, including clearer candidate selection criteria, frequency-based representation of proteomic hits (Table 2), and neuron-specific enrichment analyses (Table S7 & Figure 4). We hope these improvements help convey the unique contribution of our study to the field.

      2) The authors found six mutants with abnormal salt learning (Fig. 4). These genes have not previously been linked to this abnormality, providing novel knowledge to readers, especially those who work on C. elegans behavioural plasticity. In particular, the involvement of acetylcholine neurotransmission has not been addressed before. Although the site of action (neurons involved) has not been tested in this manuscript, the work will open an avenue to further determine how acetylcholine receptors, the cAMP pathway, etc. influence the learning process.

      Thank you, Reviewer 4, for this encouraging feedback. To further strengthen the study and expand its relevance, we have tested additional mutants in response to Reviewer 3’s comments, as shown in Figures 6 & S7. These results provide even more candidate genes and pathways for future exploration, enhancing the significance and impact of our study.

  5. www.tripleeframework.com
    1. where the technology may simply be replacing a traditional method of instruction

      I think it is very important to remember this as an educator and parent. We have to be sure to maximize use and make it beneficial and worthwhile, not just replacing other instruction.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      A) The presentation of the paper must be strengthened. Inconsistencies, mislabelling, duplicated text, typos, and inappropriate colour code should be changed.

      We spotted and corrected several inconsistencies and mislabelling issues throughout the text and figures. Thanks!  

      B) Some claims are not supported by the data. For example, the sentence that says that "adolescent mice showed lower discrimination performance than adults" (l.22) should be rewritten, as the data does not show that for the easy task (Figure 1F and Figure 1H).

      We carefully reviewed the specific claims and fixed some of the wording so it adheres to the data shown.

      C) In Figure 7 for example, are the quantified properties not distinct across primary and secondary areas?

      We have now carried out additional analyses to test this. We found that while AUDp and AUDv exhibit distinct tuning properties, they show similar differences between adolescent and adult neurons (see Supplementary Table 6, Fig. S7-1a-h). Note that TEa and AUDd could not be evaluated due to low numbers of modulated neurons in this protocol.

      D) Some analysis interpretations should be more cautious. (..) A lower lick rate in general could reflect a weaker ability to withhold licking- as indicated on l.164, but also so many other things, like a lower frustration threshold, lower satiation, more energy, etc).

      That is a fair comment, and we refined our interpretations. Moreover, we also addressed whether impulsiveness impacted lick rates. In the Educage, we found that adolescent mice had shorter ITIs only after FAs (Fig. S2-1). In the head-fixed setup, we examined (1) the proportion of ITIs where licks occurred (Fig. S3-1c) and (2) the number of licks in these ITIs (Fig. S3-1d). We found no differences between adolescents and adults, indicating that the differences observed in the main task are not due to general differences in impulsiveness (Fig. S2-1, Fig. S3-1c, d). Finally, we note that potential differences in satiation were already addressed in the original manuscript by carefully examining the number of trials completed across the session. See also Reviewer 3, comment #1 below.

      Reviewer #2 (Public review):

      A) For some of the analyses that the authors conducted it is unclear what the rationale behind them is and, consequently, what conclusion we can draw from them.

      We reviewed the manuscript carefully and revised the relevant sections to clarify the rationale behind the analyses. See detailed responses to all the reviewer’s specific comments.

      B) The results of optogenetic manipulation, while very interesting, warrant a more in-depth discussion.

      We expanded our discussion on these experiments (L495-511) and also added an additional analysis to strengthen our findings (Fig. S3-2e).

      Reviewer #3 (Public review):

      (1) The authors report that "adolescent mice showed lower auditory discrimination performance compared to adults" and that this performance deficit was due to (among other things) "weaker cognitive control". I'm not fully convinced of this interpretation, for a few reasons. First, the adolescents may simply have been thirstier, and therefore more willing to lick indiscriminately. The high false alarm rates in that case would not reflect a "weaker cognitive control" but rather, an elevated homeostatic drive to obtain water. Second, even the adult animals had relatively high (~40%) false alarm rates on the freely moving version of the task, suggesting that their behavior was not particularly well controlled either. One fact that could help shed light on this would be to know how often the animals licked the spout in between trials. Finally, for the head-fixed version of the task, only d' values are reported. Without the corresponding hit and false alarm rates (and frequency of licking in the intertrial interval), it's hard to know what exactly the animals were doing.

      First, as requested, we added the Hit rates and FA rates for the head-fixed task (Fig. S3-1a). Second, as requested by the reviewer, we performed additional analyses in both the Educage and head-fixed versions of the task. Specifically, we analyzed the ITI duration following each trial outcome. We found that adolescent mice had shorter ITIs only after FAs (Fig. S2-1). In the head-fixed setup, we examined (1) the proportion of ITIs during which licks occurred (Fig. S3-1c) and (2) the number of licks in these ITIs (Fig. S3-1d). We found no differences between adolescents and adults, indicating that the differences observed in the main task are not due to general differences in impulsiveness (Fig. S2-1, Fig. S3-1c, d). See also comment #D of reviewer #1 above.

      B) There are some instances where the citations provided do not support the preceding claim. For example, in lines 64-66, the authors highlight the fact that the critical period for pure tone processing in the auditory cortex closes relatively early (by ~P15). However, one of the references cited (ref 14) used FM sweeps, not pure tones, and even provided evidence that the critical period for this more complex stimulus occurred later in development (P31-38). Similarly, on lines 72-74, the authors state that "ACx neurons in adolescents exhibit high neuronal variability and lower tone sensitivity as compared to adults." The reference cited here (ref 4) used AM noise with a broadband carrier, not tones.

      We carefully checked the text to ensure that each claim is accurately supported by the corresponding reference.

      C) Given that the authors report that neuronal firing properties differ across auditory cortical subregions (as many others have previously reported), why did the authors choose to pool neurons indiscriminately across so many different brain regions?

      We appreciate the reviewer’s concern. While we acknowledge that pooling neurons across auditory cortical subregions may obscure region-specific effects, our primary focus in this study is on developmental differences between adolescents and adults, which were far more pronounced than subregional differences.

      To address this potential limitation: (1) We analyzed firing differences across subregions during task engagement (see Fig. S4-1, S4-2, S4-3; Supplementary Tables 2 and 3). (2) We have now added new analyses for the passive listening condition in AUDp and AUDv (Fig. S7-1; Supplementary Table 6).

      These analyses support our conclusion that developmental stage has a greater impact on auditory cortical activity than subregional location in the contexts examined. For clarity and cohesion, the main text emphasizes developmental differences, while subregional analyses are presented in the Supplement.

      D) And why did they focus on layers 5/6? (Is there some reason to think that age-related differences would be more pronounced in the output layers of the auditory cortex than in other layers?)

      We agree that other cortical layers, particularly supragranular layers, are important for auditory processing and plasticity. Our focus on layers 5/6 was driven by both methodological and biological considerations. Methodologically, our electrode penetrations were optimized to span multiple auditory cortical areas, and deeper layers provided greater mechanical stability for chronic recordings. Biologically, layers 5/6 contain the principal output neurons of the auditory cortex and are well-positioned to influence downstream decision-making circuits. We acknowledge the limitation of our recordings to these layers in the manuscript (L268; L464-8).

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) The presentation of the paper must be strengthened. As it is now, it makes it difficult to appreciate the strengths of the results. Here are some points that should be addressed:

      a) The manuscript is full of inconsistencies that should be fixed to improve the reader's understanding. For example, the description on l.217 and the Figure. S3-1b, the D' value of 0 rounded to 0.01 on l. 735 (isn't it rather the z-scored value that is rounded? A D' of 0 is not a problem), the definition of lick bias on l. 750 and the values in Fig.2, the legend of Figure 7F and what is displayed on the graph (is it population sparseness or responsiveness?), etc.

      We adjusted the legend and description of former Fig. S3-1b (now Fig. S3-2b).

      We now clarify that the rounded values refer to z-scored hit and false alarm rates that we used in the d’ calculation. We adjusted the definition of the lick bias in Fig. 2 and Fig. S3-1b (L804).

      We replaced ‘population responsiveness’ with ‘population sparseness’ throughout the figures, legend and the text.

      b) References to figures are sometimes wrong (for example on l. 737,739).

      c) Some text is duplicated (for example l. 814 and l. 837).

      d) Typos should be corrected (for example l. 127, 'the', l. 787, 'upto').

      We deleted the incorrect references of this section, removed the duplicated text, and corrected the typos.

      e) Color code should be changed (for example the shades of blue for easy and hard tasks - they are extremely difficult to differentiate).

      After consideration, we decided to retain the blue color code (i.e., Fig. 1d, Fig. 3d, Fig. 4e-g, Fig. 5c, Fig. 6d–g), where the distinction between the shades of blue appears sufficiently clear and maintains visual consistency and aesthetic appeal. We did, however, make changes in the other color codes (Fig. 4, Fig. 5, Fig. 6, Fig. 7).

      f) Figure design should be improved. For example, why is a different logic used for displaying Figure 5A or B and Figure 1E?

      We adjusted the color scheme in Fig. 5. We chose to represent the data in Fig. 5 according to task difficulty, as this arrangement best illustrates the more pronounced deficits in population decoding in adolescents during the hard task.

      f) Why use a 3D representation in Figure 4G?

      The 3D representation in Fig. 4g was chosen to illustrate the 3-way interactions between onset-latency, maximal discriminability, and duration of discrimination.

      g) Figure 1A, lower right panel- should "response" not be completed by "lick", "no lick"?

      We changed the labels to “Lick” and “No Lick” in Fig. 1a.

      h) l.18 the age mentioned is misleading, because the learning itself actually started 20 days earlier than what is cited here.

      Corrected.

      i) Explain what AAV5-... is on l.212.

      We added an explanation of virus components (see L216-220).

      (2) The comparison of CV in Figure 2 H-J is interesting. I am curious to know whether the differences in the easy and hard tasks could be due to a decrease in CV in adults, rather than an increase in CV in adolescents? Also, could the difference in J be due to 3 outliers?

      We agree that the observed CV differences may reflect a reduction in variability in adults rather than an increase in adolescents. We have revised the Results section accordingly to acknowledge this interpretation.

      Regarding the concern about potential outliers in Fig. 2J, we tested the data for outliers using the isoutlier function in MATLAB (defining outliers as values exceeding three standard deviations from the mean) and found no such cases.
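
      The three-standard-deviation rule described above can be sketched as follows. This is a minimal Python illustration, not the authors' MATLAB code: the function name `three_sigma_outliers` and the use of the sample standard deviation are assumptions made for the example.

```python
import numpy as np


def three_sigma_outliers(values):
    """Return a boolean mask marking values more than 3 SD from the mean.

    Hypothetical helper for illustration; the sample standard deviation
    (ddof=1) is an assumption about how the spread is estimated.
    """
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    sd = x.std(ddof=1)  # sample standard deviation
    if sd == 0:
        # All values identical: nothing can be an outlier.
        return np.zeros(x.shape, dtype=bool)
    return np.abs(x - mean) > 3 * sd


# Example: 19 similar values and one extreme value.
data = [1.0] * 19 + [100.0]
mask = three_sigma_outliers(data)
print(int(mask.sum()), "value(s) flagged")
```

      Note that with a sample standard deviation, the largest possible deviation in a sample of n values is (n - 1)/sqrt(n) standard deviations, so this rule cannot flag any point when n is 10 or fewer.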

      (3) Figure 2c shows that there is no difference in perceptual sensitivity between adolescents and adults, whereas the conclusion from Figure 4 is that adolescents exhibit lower discriminability in stimulus-related activity. Aren't these results contradictory?

      This is a nuanced point. The similar slopes of the psychometric functions (Fig. 2c) indicating comparable perceptual sensitivity and the lower AUC observed in the ACx of adolescents (Fig. 4) do not necessarily contradict each other. These two measures capture related but distinct issues: psychometric slopes reflect behavioral output, which integrates both sensory encoding and processing downstream of ACx, while the AUC analysis reflects stimulus-related neural activity in ACx, which may still include decision-related components.

      Note that stimulus-related neural discriminability outside the context of the task is not different between adolescent and adult experts (Fig. 7h; p = 0.9374, Kruskal-Wallis Test after Tukey-Kramer correction for multiple comparisons; not discussed in the manuscript). This suggests that there are differences that emerge when we measure during behavior. Also note that behavior may rely on processing beyond ACx, and it is possible that downstream areas compensate for weaker cortical discriminability in adolescents, but this issue merits further investigation.

      (4) Why do you think that the discrimination in hard tasks decreases with learning (Figure 6D vs Figure 6F)?

      This is another nuanced point, and we can only speculate at this stage. While it may appear counterintuitive that single-neuron discriminability (AUC) for the hard task is reduced after learning (Fig. 6D vs. 6F), we believe this may reflect a shift in sensory coding in expert animals. In a recent study (Haimson et al., 2024; Science Advances), we found that learning alters single-neuron responses in the easy versus hard task in complex and distinct ways, which may account for this result. It is also possible that, in expert mice, top-down mechanisms such as feedback from higher-order areas act to suppress or stabilize sensory responses in auditory cortex, reducing the apparent stimulus selectivity of single neurons (e.g., AUC), even as behaviorally relevant information is preserved or enhanced at the population level.

      Reviewer #2 (Recommendations for the authors):

      This is very interesting work and I enjoyed reading the manuscript. See below for my comments, queries and suggestions, which I hope will help you improve an already very good paper.

      We thank the reviewer for the meticulous and thoughtful review.

      (1) Line 107: x-axis of panel 1e says 'pre-adolescent'.

      (2) Line 130: replace 'less' with 'fewer'.

      (3) Line 153: 'both learned and catch trials': I find the terminology here a bit confusing. I would typically understand a catch trial to be a trial without a stimulus but these 'catch' trials here have a stimulus. It's just that they are not rewarded/punished. What about calling them probe trials instead?

      We corrected the labelling (1), reworded to ‘fewer’ and ‘probe trials’ (2,3).

      (4) Line 210: The results of the optogenetics experiments are very interesting. In particular, because the effect is so dramatic and much bigger than what has been reported in the literature previously, I believe. Lick rates are dramatically reduced suggesting that the mice have pretty much stopped engaging in the task and the authors very rightly state that the 'execution' of the behavior is affected. I think it would be worth discussing the implications of these results more thoroughly, perhaps also with respect to some of the lesion work. Useful discussions on the topic can be found, for instance, in Otchy et al., 2015; Hong et al., 2018; O'Sullivan et al., 2019; Ceballo et al., 2019 and Lee et al., 2024. Are the mice unable to hear anything in laser trials and that is why they stopped licking? If they merely had trouble distinguishing them then we would perhaps expect the psychometric curves to approach chance level, i.e. to be flat near the line indicating a lick rate of 0.5. Could the dramatic decrease in lick rate be a motor issue? Can we rule out spillover of the virus to relevant motor areas? (I understand all of the 200nL of the virus were injected at a single location) Or are the effects much more dramatic than what has been reported previously simply because the GtACR2 is much more effective at silencing the auditory cortex? Could the effect be down to off-target effects, e.g. by removing excitation from a target area of the auditory cortex, rather than the disruption of cortical processing?

      We have now expanded the discussion in the manuscript to more thoroughly consider alternative interpretations of the strong behavioral effect observed during ACx silencing (L495–511). In particular, we acknowledge that the suppression of licking may reflect not only impaired sensory discrimination but also broader disruptions to arousal, motivation, or motor readiness. We also discuss the potential impact of viral spread, circuit-level off-target effects, and the potency of GtACR2 as possible contributors. We highlight the need for future work using more graded or temporally precise manipulations to resolve these issues.

      (5) Line 226: Reference 19 (Talwar and Gerstein 2001) is not particularly relevant as it is mostly concerned with microstimulation-induced A1 plasticity. There are, however, several other papers that should be cited (and potentially discussed) in this context. In particular, O'Sullivan et al., 2019 and Ceballo et al., 2019 as these papers investigate the effects of optogenetic silencing on frequency discrimination in head-fixed mice and find relatively modest impairments. Also relevant may be Kato et al., 2015 and Lee et al., 2024, although they look at sound detection rather than discrimination.

      We changed the references and pointed the reader to the new Discussion section.

      (6) Line 253: 'engaged [in] the task.

      (7) Figure 4: It appears that panel S4-1d is not referred to anywhere in the main text.

      Fixed.

      (8) Line 260: Might be useful to explain a bit more about the motivation behind focusing on L5/L6. Are there mostly theoretical considerations, i.e. would we expect the infragranular layers to be more relevant for understanding the difference in task performance? Or were there also practical considerations, e. g. did the data set contain mostly L5/L6 neurons because those were easier to record from given the angle at which the probe was inserted? If those kinds of practical considerations played a role, then there is nothing wrong with that but it would be helpful to explain them for the benefit of others who might try a similar recording approach.

      There were no deep theoretical considerations for targeting L5/6.  Our focus on layers 5/6 was driven by both methodological and biological considerations. Methodologically, our electrode penetrations were optimized to span multiple auditory cortical areas, and deeper layers provided greater mechanical stability for chronic recordings. Biologically, layers 5/6 contain the principal output neurons of the auditory cortex and are well-positioned to influence downstream decision-making circuits. We acknowledge the limitation of our recordings to these layers in the manuscript (L268; L463–467). See also comment D of reviewer 3.

      (9) Supplementary Table 2: The numbers in brackets indicate fractions rather than percentages.

      Fixed.

      (10) Figure S4-3: The figure legend implies that the number of neurons with significant discriminability for the hard stimulus and significant discriminability for choice was identical. (adolescent neurons = 368, mice = 5, recordings = 10; adult n = 544, mice = 6, recordings = 12 in both cases). Presumably, that is not actually the case and rather the result of a copy/paste operation gone wrong. Furthermore, I think it would be helpful to state the fractions of neurons that can discriminate between the stimuli and between the choices that the animal made in the main text.

      Thank you for spotting the mistake. We corrected the n’s and added the percentage of neurons that discriminate stimulus and choice in the main text and the figure legend.

      (11) Line 301: 'We used a ... decoder to quantify hit versus correct reject trial outcomes': I'm not sure I understand the rationale here. For the single unit analysis hit and false alarm trials were compared to assess their ability to discriminate the stimuli. FA and CR trials were compared to assess whether neurons can encode the choice of the mice. But the hit and CR trials which are contrasted here differ in terms of both stimulus and behavior/choice so what is supposed to be decoded here, what is supposed to be achieved with this analysis?

      Thank you for this important point. You're correct that comparing hit and CR trials captures differences in both stimulus and choice, or task-related differences. We chose this contrast for the population decoding analysis to achieve higher and more balanced trial counts per session, which are necessary for the reliability of the analysis. While this approach does not isolate stimulus from choice encoding, it provides an overall measure of how well population activity distinguishes task-relevant outcomes. We explicitly acknowledge this issue in L313-314.

      (12) Line 332: What do you mean when you say the novice mice were 'otherwise fully engaged' in the task when they were not trained to do the task and are not doing the task?

      By "otherwise fully engaged," we mean that novice mice were actively participating in the task environment, similar to expert mice — they were motivated by thirst and licked the spout to obtain water. The key distinction is that novice mice had not yet learned the task rules and likely relied on trial-and-error strategies, rather than performing the task proficiently.

      (13) Line 334: 'regardless of trial outcome': Why is the trial outcome not taken into account? What is the rationale for this analysis? Furthermore, in novice mice a substantial proportion of the 'go' trials are misses. In expert mice, however, the proportion of 'miss trials' (and presumably false alarms) will by definition be much smaller. Given this, I find it difficult to interpret the results of this section.

      This approach was chosen to reliably decode a sufficient number of trials for each task difficulty (i.e. expert mice predominantly performed CRs on No-Go trials and novice mice often showed FAs). Utilizing all trial outcomes ensured that we had enough trials for each stimulus type to accurately estimate the AUCs. This approach avoids introducing biases due to uneven trial numbers across learning stages.

      (14) Line 378: 'differences between adolescents and adults arise primarily from age': Are there differences in any of the metrics shown in 7e-h between adolescents and adults?

      We confirm that differences between adolescents and adults are indeed present in some metrics but not others in Figure 7e–h. Specifically, while tuning bandwidth was similar in novice animals, it was significantly lower in adult experts (Fig. 7e; novice: p = 0.0882; expert: p = 0.0001 Kruskal-Wallis Test after Tukey-Kramer correction for multiple comparisons; not discussed in the manuscript). The population sparseness was similar in both novice and expert adolescent and adult neurons (Fig. 7f; novice: p = 0.2873; expert: p = 0.1017, Kruskal-Wallis Test after Tukey-Kramer correction for multiple comparisons; not discussed in the manuscript). The distance to the easy go stimulus was similar in novice animals, but lower in adult experts (Fig. 7g; novice: p = 0.7727; expert: p = 0.0001, Kruskal-Wallis Test after Tukey-Kramer correction for multiple comparisons; not discussed in the manuscript). The neuronal d-prime was similar in both novice and expert adolescent and adult neurons (Fig. 7h; novice: p = 0.7727; expert: p = 0.0001, Kruskal-Wallis Test after Tukey-Kramer correction for multiple comparisons; not discussed in the manuscript).

      (15) Line 475: '...well and beyond...': something seems to be missing in this statement.

      (16) Line 487: 'onto' should be 'into', I think.

      (17) Line 610 and 613: '3 seconds' ... '2.5 seconds': Was the response window 3s or 2.5s?

      (18) Line 638: 'set' should be 'setup', I believe.

      All the mistakes mentioned above were fixed. Thanks.

      (19) Line 643: 'Reward-reinforcement was delayed to 0.5 seconds after the tone offset': Presumably, if they completed their fifth lick later than 0.5 seconds after the tone, the reward delivery was also delayed?

      Apologies for the lack of clarity. In the head-fixed version, there was no lick threshold. Mice were reinforced after a single lick. If that lick occurred after the 0.5-second reinforcement delay following tone offset, the reward or punishment was delivered immediately upon licking.

      (20) Line 661: 'effect [of] ACx'.

      (21) Line 680: 'a base-station connected to chassis'. The sentence sounds incomplete.

      (22) Line 746: 'infliction', I believe, should say 'inflection'.

      (23) Line 769: 'non-auditory responsive units': Shouldn't that simply say 'non-responsive units'? The way it is currently written I understand it to mean that these units were responsive (to some other modality perhaps) but not to auditory stimulation.

      (24) Line 791: 'bins [of] 50ms'.

      (25) Line 811: 'all of' > 'of all'.

      (26) Line 814: Looks like the previous paragraph on single unit analysis was accidentally repeated under the wrong heading.

      (27) Line 817: 'encoded' should say 'calculated', I believe.

      All the mistakes mentioned above were fixed. Thanks.

      (28) Line 869: 'bandwidth of excited units': Not sure I understand how exactly the bandwidth, i.e. tuning width was measured.

      We acknowledge that our previous answer was unclear and expanded the Methods section. To calculate bandwidth, we identified significant tone-evoked responses by comparing activity during the tone window to baseline firing rates at 62 dB SPL (p < 0.05). For each neuron, we counted the number of contiguous frequencies with significant excitatory responses, subtracting isolated false positives to correct for chance. We then converted this count into an octave-based bandwidth by multiplying the number of frequency bins by the octave spacing between them (0.1661 octaves per step).
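
      The conversion from a count of significant frequency bins to an octave-based bandwidth can be sketched as below. This is a simplified Python illustration under stated assumptions: the helper names are hypothetical, the bandwidth is taken from the longest contiguous run of significant bins, and the chance correction for isolated false positives mentioned above is not reproduced. Only the 0.1661 octaves-per-step spacing comes from the response itself.

```python
OCTAVES_PER_STEP = 0.1661  # octave spacing between adjacent tone frequencies (from the text)


def longest_contiguous_run(significant):
    """Length of the longest run of consecutive True entries."""
    best = 0
    current = 0
    for flag in significant:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def bandwidth_octaves(significant, octaves_per_step=OCTAVES_PER_STEP):
    """Bandwidth in octaves from a per-frequency significance vector.

    Assumption for illustration: bandwidth is the longest contiguous run
    of significant excitatory bins times the octave spacing.
    """
    return longest_contiguous_run(significant) * octaves_per_step


# Example: three adjacent significant frequencies plus one isolated bin.
sig = [False, True, True, True, False, True, False]
print(round(bandwidth_octaves(sig), 4))
```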

      (29) Line 871: 'population sparseness': Is that the fraction of tone frequencies that produced a significant response? I would have thought that this measure is very highly correlated to your measure of bandwidth, to the point of being redundant, but I may have misunderstood how one or the other is calculated. Furthermore, the Y label of Figure 7f says 'responsiveness' rather than sparseness and that would seem to be the more appropriate term because, unless I am misunderstanding this, a larger value here implies that the neuron responded to more frequencies, i.e. in a less sparse manner.

      We have clarified the use of the term "population sparseness" and updated the Y-axis label in Figure 7f to better reflect this measure. This metric reflects the fraction of tone–attenuation combinations that elicited a significant excitatory response across the entire population of neurons, not within individual units.

      While this measure is related to bandwidth, it captures a distinct property of the data. Bandwidth quantifies how broadly or narrowly a single neuron responds across frequencies at a fixed intensity, whereas population sparseness reflects how distributed responsiveness is across the population as a whole. Although the two measures are related, since broadly tuned neurons often contribute to lower population sparseness, they capture distinct aspects of neural coding and are not redundant.

      (30) Line 881: I think this line should refer to Figure 7h rather than 7g.

      Fixed.

      Reviewer #3 (Recommendations for the authors):

      (1) In the Educage, water was only available when animals engaged in the task; however, there is no mention of whether/how animal weight was monitored.

      In the Educage, mice had continuous access to water by voluntarily engaging in the task, which they could perform at any time. Although body weight was not directly monitored, water access was essentially ad libitum, and mice performed hundreds of trials per day, thereby ensuring sufficient daily intake. This approach allowed us to monitor hydration (ad libitum food is supplied in the home cage). The 24/7 setup, including automated monitoring of trial counts and water consumption, was reviewed and approved by our institutional animal care and use committee (IACUC).

      (2) In Figure 2B-C and Figure 2E, the y-axis reads "lick rate". At first glance, I took this to mean "the frequency of licking" (i.e. an animal typically licks at a rate of 5 Hz). However, what the authors actually are plotting here is the proportion of trials on which an animal elicited >= 5 licks during the response window (i.e. the proportion of "yes" responses). I recommend editing the y-axis and the text for clarity.

      We replaced the y-label and adjusted the figure legend (Fig. 2).

      (3) I didn't see any examples of raw (filtered) voltage traces. It would be worth including some to demonstrate the quality of the data.

      We have added an example of a filtered voltage trace aligned to tone onset in Fig. S4-1a to illustrate data quality. In addition, all raw and processed voltage traces, along with relevant analysis code, are available through our GitHub repository and the corresponding dataset on Zenodo.

      (4) The description of the calculation of bias (C) in the methods section (lines 749-750) is incorrect. The correct formula is C = -0.5 * [z(hit rate) + z(fa rate)]. I believe this is the formula that the authors used, as they report negative C values. Please clarify or correct.

      Thanks for spotting this. It is now corrected.
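
      For reference, the formula given by the reviewer can be written as a short Python sketch. The function name is illustrative, and z is taken to be the inverse of the standard normal cumulative distribution. The sketch does not handle hit or false-alarm rates of exactly 0 or 1, which would need a correction that is not described here.

```python
from statistics import NormalDist


def criterion(hit_rate, fa_rate):
    """Signal-detection bias: C = -0.5 * [z(hit rate) + z(false-alarm rate)]."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return -0.5 * (z(hit_rate) + z(fa_rate))
```

      With this sign convention, a tendency to respond "yes" (high hit and false-alarm rates) yields a negative C, and symmetric rates such as 0.8 and 0.2 yield a C of zero.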

      (5) The authors use the terms 'naïve' and 'novice' interchangeably. I suggest sticking with one term to avoid potential confusion.

      (6) Multiple instances: "less trials/day" should be "fewer trials/day"

      (7) Supplementary Table 2: The values reported are proportions, not percentages. Please correct.

      (8) Line 270: Table 2 does not show the number of neurons in the dataset categorized by region. Perhaps the authors meant Supplementary Table 2?

      Fixed. Thank you for pointing these mistakes out.

      (9) Figure 5C: the data from the hard task are entirely obscured by the data from the easy task. I recommend splitting it into two different plots.

      We agree and split the decoding of the easy and the hard task into two graphs (left: easy task; right: hard task). Thank you!

      (10) How many mice contributed to each analyzed data set? Could the authors provide a breakdown in a table somewhere of how many neurons were recorded in each mouse and which ones were included in which analyses?

      We added an overview of the analyzed datasets in supplementary Table 7. Please note that the number of mice and neurons used in each analysis is also reported in the main text and legends. Importantly, all primary analyses were conducted using LME models, which explicitly account for hierarchical data structure and inter-mouse variability, thereby addressing potential concerns about data imbalance or bias.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Weakness #1: The authors claim to have identified drivers that label single DANs in Figure 1, but their confocal images in Figure S1 suggest that many of those drivers label additional neurons in the larval brain. It is also not clear why only some of the 57 drivers are displayed in Figure S1.

      As described in the Results section, we screened 57 GAL4 driver lines based on previous reports. These included drivers that had been shown to label a single dopaminergic neuron (DAN) or a small subset of DANs in the larval or adult brain hemisphere, suggesting potential for specific DAN labeling in larvae.

      In Figure 1, TH-GAL4 was used to cover all neurons in the DL1 cluster, while R58E02 and R30G08 are well-known drivers for pPAM. The fly strains in Figure 1h, k, l, and m were reported as single-DAN strains in larvae[1], while the strains in Figure 1e, f, and g were reported to label only a few DANs in adult brains[2,3]. We examined these strains and found that only some of them labeled a single DAN in the 3rd instar larval brain hemisphere (Figure 1f, g, h, l, and m). Among them, only the strains in Figure 1f and h labeled a single DAN in the brain hemisphere without labeling non-DANs; the other strains labeled non-DANs in addition to a single DAN (Figure 1g, l, and m). When the ventral nerve cord (VNC) is also taken into account, the strain in Figure 1h labeled neurons in the VNC (Figure S1e), while the strain in Figure 1f did not (Figure S1c).

      In summary, the driver shown in Figure 1f (R76F02AD;R55C10DBD, labeling DAN-c1) is the only line we identified that labels a single DAN in the 3rd instar larval brain hemisphere without additional labeling. The other lines shown in Figure 1 (g, h, l, m) label a single DAN but also include some non-DANs. Figure 1 focuses on strains that label a single or a pair of DANs.

      Labeling patterns for all 57 driver lines are summarized in Table 1. Figure S1 includes representative examples; full confocal images for all screened strains are available upon request, as stated in the figure legend.

      Weakness #2: Critically, R76F02-AD; R55C10-DBD labels more than one neuron per hemisphere in Figure S1c, and the authors cite Xie et al. (2018) to note that this driver labels two DANs in adult brains. Therefore, the authors cannot argue that the experiments throughout their paper using this driver exclusively target DAN-c1.

      Figure S1c shows a single dopaminergic (DA) neuron in each brain hemisphere. While additional GFP-positive signals were occasionally observed, they did not originate from the cell bodies of DA neurons, as these were not labeled by the tyrosine hydroxylase (TH) antibody. These additional GFP signals primarily appeared to be neurites, including axonal terminals, although we cannot rule out the possibility that some represent false-positive signals or weakly stained non-neuronal cell bodies. This interpretation is based on the analysis of 22 third-instar larval brains.

      To clarify this point in the manuscript, we added the following sentence to the Results section: “Based on the analysis of 22 brain samples, we observed this driver strain labels one neuron per hemisphere in the third-instar larval brain (Figure 2a–d, Figure S1c, Table S3).” Additionally, Table S3 was included to summarize the DAN-c1 labeling pattern across all 22 samples. An enlarged inset highlighting GFP-positive signals was also added to Figure S1c.

      Weakness #3: Missing from the screen of 57 drivers is the driver MB320C, which typically labels only PPL1-γ1pedc in the adult and should label DAN-c1 in the larva. If MB320C labels DAN-c1 exclusively in the larva, then the authors should repeat their key experiments with MB320C to provide more evidence for DAN-c1 involvement specifically.

      We thank the reviewer for this insightful suggestion. The MB320C driver primarily labels the PPL1-γ1pedc neuron in the adult brain, along with one or two additional weakly labeled cells. It would indeed be interesting to examine the expression pattern of this driver in third-instar larval brains. If it is found to label only DAN-c1 at this stage, we could consider using it to knock down D2R and assess whether this recapitulates our current findings.

      While we agree that this is a promising direction for future studies, we believe it is not essential for the current manuscript, given the specificity of the DAN-c1 driver (please see our response to Reviewer #3 for details). Nonetheless, we appreciate the reviewer’s suggestion, and we recognize that MB320C could be a valuable tool for future experiments.

      Weakness #4: The authors claim that the SS02160 driver used by Eschbach et al. (2020) labels other neurons in addition to DAN-c1. Could the authors use confocal imaging to show how many other neurons SS02160 labels? Given that both Eschbach et al. and Weber et al. (2023) found no evidence that DAN-c1 plays a role in larval aversive learning, it would be informative to see how SS02160 expression compares with the driver the authors use to label DAN-c1.

      We do not have our own images of DANs in brains from crosses with the SS02160 driver. However, Extended Data Figure 1 of Eschbach et al. shows four strongly labeled neurons in each brain hemisphere[4], indicating that this driver does not label DAN-c1 alone.

      Weakness #5: The claim that DAN-c1 is both necessary and sufficient in larval aversive learning should be reworded. Such a claim would logically exclude any other neuron or even the training stimuli from being involved in aversive learning (see Yoshihara and Yoshihara (2018) for a detailed discussion of the logic), which is presumably not what the authors intended because they describe the possible roles of other DANs during aversive learning in the discussion.

      We agree with the reviewer that the terms “necessary” and “sufficient” may be too exclusive and could unintentionally exclude contributions from other neurons. As noted in the Discussion section, we acknowledge that additional dopaminergic neurons may also play roles in larval aversive learning. To reflect this, we have revised our wording to use “important” and “mediates” instead of the more definitive terms “necessary” and “sufficient,” making our conclusions more accurate and appropriately measured.

      Weakness #6: Moreover, if DAN-c1 artificial activation conveyed an aversive teaching signal irrespective of the gustatory stimulus, then it should not impair aversive learning after quinine training (Figure 2k). While the authors interpret Figure 2k (and Figure 5) to indicate that artificial activation causes excessive DAN-c1 dopamine release, an alternative explanation is that artificial activation compromises aversive learning by overriding DAN-c1 activity that could be evoked by quinine.

      This is an excellent point, and we agree that we cannot rule out the possibility that artificial activation interferes with aversive learning by overriding the natural activity of DAN-c1 that would normally be evoked by quinine. The observed results with TRPA1 could potentially be attributed to dopamine depletion, inactivation due to prolonged depolarization, or neural adaptation. However, we believe that our hypothesis - that over-excitation of DAN-c1 impairs learning - is more consistent with our experimental findings and with previously published data. Our rationale is as follows:

      (1) Associative learning in larvae occurs only when the conditioned stimulus (CS, e.g., an odor such as pentyl acetate) and unconditioned stimulus (US, e.g., quinine) are paired. In wild-type larvae, the CS depolarizes a subset of Kenyon cells in the mushroom body (MB), while the US induces dopamine (DA) release from DAN-c1 into the lower peduncle (LP) compartment (Figure 7a). When both stimuli coincide, calcium influx from CS activation and Gαs signaling via D1-type dopamine receptors activate the MB-specific adenylyl cyclase, rutabaga, which functions as a coincidence detector (Figure 7d).

      (2) Rutabaga converts ATP to cAMP, activating the PKA signaling pathway and modifying synaptic strength between Kenyon cells and mushroom body output neurons (MBONs) (Figure 7d). These changes in synaptic strength underlie learned behavioral responses to future presentations of the same odor.

      (3) Our results show that D2R is expressed in DAN-c1, and that D2R knockdown impairs aversive learning. Since D2Rs typically inhibit neuronal excitability and reduce cAMP levels[5], we hypothesize that D2R acts as an autoreceptor in DAN-c1 to restrict DA release. When D2R is knocked down, this inhibition is lifted, leading to increased DA release in response to the US (quinine). The resulting excess DA, in combination with CS-induced calcium influx, would elevate cAMP levels in Kenyon cells excessively - disrupting normal learning processes (Figure 7b). This is supported by studies showing that dunce mutants, which have elevated cAMP levels, also exhibit aversive learning deficits[6].

      (4) The TRPA1 activation results are consistent with our over-excitation model. When DAN-c1 was artificially activated at 34°C in the distilled water group, this mimicked the natural activation by quinine, producing an aversive learning response toward the odor (Figure 2k or new Figure 2i, DW group). Similarly, in the sucrose group, artificial activation mimicked quinine, producing a learning response that reflected both appetitive and aversive conditioning (Figure 2k, SUC group).

      (5) Over-excitation impairs learning in the quinine group. When DAN-c1 was activated during quinine exposure, both artificial and natural activation combined to produce excessive DA release. This over-excitation likely disrupted the cAMP balance in Kenyon cells, impairing learning and resulting in failure of aversive memory formation (Figure 2k, QUI group). This phenotype closely mirrors the effect of D2R knockdown in DAN-c1.

      (6) Optogenetic activation of DAN-c1 during aversive training similarly produced elevated DA levels due to both natural and artificial stimulation. This again would result in MBN over-excitation and a corresponding learning deficit. When optogenetic activation occurred during non-training phases (resting or testing), no additional DA was released during training, and aversive learning remained intact (Figure 5b).

      (7) Notably, when optogenetic activation was applied during training, we observed no aversive learning in the distilled water group and no reduction in the sucrose group (Figure 5c, 5d). We interpret this as evidence that the optogenetic stimulation was strong enough to cause elevated DA release in both groups, impairing learning in a manner similar to D2R knockdown or TRPA1 overactivation.

      (8) We extended this over-excitation framework to directly activate Kenyon cells (MBNs). Since MBNs are involved in both appetitive and aversive learning, their over-excitation disrupted both types of learning (Figure 6), further supporting our hypothesis.

      In summary, we propose that DAN-c1 activity is tightly regulated by D2R autoreceptors to ensure appropriate levels of dopamine release during aversive learning. Disruption of this regulation - either through D2R knockdown or artificial overactivation of DAN-c1 - results in excessive DA release, over-excitation of Kenyon cells, and impaired learning. This over-excitation model is consistent with both our experimental results and prior literature.

      Weakness #7: The authors should not necessarily expect that D2R enhancer driver strains would reflect D2R endogenous expression, since it is known that TH-GAL4 does not label p(PAM) dopaminergic neurons.

      As with TH-GAL4, the D2R driver strains may only partially reflect the expression pattern of endogenous D2R in larval brains. However, when we crossed the D2R driver strains with the GFP-tagged D2R strain, we observed co-localization in DM1 and DL2b dopaminergic neurons, as well as in mushroom body neurons (Figure S3c to h). In addition, D2R knockdown with D2R-miR directly supported the conclusion that the GFP-tagged D2R strain reflects the expression pattern of endogenous D2R (Figure 4b to d; signals were reduced in DM1). In summary, we think the D2R driver strains support the expression pattern we observed with the GFP-tagged D2R strain, especially in DM1 DANs.

      Weakness #8: Their observations of GFP-tagged D2R expression could be strengthened with an anti-D2R antibody such as that used by Lam et al., (1999) or Love et al., (2023).

      Love et al. (2023) used the antibody originally described by Draper et al.[6]. We attempted to use the same antibody in our experiments; however, we were unable to detect clear signals following staining. This may be due to a lack of specificity for neurons in the Drosophila larval brain or incompatibility with our staining protocol. Unfortunately, we were unable to locate a copy of the Lam (1999) paper for further reference.

      Weakness #9: Finally, the authors could consider the possibility other DANs may also mediate aversive learning via D2R. Knockdown of D2R in DAN-g1 appears to cause a defect in aversive quinine learning compared with its genetic control (Figure S4e). It is unclear why the same genetic control has unexpectedly poor aversive quinine learning after training with propionic acid (Figure S5a). The authors could comment on why RNAi knockdown of D2R in DAN-g1 does not similarly impair aversive quinine learning (Figure S5b).

      We re-analyzed the data related to DAN-g1. Interestingly, knockdown of D2R in DAN-g1 larvae trained with quinine (QUI) showed a significant difference in response index (R.I.) compared to the distilled water (DW) control group. However, it also differed significantly from the DAN-g1 genetic control group trained with QUI (two-way ANOVA with Tukey’s multiple comparisons, p = 0.0002), while it was not significantly different from the UAS-D2R-miR genetic control group (p = 0.2724). Furthermore, knockdown of D2R in DAN-g1 did not lead to aversive learning deficits when larvae were trained with a different odorant, propionic acid (ProA; Figure S5a). Similarly, using an RNAi line to knock down D2R in DAN-g1 did not result in learning impairment when larvae were trained with pentyl acetate (PA; Figure S5b). These inconsistencies may stem from differences in stimulus intensity across odorants, as well as the variable efficiency of the knockdown strategies (microRNA vs. RNAi). Based on these results, we propose that D2Rs in DAN-g1 may modulate larval aversive learning in a quantitative manner but do not play as critical a role as those in DAN-c1, where knockdown produces a clear qualitative effect. We have added this paragraph to the Discussion section of the manuscript.

      Reviewer #2 (Public review):

      Weakness #1: It is not completely clear how the system of DAN-c1, MB neurons, and behavioral performance works. We can be quite sure that DAN-c1;Shits1 reduced dopamine release and impaired aversive memory (Figure 2h). Similarly, DAN-c1;ChR2 increased dopamine release and also impaired aversive memory (Figure 5b). However, it is not clear what is happening with DAN-c1;TrpA1 (Figure 2k). In this case, the thermo-induction appears to impair behavioral performance in all three conditions (QUI, DW, and SUC), and the behavior is quite distinct from that seen with the increase and decrease of dopamine tone (Figure 2h and 5b).

      The study successfully examined the role of D2R in DAN-c1 and MB neurons in olfactory conditioning. The conclusions are well supported by the data, with the exception of the claim that dopamine release from DAN-c1 is sufficient for aversive learning in the absence of unconditional stimulus (Figure 2K). Alternatively, the authors need to provide a better explanation of this point.

      Please refer to our response to Weakness #6 of Reviewer #1 above.

      Reviewer #3 (Public review):

      Weakness #1: It is a strength of the paper that it analyses the function of dopamine neurons (DANs) at the level of single, identified neurons, and uses tools to address specific dopamine receptors (DopRs), exploiting the unique experimental possibilities available in larval Drosophila as a model system. Indeed, the result of their screening for transgenic drivers covering single or small groups of DANs and their histological characterization provides the community with a very valuable resource. In particular the transgenic driver to cover the DANc1 neuron might turn out useful. However, I wonder in which fraction of the preparations an expression pattern as in Figure 1f/ S1c is observed, and how many preparations the authors have analyzed. Also, given the function of DANs throughout the body, in addition to the expression pattern in the mushroom body region (Figure 1f) and in the central nervous system (Figure S1c) maybe attempts can be made to assess expression from this driver throughout the larval body (same for Dop2R distribution).

      We thank the reviewer for the positive comments and thoughtful suggestions.

      Regarding the R76F02AD; R55C10DBD strain, we examined 22 third instar larval brains expressing GFP, Syt-GFP, or Den-mCherry. All brains clearly labeled DAN-c1. In approximately half of the samples, only DAN-c1 was labeled. In the remaining samples, 1 to 5 additional weakly labeled soma were observed, typically without associated neurites. Only 1 or 2 strongly labeled non-DAN-c1 cells were occasionally detected. These additional labeled neurons were rarely dopaminergic. In the ventral nerve cord (VNC), 8 out of 12 samples showed no labeled cells. The remaining 4 samples had 2–4 strongly labeled cells. These results support our conclusion that the R76F02AD; R55C10DBD combination predominantly and specifically labels DAN-c1 in the third instar larval brain. As for the reviewer’s question about the expression pattern of R76F02AD; R55C10DBD and D2R in the larval body, we agree that this is a very interesting avenue for further investigation. However, our current study is focused on the central nervous system and larval learning behaviors. We hope to explore this question more fully in future work.

      We added the following sentence to the Results section: “Based on analysis of 22 brain samples, we believe this driver strain consistently labels one neuron per hemisphere in the third-instar larval brain (Figure 2a - d, Figure S1c, Table S3).” In addition, we included Table S3 to summarize the DAN-c1 labeling patterns observed across these samples.

      Weakness #2: A first major weakness is that the main conclusion of the paper, which pertains to associative memory (last sentence of the abstract, and throughout the manuscript), is not justified by their evidence. Why so? Consider the paradigm in Figure 2g, and the data in Figure 2h (22 degrees, the control condition), where the assay and the experimental rationale used throughout the manuscript are introduced. Different groups of larvae are exposed, for 30min, to an odour paired with either i) quinine solution (red bar), ii) distilled water (yellow bar), or iii) sucrose solution (blue bar); in all cases this is followed by a choice test for the odour on one side and a distilled-water blank on the other side of a testing Petri dish. The authors observe that odour preference is low after odour-quinine pairing, intermediate after odour-water pairing and high after odour-sucrose pairing. The differences in odour preference relative to the odour-water case are interpreted as reflecting odour-quinine aversive associations and odour-sucrose appetitive associations, respectively. However, these differences could just as well reflect non-associative effects of the 30-min quinine or sucrose exposure per se (for a classical discussion of such types of issues see Rescorla 1988, Annu Rev Neurosci, or regarding Drosophila Tully 1988, Behav Genetics, or with some reference to the original paper by Honjo & Furukubo-Tokunaga 2005, J Neurosci that the authors reference, also Gerber & Stocker 2007, Chem Sens).

      As it stands, therefore, the current 3-group type of comparison does not allow conclusions about associative learning.

      We adopted the single-odor larval learning paradigm from Honjo et al., who first developed and validated this method for studying larval olfactory associative learning[7,8]. To address the reviewer’s concern regarding potential non-associative effects from 30-minute exposure to quinine or sucrose, we refer to multiple lines of evidence provided in Honjo’s studies: (1) Honjo et al. demonstrated that only larvae receiving paired presentations of odor and unconditioned stimulus (quinine or sucrose) exhibited learned responses. Exposure to either stimulus alone, or temporally dissociated presentations, failed to induce any learning response. (2) When tested with a second, non-trained odorant, larvae only responded to the odorant previously paired with the unconditioned stimulus. This rules out generalized olfactory suppression and confirms odor-specific associative learning. (3) Well-characterized learning mutants (e.g., rutabaga, dunce) that show deficits in adult reciprocal odor learning also failed to exhibit learned responses in this single-odor paradigm, further supporting its validity. (4) In our study, we used two distinct odorants (pentyl acetate and propionic acid) and two independent D2R knockdown approaches (UAS-miR and UAS-RNAi). We consistently observed that D2R knockdown in DAN-c1 impaired aversive learning. Importantly, naïve olfactory, gustatory, and locomotor assays ruled out general sensory or motor defects. Comparisons with control groups (odor paired with distilled water) also ruled out non-associative effects such as habituation. Taken together, these results strongly support that the single-odor paradigm is a robust and reliable assay for assessing larval olfactory associative learning in Drosophila. We have added a section in the Discussion to clarify and defend the use of this paradigm in our study.

      Weakness #3: A second major weakness is apparent when considering the sketch in Figure 2g and the equation defining the response index (R.I.) (line 480). The point is that the larvae that are located in the middle zone are not included in the denominator. This can inflate scores and is not appropriate. That is, suppose from a group of 30 animals (line 471) only 1 chooses the odor side and 29, bedazzled after 30-min quinine or sucrose exposure or otherwise confused by a given opto- or thermogenetic treatment, stay in the middle zone... a P.I. of 1.0 would result.

      We allowed 5 min during the testing stage for the larvae to wander on the testing plate. Under most conditions, more than half of the larvae (>50%) explore the plate, and the rest may stay in the middle zone (these are not counted). We used 25-50 larvae in each learning assay, so around 10-30 larvae ended up in the two semicircular areas. Indeed, based on our raw data, an R.I. of 1 seldom appears. Most R.I. values fall within the range from -0.2 to 0.8. We admit that the R.I. equation is not linear, so it changes steeply as it approaches -1 and 1. However, as most values fall between -0.2 and 0.8, we think ‘border effects’ can be neglected when enough larvae (10-30) are included in the calculation.
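
      The point about the denominator can be made concrete with a short Python sketch. It assumes the index takes the common form R.I. = (N_odor - N_control) / (N_odor + N_control), with larvae in the middle zone left out of both terms. The manuscript's exact equation is not reproduced here, and the function name and counts are illustrative.

```python
def response_index(n_odor, n_control):
    """Assumed form of the response index; middle-zone larvae are not counted.

    n_odor: larvae in the semicircle on the odor side.
    n_control: larvae in the semicircle on the control side.
    """
    counted = n_odor + n_control
    if counted == 0:
        raise ValueError("no larvae left the middle zone")
    return (n_odor - n_control) / counted
```

      In the reviewer's example (1 larva on the odor side, 29 in the middle zone), `response_index(1, 0)` gives 1.0. Under the same assumption, one larva switching sides moves the index by 0.2 when 10 larvae are counted and by about 0.07 when 30 are counted.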

      Weakness #4: Unless experimentally demonstrated, claims that the thermogenetic effector shibire/ts reduces dopamine release from DANs are questionable. This is because firstly, there might be shibire/ts-insensitive ways of dopamine release, and secondly because shibire/ts may affect co-transmitter release from DANs.

      The Shibire<sup>ts1</sup> gene encodes a thermosensitive mutant of dynamin. Expressing this mutant in target neurons blocks neurotransmitter release at ambient temperatures above 30°C because it represses vesicle recycling[7]. It is a widely used tool for examining whether a target neuron is involved in a specific physiological function. We cannot rule out the existence of Shibire<sup>ts1</sup>-insensitive routes of dopamine release. However, blocking dopamine release from DAN-c1 with Shibire<sup>ts1</sup> already changed the learning responses (Figure 2h). This result indicates that dopamine release from DAN-c1 during training is important for larval aversive learning, which supports our hypothesis.

      The second question, about potential co-transmitter release, is a great one. Recently, Yamazaki et al. reported that co-transmitters in the dopaminergic system modulate adult olfactory memories in Drosophila[9], and we cannot rule out roles for co-released neurotransmitters or neuropeptides in larval learning. Ideally, observing real-time changes in dopamine release from DAN-c1 in wild-type and TH-knockdown larvae would answer this question. However, live imaging of dopamine release from a single dopaminergic neuron is not practical for us at this time. On the other hand, the roles of dopamine receptors in olfactory associative learning support the importance of dopamine for Drosophila learning. The D1 receptor, dDA1, has been shown to be involved in both appetitive and aversive learning in adults and larvae[10,11]. In our work, D2R in the mushroom body played important roles in both larval appetitive and aversive learning (Figure 6a). Together, this evidence points to the importance of dopamine in Drosophila olfactory associative learning. In addition, too much remains unknown about the co-released neurotransmitters and neuropeptides, as well as their potentially complex interactions and crosstalk. We believe that investigating co-released neurotransmitters and neuropeptides is beyond the scope of this study.

      Weakness #5: It is not clear whether the genetic controls when using the Gal4/ UAS system are the homozygous, parental strains (XY-Gal4/ XY-Gal4 and UAS-effector/ UAS-effector), or as is standard in the field the heterozygous driver (XY-Gal4/ wildtype) and effector controls (UAS-effector/ wildtype) (in some cases effector controls appear to be missing, e.g. Figure 4d, Figure S4e, Figure S5c).

      Almost all controls we used were homozygous parental strains. They did not show abnormal behavior in the learning assays or in the naïve sensory or locomotion assays. The only exception is the control for DAN-c1: larvae from the homozygous R76F02AD; R55C10DBD strain showed much reduced locomotion speed (Figure S6). To prevent this reduced locomotion speed from affecting learning performance, we used heterozygous R76F02AD; R55C10DBD/wildtype larvae as the control, which showed normal learning, naïve sensory, and locomotion abilities (Figure 4e to i).

      Figure 4d is a column graph quantifying the efficiency of D2R knockdown with miR. Because we needed to induce and quantify the knockdown effect in specific DANs (DM1), only TH-GAL4, rather than UAS-D2R-miR, could be used as the control group. The control groups missing from Figure S4e and S5c are shown in other figures (Figure 4e).

      We described this in the Materials and Methods section: “All control strains used in learning assays were homozygous (except DAN-c1×WT), while all experimental groups (D2R knockdown and thermogenetics) used were heterozygous by crossing the corresponding control strains”.

      We also reorganized Figures S4e and S5c together with their control groups to make them easier to understand.

      Weakness #6: As recently suggested by Yamada et al 2024, bioRxiv, high cAMP can lead to synaptic depression (sic). That would call into question the interpretation of low-Dop2R leading to high-cAMP, leading to high-dopamine release, and thus the authors' interpretation of the matching effects of low-Dop2R and driving DANs.

      We appreciate the reviewer’s suggestion. We read this preprint, which also addresses the question we raised in the Discussion section about the discrepancy between cAMP elevation in mushroom body neurons and the reduced MBN-MBON synaptic plasticity after olfactory associative learning in Drosophila. The authors offer an explanation for the existing D1R-cAMP elevation-MBN-MBON LTD axis, which is helpful for understanding the learning mechanism. Unfortunately, however, we do not think it offers a possible explanation for our D2R-related mechanisms. We have added this reference to our citations.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Throughout the behavioral experiments, a defect in aversive learning is defined as a relative increase in the response index (RI) after olfactory training with quinine (red) and a defect in appetitive learning as a relative decrease in RI after training with sucrose (blue). Training with distilled water (yellow) is intended to be a control for comparisons within genotypes/treatment groups but causes interpretation issues if it is also affected by experimental manipulations.

      The authors typically make comparisons between quinine, water, and sucrose within each group, but this often forces readers to infer the key comparisons of interest. For example, the key comparison in Figure 2h is the statistically significant difference between the red groups, which differ only in the temperature used during training. Many other figure panels in the paper would also benefit from more direct statistical comparisons, particularly Figure 2k.

      While I recognize the value of the water control, I strongly recommend that the authors make statistical comparisons directly between genotypes/treatment groups where possible and to interpret results with more caution when the water RI score differs substantially between groups. Also, since the authors are conducting two-way ANOVAs before Dunnett's multiple comparisons tests, they ideally should report the p-value for the main effect of each factor, plus the interaction p-value between the two factors before making multiple comparisons.

      We appreciate the reviewer’s suggestion. In response, we re-analyzed all learning assay data in Figures 2 and 4 using two-way ANOVA followed by Tukey’s multiple comparisons test. Unlike our previous analysis, which only compared each experimental group to its corresponding DW control, we now compared all groups against one another. First, we found that most R.I. values from different temperature conditions (Figure 2) or genotypes (Figure 4) trained with DW were not significantly different, with the exception of the data in Figure 2i (formerly Figure 2k; discussed further below). The R.I. from DAN-c1 × D2R-miR larvae trained with QUI was significantly different from both genotype control groups (DAN-c1 × WT and UAS-D2R-miR), while no significant difference was observed between the two controls trained with QUI. Thus, this more comprehensive statistical approach supports the conclusions we previously reported. Second, as the reviewer noted, the new analysis allows for a more direct interpretation of our findings. For example, in the thermogenetic experiments using the Shibire<sup>ts1</sup> strain, the R.I. of DAN-c1 × UAS-Shibire<sup>ts1</sup> larvae trained with QUI at 34°C was not significantly different from the DW group at 34°C, but was significantly different from the QUI group at 22°C. Both findings support our conclusion that blocking dopamine release from DAN-c1 impairs larval aversive learning (Figure 2f).

      In the dTRPA1 activation experiments, the R.I. of DAN-c1 × UAS-dTRPA1 larvae trained with DW at 34°C was significantly lower than that of the DW group at 22°C and the QUI group at 34°C, but not significantly different from the QUI group at 22°C (Figure 2i). These results indicate that activating DAN-c1 during training is sufficient to drive aversive learning even in the absence of QUI. Interestingly, when DAN-c1 × UAS-dTRPA1 larvae were trained with QUI at 34°C, their R.I. was significantly higher than that of the DW group at 34°C and significantly different from the QUI group at 22°C, but not significantly different from the DW group at 22°C (Figure 2i). We interpret this as evidence that simultaneous activation of DAN-c1 by both QUI and dTRPA1 leads to over-excitation, which in turn impairs aversive learning.

      We have revised the figures (Figures 2, 4, 5, and 6) and updated the corresponding Results sections to reflect this new statistical analysis. Additionally, we now report the p-values for interaction, row factor, and column factor - either in Table S4 (for Figure 2) or in the figure captions for Figures 4, 5, 6, S4, S5, and S7.
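
      As an illustration of the three p-values mentioned above (row factor, column factor, and interaction), the following is a minimal sketch of the sum-of-squares partition behind a balanced two-way ANOVA. It is not the analysis pipeline used for the manuscript: the function name `two_way_anova_balanced` and the numbers in `toy_cells` are hypothetical, invented only for illustration. The sketch stops at the F statistics; p-values would come from the F distribution in a statistics package, and Tukey's post hoc test is not implemented here.

```python
from statistics import mean


def two_way_anova_balanced(cells):
    """Partition variance for a balanced, fully crossed two-factor design.

    `cells` maps (level_a, level_b) -> list of replicate values.
    Returns, per effect, the sum of squares, degrees of freedom and F statistic.
    """
    levels_a = sorted({a for a, _ in cells})
    levels_b = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))  # replicates per cell
    if len(cells) != len(levels_a) * len(levels_b) or any(
        len(values) != n for values in cells.values()
    ):
        raise ValueError("design must be fully crossed and balanced")
    if n < 2:
        raise ValueError("need at least two replicates per cell")

    grand = mean(x for values in cells.values() for x in values)
    mean_a = {a: mean(x for b in levels_b for x in cells[(a, b)]) for a in levels_a}
    mean_b = {b: mean(x for a in levels_a for x in cells[(a, b)]) for b in levels_b}
    mean_ab = {key: mean(values) for key, values in cells.items()}

    # Main effects: squared deviation of each marginal mean from the grand mean.
    ss_a = n * len(levels_b) * sum((mean_a[a] - grand) ** 2 for a in levels_a)
    ss_b = n * len(levels_a) * sum((mean_b[b] - grand) ** 2 for b in levels_b)
    # Interaction: what is left of each cell mean after removing both main effects.
    ss_ab = n * sum(
        (mean_ab[(a, b)] - mean_a[a] - mean_b[b] + grand) ** 2
        for a in levels_a
        for b in levels_b
    )
    # Error: spread of replicates around their own cell mean.
    ss_err = sum((x - mean_ab[key]) ** 2 for key, values in cells.items() for x in values)

    df_a = len(levels_a) - 1
    df_b = len(levels_b) - 1
    df_ab = df_a * df_b
    df_err = len(levels_a) * len(levels_b) * (n - 1)
    ms_err = ss_err / df_err  # assumes some within-cell variability (ss_err > 0)

    return {
        "factor_a": {"ss": ss_a, "df": df_a, "F": (ss_a / df_a) / ms_err},
        "factor_b": {"ss": ss_b, "df": df_b, "F": (ss_b / df_b) / ms_err},
        "interaction": {"ss": ss_ab, "df": df_ab, "F": (ss_ab / df_ab) / ms_err},
        "error": {"ss": ss_err, "df": df_err, "F": None},
    }


# Invented response-index values (NOT data from the study):
# factor A = genotype, factor B = training solution.
toy_cells = {
    ("control", "DW"): [0.0, 0.1, -0.1],
    ("control", "QUI"): [-0.4, -0.5, -0.3],
    ("knockdown", "DW"): [0.0, 0.1, -0.1],
    ("knockdown", "QUI"): [-0.1, 0.0, -0.2],
}

for effect, stats in two_way_anova_balanced(toy_cells).items():
    print(effect, stats)
```

      In this toy layout, a genotype-specific learning deficit would show up mainly in the interaction term, which is why the interaction is worth reporting alongside the two main effects before any multiple comparisons.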

      (2) The authors' motivation to find tools that label DANs other than DAN-c1 was unclear until much later in the paper when I saw the screening experiments in Figures S4 and S5. The authors could provide a clearer justification for why they focus on DAN-c1 in Figure 2 rather than another DAN for which they found a specific driver in Figure 1. The motivation for looking at individual pPAM neurons was also unclear.

      We sincerely appreciate the reviewer’s thoughtful suggestion. Our study was initially motivated by the goal of characterizing the expression pattern of D2R in the larval brain. From there, we aimed to identify DAN drivers that label specific pairs of dopaminergic neurons, enabling us to assess the functional role of D2R in distinct DAN subtypes through targeted knockdown experiments. This approach ultimately led us to focus on DAN-c1, as it was the only neuronal population for which D2R knockdown resulted in a learning deficit. We then returned to examine the functional significance of DAN-c1 in aversive learning. While we recognize that a more comprehensive narrative might be desirable, the current structure of our manuscript reflects the most logical progression of our work based on our research priorities and experimental outcomes. We did explore alternative manuscript structures - such as beginning with the D2R expression pattern - but found that the current format best conveys our findings and rationale.

      Regarding our motivation to study individual PAM neurons: we aimed to identify whether D2R plays a role in a specific pair of pPAM neurons involved in larval appetitive learning. However, we were unable to find a driver that exclusively labels DAN-j1, which we believe to be the key neuron in this context (see Figure 1). As a result, our investigation into appetitive learning did not progress beyond the observation of D2R expression in pPAM neurons (Figure 3d), and we did not proceed with learning assays in this context. While we acknowledge the limitations of our study, we believe that our focus on DAN-c1 is well-justified based on both our findings and the tools currently available. We respectfully note that a major restructuring of the manuscript would not necessarily clarify the rationale for focusing on DAN-c1, and therefore we have maintained the current organization.

      (3) The authors should also double-check and update the expression patterns of the drivers in Table 1 using references such as the FlyLight online resource. For example, MB438B labels PPL1-α'2α2, PPL1-α3, PPL1-γ1pedc according to FlyLight, not just PPL1-γ1pedc as initially reported by Aso and Hattori et al. (2014).

      We appreciate the reviewer’s suggestion. We have double-checked and updated the driver expression patterns in Table 1, using FlyLight data as a reference.

      (4) Interpreting overlaid green-and-red fluorescence confocal images would be difficult for any colorblind readers; I suggest that the authors consider using a more friendly color set.

      We thank the reviewer for the suggestion. In our study, we need three distinct colors to represent different channels. We also tested an alternative color scheme using cyan, magenta, and yellow (CMY) instead of the standard red, green, and blue (RGB). As a comparison (see below), we used a R76F02AD;R55C10DBD (DAN-c1) GFP-labeled brain as an example. In our evaluation, the RGB combination provided clearer visualization and appeared more natural, while the CMY scheme looked somewhat artificial. Therefore, we decided to retain the original RGB color scheme and did not modify the colors in the figures.

      Author response image 1.

      (5) For Figure 4d, counting each DAN as an individual N would violate the assumption of independence made by the unpaired t test, since multiple DANs are found in each brain and therefore are not independent. Instead, it would be better to count each individual N as the average intensity of the four DANs measured in each brain.

      We revised the analysis of microRNA efficiency by averaging the fluorescence intensity of DANs within each brain, treating each brain as a single sample. Based on this approach, we re-plotted Figure 4d.
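
      The averaging step can be sketched as follows. This is a minimal illustration only: the helper name `per_brain_means`, the brain identifiers, and the intensity values are hypothetical and are not data from the study. Each brain contributes one value (the mean of its measured DANs), so the samples entering the t test are independent.

```python
from collections import defaultdict
from statistics import mean


def per_brain_means(measurements):
    """Collapse per-neuron intensities to one mean value per brain.

    `measurements` is an iterable of (brain_id, intensity) pairs,
    one pair per measured DAN.
    """
    by_brain = defaultdict(list)
    for brain_id, intensity in measurements:
        by_brain[brain_id].append(intensity)
    return {brain_id: mean(values) for brain_id, values in by_brain.items()}


# Invented fluorescence intensities (arbitrary units), four DANs per brain.
control = [
    ("ctrl_1", 100.0), ("ctrl_1", 110.0), ("ctrl_1", 90.0), ("ctrl_1", 100.0),
    ("ctrl_2", 120.0), ("ctrl_2", 115.0), ("ctrl_2", 125.0), ("ctrl_2", 120.0),
]
knockdown = [
    ("kd_1", 60.0), ("kd_1", 70.0), ("kd_1", 65.0), ("kd_1", 65.0),
    ("kd_2", 50.0), ("kd_2", 55.0), ("kd_2", 45.0), ("kd_2", 50.0),
]

control_n = per_brain_means(control)      # N = number of brains, not number of DANs
knockdown_n = per_brain_means(knockdown)
print(len(control_n), sorted(control_n.values()))
print(len(knockdown_n), sorted(knockdown_n.values()))
```

      With this grouping, eight neuron-level measurements per genotype become N = 2 brain-level samples, which is the unit the unpaired t test assumes to be independent.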

      (6) Finally, the authors ought to make it clearer throughout the paper that they have implicated a pair of DAN-c1 neurons in aversive learning, not just a single DAN as currently stated in the title.

      We thank the reviewer for the suggestion regarding our wording. We have changed all instances of “single neuron” to “a pair of neurons”.

      Reviewer #2 (Recommendations for the authors):

      (1) The results section presents: "Activation of DAN-c1 with dTRPA1 at 34°C during training induced repulsion to PA in the distilled water group (Figure 2k). These data suggested that DAN-c1 excitation and presumably increased dopamine release is sufficient for larval aversive learning in the absence of gustatory pairing." An alternative interpretation is that 30 min of TrpA activation depletes the synaptic vesicle pool, or inactivates neurons because of prolonged depolarization, or that the DAN shows firing rate adaptation (e.g. see Pulver et al. 2009; doi:10.1152/jn.00071.2009). In such a case DA release would be reduced and not increased. Therefore, the interpretation that DAN-c1 activation is both necessary and sufficient in larval aversive learning is difficult to sustain.

      In this regard, it is important to know how sensorimotor abilities are affected during thermo-induction at 34°C for 30 min.

      We thank the reviewer for the thoughtful suggestion. Regarding the concern about potential dopamine depletion or neuronal inactivation, we believe a comparison with the Shibire<sup>ts1</sup> experiments helps clarify the interpretation. Activation of Shibire<sup>ts1</sup> during training with distilled water did not result in aversive learning (Figure 2f), which is a distinct phenotype from that observed with dTRPA1 activation (Figure 2i). This suggests that the phenotypes seen with dTRPA1 activation are not due to reduced dopamine release. Additionally, as the reviewer suggested, we have revised our conclusion to state that “DAN-c1 is important for larval aversive learning,” rather than claiming it is both necessary and sufficient.

      (2) The GRASP system can label the contact of a cell in close proximity like synaptic contacts, but also other situations like no synaptic contact. It would be useful to use a more specific synaptic labelling tool, like the trans-synaptic tracing system (Talay et al., 2017 https://doi.org/10.1016/j.neuron.2017.10.011), which provides a better label of synaptic contact.

      We really appreciate the reviewer’s suggestion. First, we acknowledge that there are several general methods to reveal synaptic connections between neurons: immunohistochemistry (IHC), neuron labeling, viral tracing, GRASP, and electron microscopy (EM). Among these, IHC is not sufficiently convincing, viral tracing is challenging and rarely used in Drosophila, and EM, while the most accurate, is prohibitively expensive for our current goals. For these reasons, we chose the GRASP system to demonstrate the synaptic connections from dopaminergic neurons to the mushroom body. Second, we utilized an activity-dependent version of the GRASP system, linking split-GFP1-10 with synaptic proteins (e.g., synaptobrevin)[12] rather than with cell surface proteins like CD4 or CD8. This version significantly reduces false positive signals compared to the previous version, which was tagged with cell surface proteins. While we admit that this method does not provide as solid evidence of synaptic connections as EM, it is the most efficient method available to us for showing the synaptic connections from dopaminergic neurons to the mushroom body. Finally, we thank the reviewer for suggesting the literature on trans-synaptic tracing methods. Unfortunately, this method is not suitable for our goal, as it labels the entire postsynaptic neuron. In our study, we use GRASP to identify the specific dopaminergic neurons based on the synaptic locations and compartments within the mushroom body lobe. We require a labeling system at the subcellular level because, as noted, DAN-c1 forms synapses specifically in the lower peduncle (LP) of the mushroom body lobe, which is part of the axonal bundles from mushroom body neurons. Using the trans-synaptic tracing method would label the entire mushroom body, making it impossible to distinguish DAN-c1 from other DL1 dopaminergic neurons.

      (3) Previously, Honjo et al (2009) used a petri dish of 8.5 cm and a filter paper for reinforcement of 5.5 cm. In this study the petri dish was 10 cm and the size of the filter paper was not reported. That is important information because it will determine the probability of conditioning.

      A piece of filter paper (0.25cm<sup>2</sup> square) was used to hold odorants in this study. We have added this information to the Materials and Methods.

      (4) Statistical analysis of behavioral performance in Fig 2H-I was made by ANOVA followed by Dunnett's multiple comparisons test. Which was the control group? In each graph, were 2 independent Dunnett tests performed against the DW control group?

      We have re-analyzed the data using a two-way ANOVA followed by Tukey’s multiple comparison test, as suggested by Reviewer #1. In Figure 2f-j (previously Figure 2h-l), the DW groups serve as the control groups. In our new analysis, we compared data across all groups using Tukey’s multiple comparison test, with particular focus on comparisons to the corresponding DW control groups.

      (5) The sample sizes in the staining experiments of Figures 1-4 were not reported.

      We have added Table S2 in the supplementary materials to provide the N numbers for brain samples used in the figures.

      (6) Color code in Fig 5 is missing, I assumed that is the same as in figure 4e

      We added the color code to the legend of Figure 5.

      (7) Line 506 "0.1% QH solutions" should be 0.1% QUI solutions

      Changed.

      (8) There is no information on the availability of data

      We added Data Availability Statement: Data will be made available on request.

      Reviewer #3 (Recommendations for the authors):

      (1) Axes of behavioural experiments should better show the full span of possible values (-1;1) to allow a fair assessment.

      We have adjusted the axes in all learning assay graphs to a range from -1 to 1 for consistency and clarity.

      (2) Ns should better be given within the figures.

      We have added Table S2 in the supplementary materials to provide the N numbers for brain samples used in the figures. Additionally, Tables S4 to S6 include the N numbers for the learning assays. While we initially considered including the N numbers within the figure captions, we found it challenging to present this information clearly and efficiently. Therefore, we decided to summarize the N numbers in the tables instead.

      (3) Dot- or box-plots would be better for visualizing the data than means and SEMs.

      We agree with the reviewer’s suggestion. In the behavioral assay graphs, both dot plots and mean ± SEM have been included for better visualization of the data.

      (4) The paper reads as if Dop2R would reduce neuronal activity, rather than "just" cAMP levels. Such a misunderstanding should be avoided.

      We appreciate the reviewer’s comment. Under most conditions, dopamine binding to D2Rs activates the Gαi/o pathway, which inhibits adenylyl cyclase (AC) and reduces cAMP levels. This reduction in cAMP ultimately leads to decreased neuronal activity. In other words, D2R activation typically has an inhibitory effect on neurons. Additionally, D2R can exert inhibitory effects through other signaling pathways, such as the inhibition of voltage-gated calcium channels via Gβγ subunits[5]. As noted in the Introduction, dopamine receptors are also involved in other signaling cascades, including PKC, MAPK, and CaMKII pathways. In the context of our study, based on the current understanding of molecular signaling in Drosophila olfactory associative learning, we continue to emphasize the D2R-mediated AC-cAMP-PKA signaling pathway as the most important one. However, we cannot rule out the involvement of other signaling pathways.

      (5) It would be better if citations were more clearly separated into ones that refer to adult flies versus work on larvae.

      We separated the citations related to adult flies from those working on larvae.

      (6) Line 81-83. DopECR is not found in mammals, is it?

      You are correct. DopECR is not found in mammals. This non-canonical receptor shares structural homology with vertebrate β-adrenergic-like receptors. It can be activated rapidly by dopamine as well as insect ecdysteroids[13,14].

      (7) Line 99: Better "a" learning center (some forms of learning work without mushroom bodies).

      We have revised the text from "the learning center" to "a learning center," as suggested by the reviewer.

      (8) Supplemental figures should be numbered according to the sequence in which they are mentioned in the text.

      We have rearranged the sequence of supplemental figures to match the order in which they are referenced in the text.

      (9) It is striking that dTRPA1-driving DANc1 is punishing in the water condition but that this effect does not summate with quinine punishment (but rather seems to impair it). Maybe you can back this up by ChR- or Chrimson-driving DANc1? Or by silencing DANc1 by GtACR1?

      We appreciate the reviewer’s suggestion. Indeed, we observed similar but not identical results when we used ChR2 to activate DAN-c1 during the training stage (Figure 5b and c). We found that activating DAN-c1 with quinine (QUI) impaired aversive learning (Figure 5b), consistent with our findings using dTRPA1 activation of DAN-c1 when trained in QUI at 34°C (Figure 2i). We propose that the over-excitation of DAN-c1, whether induced by QUI or artificial manipulation (optogenetics and thermogenetics), impairs aversive learning, which aligns with our findings for D2R knockdown (Figure 4e). However, there are some differences between dTRPA1 and ChR2 activation. While dTRPA1 activation induced aversive learning when trained with distilled water (DW) at 34°C (Figure 2i), ChR2 did not induce aversive learning under the same conditions (Figure 5c). We believe this difference is due to the varying activation levels between the two manipulations. Our optogenetic stimulus may have been stronger than the thermogenetic one, potentially leading to over-excitation in the DW group, preventing aversive learning. In the QUI group, the more severe over-excitation impaired aversive learning, producing a phenotype similar to that observed with other over-excitation methods (e.g., thermogenetics or D2R knockdown), where the phenotype reached a maximum level. We have also addressed these points in the Discussion section.

      (10) Unless I got the experimental procedure wrong, isn't it surprising that Figure S7b does not uncover a punishing effect of driving TH-GAL4 neurons?

      This optogenetic experiment with ChR2 expression in TH-GAL4 neurons was an initial attempt to activate DAN-c1 using ChR2. As explained in response to question (9), the failure to observe a punishing effect in the DW group when TH-GAL4 neurons were activated during training may be due to our optogenetic stimulus being too strong. This likely resulted in over-excitation of DAN-c1 (among the neurons labeled by TH-GAL4), impairing aversive learning and preventing the appearance of typical aversive behaviors.

      (11) It seems that Figure 1f′ is repeated, in a mirrored manner, in Figure 2e.

      We have removed Figure 2e, as it was deemed redundant and not necessary for this section.

      References

      (1) Saumweber, T. et al. Functional architecture of reward learning in mushroom body extrinsic neurons of larval Drosophila. Nat Commun 9, 1104 (2018). https://doi.org/10.1038/s41467-018-03130-1

      (2) Aso, Y. & Rubin, G. M. Dopaminergic neurons write and update memories with cell-type-specific rules. Elife 5 (2016). https://doi.org/10.7554/eLife.16135

      (3) Xie, T. et al. A Genetic Toolkit for Dissecting Dopamine Circuit Function in Drosophila. Cell Rep 23, 652-665 (2018). https://doi.org/10.1016/j.celrep.2018.03.068

      (4) Eschbach, C. et al. Recurrent architecture for adaptive regulation of learning in the insect brain. Nat Neurosci 23, 544-555 (2020). https://doi.org/10.1038/s41593-020-0607-9

      (5) Neve, K. A., Seamans, J. K. & Trantham-Davidson, H. Dopamine receptor signaling. J Recept Signal Transduct Res 24, 165-205 (2004). https://doi.org/10.1081/rrs-200029981

      (6) Draper, I., Kurshan, P. T., McBride, E., Jackson, F. R. & Kopin, A. S. Locomotor activity is regulated by D2-like receptors in Drosophila: an anatomic and functional analysis. Dev Neurobiol 67, 378-393 (2007). https://doi.org/10.1002/dneu.20355

      (7) Honjo, K. & Furukubo-Tokunaga, K. Induction of cAMP response element-binding protein-dependent medium-term memory by appetitive gustatory reinforcement in Drosophila larvae. J Neurosci 25, 7905-7913 (2005). https://doi.org/10.1523/JNEUROSCI.2135-05.2005

      (8) Honjo, K. & Furukubo-Tokunaga, K. Distinctive neuronal networks and biochemical pathways for appetitive and aversive memory in Drosophila larvae. J Neurosci 29, 852-862 (2009). https://doi.org/10.1523/JNEUROSCI.1315-08.2009

      (9) Yamazaki, D., Maeyama, Y. & Tabata, T. Combinatory Actions of Co-transmitters in Dopaminergic Systems Modulate Drosophila Olfactory Memories. J Neurosci 43, 8294-8305 (2023). https://doi.org/10.1523/jneurosci.2152-22.2023

      (10) Selcho, M., Pauls, D., Han, K. A., Stocker, R. F. & Thum, A. S. The role of dopamine in Drosophila larval classical olfactory conditioning. PLoS One 4, e5897 (2009). https://doi.org/10.1371/journal.pone.0005897

      (11) Kim, Y. C., Lee, H. G. & Han, K. A. D1 dopamine receptor dDA1 is required in the mushroom body neurons for aversive and appetitive learning in Drosophila. J Neurosci 27, 7640-7647 (2007). https://doi.org/10.1523/JNEUROSCI.1167-07.2007

      (12) Macpherson, L. J. et al. Dynamic labelling of neural connections in multiple colours by trans-synaptic fluorescence complementation. Nat Commun 6, 10024 (2015). https://doi.org/10.1038/ncomms10024

      (13) Abrieux, A., Duportets, L., Debernard, S., Gadenne, C. & Anton, S. The GPCR membrane receptor, DopEcR, mediates the actions of both dopamine and ecdysone to control sex pheromone perception in an insect. Front Behav Neurosci 8, 312 (2014). https://doi.org/10.3389/fnbeh.2014.00312

      (14) Lark, A., Kitamoto, T. & Martin, J. R. Modulation of neuronal activity in the Drosophila mushroom body by DopEcR, a unique dual receptor for ecdysone and dopamine. Biochim Biophys Acta Mol Cell Res 1864, 1578-1588 (2017). https://doi.org/10.1016/j.bbamcr.2017.05.015